[CH1.0] Intermittent “Couldn’t Resolve Address” Error When Resolving Against VPC Internal DNS Server

SYMPTOM

Scenario: an application is deployed to CH1.0 and its DNS queries are sent to the VPC internal DNS server.

- The CH application sends a DNS query to the configured internal DNS server.
- The query traverses one of the private connections (e.g. VPC peering, Anypoint VPN, Direct Connect, TGW) to enter the customer's network.
- The DNS query arrives on the customer's network and reaches the internal DNS server.

The CH application intermittently returns errors such as the following:

HostName 'example.com' could not be resolved (example.com: Temporary failure in name resolution)

Message               : HTTP GET on resource 'https://example.com:443/v1/hello/world' failed: Couldn't resolve address.

CAUSE

If the issue is intermittent, it may be caused by (but not limited to) the following:
1. Loss of network connectivity between the CH application and the target internal DNS server. For example, the Anypoint VPN connection is down.
2. The target DNS server returns an empty or incorrect response, which the CH application then negatively caches. Subsequent queries fail until the negative cache entry expires.

SOLUTION

1. Confirm whether the timing of these errors coincides with any network outage. If yes, resolve the intermittent network outages and confirm whether the DNS errors still occur.
2. If no network outages have been identified, confirm whether the DNS server was still responding at the time of the errors.
a) If no, troubleshoot why the DNS server did not respond.
b) If yes, confirm whether the DNS response contains the expected information.
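
The checks in step 2 can be sketched as a small probe that sends one A-record query straight to the DNS server and reports whether anything came back and what the response code was. This is only an illustrative sketch using the Python standard library: the function names are hypothetical, the server IP and hostname passed to `probe` are placeholders, and it has to be run from a host that can actually reach the internal DNS server (it is not something that runs inside the CloudHub worker).

```python
import socket
import struct

# Common DNS response codes (RFC 1035).
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def classify_response(response: bytes) -> str:
    """Summarise a raw DNS response by its RCODE and answer count."""
    if len(response) < 12:
        return "malformed response (shorter than a DNS header)"
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F
    if rcode != 0:
        return RCODES.get(rcode, f"RCODE {rcode}")
    if ancount == 0:
        return "NOERROR but no answer records"
    return f"NOERROR with {ancount} answer record(s)"


def probe(server_ip: str, hostname: str, timeout: float = 3.0) -> str:
    """Send one UDP query to server_ip:53 and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "no response (timeout)"
        except OSError as exc:
            return f"network error: {exc}"
    return classify_response(response)
```

A call such as `probe("10.0.0.2", "example.com")` (placeholder values) returns one of the summary strings above. Note that the answer count also includes CNAME records, so a non-zero count does not by itself prove that IP addresses were returned; a packet capture is still needed to inspect the full contents.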

Example 1
- The DNS response returns NXDOMAIN or contains no IP addresses, which leads to a negative caching issue.
- The customer's internal DNS server intermittently returns an unexpected response (e.g. NXDOMAIN) instead of the expected IP addresses.
- The Mule application negatively caches this empty response. Note that the negative cache time is determined by the domain's SOA record (e.g. 30 minutes).
- Once the DNS issue has been resolved, restart the Mule application to clear its DNS cache. Otherwise, DNS queries will continue to fail until the negative cache expires.

To confirm this, run a tcpdump (packet capture) on the Mule application worker (engage MuleSoft support to do this), on the customer's edge network device, and on the DNS server. The captures show whether a response was received and what it contained.
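
The length of the negative-cache window in Example 1 can be estimated from the SOA record. Under RFC 2308, the negative caching TTL is the smaller of the SOA record's own TTL and its MINIMUM field. The sketch below applies that rule; the function names and the sample values are illustrative, and it assumes the resolver on the worker follows RFC 2308 without applying its own cap.

```python
from datetime import datetime, timedelta


def negative_cache_ttl(soa_record_ttl: int, soa_minimum: int) -> int:
    """Negative caching TTL in seconds: min(SOA TTL, SOA MINIMUM) per RFC 2308."""
    return min(soa_record_ttl, soa_minimum)


def negative_cache_expiry(failed_at: datetime, soa_record_ttl: int, soa_minimum: int) -> datetime:
    """Earliest time at which a negatively cached answer should be queried again."""
    return failed_at + timedelta(seconds=negative_cache_ttl(soa_record_ttl, soa_minimum))


# Illustrative values: SOA TTL of 1 hour, MINIMUM field of 30 minutes.
first_failure = datetime(2024, 1, 1, 10, 0, 0)
print(negative_cache_expiry(first_failure, soa_record_ttl=3600, soa_minimum=1800))
```

With these sample values the negative entry lasts 30 minutes, which matches the kind of window described in Example 1. Comparing the computed expiry with the time the errors actually stopped can help confirm whether negative caching is the cause.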

Example 2
- The DNS query is sent to the customer's DNS server, which holds a CNAME record for the name. The CNAME record has a TTL of 1 hour.
- Normally, the customer's DNS server performs a recursive query to obtain the IP addresses (A records, with a 30-second TTL). In this case the recursive query fails, so the customer's DNS server responds with the CNAME record but no IP addresses.
- The Mule application caches the CNAME result for 1 hour (based on the advertised TTL) instead of caching the usual A records for 30 seconds.
- DNS resolution therefore fails until the cached CNAME result expires (1 hour in this example).
- To mitigate this scenario, set a lower TTL for the CNAME record.
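
The effect described in Example 2 can be illustrated with a toy TTL cache. This is a simplified model written for this explanation, not the Mule runtime's actual caching code; the class, function, hostnames, and IP address are all hypothetical. It only shows why a CNAME-only answer with a long TTL keeps failing for much longer than a healthy answer would stay cached.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CachedAnswer:
    cname: Optional[str]
    addresses: List[str]
    expires_at: float  # seconds on a simulated clock


class TtlCache:
    """Toy cache that keeps each answer until its advertised TTL elapses."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedAnswer] = {}

    def store(self, name: str, cname: Optional[str], addresses: List[str],
              ttl: float, now: float) -> None:
        self._entries[name] = CachedAnswer(cname, list(addresses), now + ttl)

    def lookup(self, name: str, now: float) -> Optional[CachedAnswer]:
        entry = self._entries.get(name)
        if entry is None or now >= entry.expires_at:
            self._entries.pop(name, None)  # expired entries are dropped
            return None
        return entry


def resolve(cache: TtlCache, name: str, now: float) -> str:
    """Describe what a client relying on the cache would see at time `now`."""
    entry = cache.lookup(name, now)
    if entry is None:
        return "cache miss: query the DNS server again"
    if not entry.addresses:
        return "cached failure: no addresses until entry expires"
    return "cached hit: " + ", ".join(entry.addresses)


# Faulty response: CNAME only, TTL 1 hour, cached at t=0.
cache = TtlCache()
cache.store("app.example.internal", "target.example.internal", [], ttl=3600, now=0)
for t in (60, 1800, 3600):
    print(t, resolve(cache, "app.example.internal", now=t))
```

In this model the lookups at 60 and 1800 seconds both fail from cache, and only at 3600 seconds is the server queried again. Lowering the CNAME TTL shortens that window accordingly.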

Other known issues:
- Unable to reach internal URL or VPC special domain from CloudHub worker and getting Couldn't resolve address (UnresolvedAddressException)
- Couldn't Resolve Address Due to On-Premise Network Routing Issues
