TimeoutError and TransportException, DNS resolution timed out

Question

Hi,

We're using Azure Cosmos DB with the SQL API from the Node.js SDK. Everything worked fine until last weekend, but since yesterday some of our requests have been taking much longer than usual, and occasionally they fail with this error:

2020-09-01T08:06:59.618025438Z: [INFO]   [Nest] 35   -  09/01/2020, 8:06:59 AM    [ExceptionsHandler]  Object: 
2020-09-01T08:06:59.618075140Z: [INFO]  {
2020-09-01T08:06:59.618083440Z: [INFO]    "code": "TimeoutError",
2020-09-01T08:06:59.618090240Z: [INFO]    "name": "TimeoutError",
2020-09-01T08:06:59.618096541Z: [INFO]    "headers": {
2020-09-01T08:06:59.618102941Z: [INFO]      "x-ms-throttle-retry-count": 0,
2020-09-01T08:06:59.618109341Z: [INFO]      "x-ms-throttle-retry-wait-time-ms": 0
2020-09-01T08:06:59.618115641Z: [INFO]    }
2020-09-01T08:06:59.618121642Z: [INFO]  }

In some other cases we see the following error instead, or in combination with the one above:

2020-09-01T08:06:55.612181636Z: [INFO]  ResponseTime: 2020-09-01T08:06:11.8461009Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/apps/d060e46a-6fa1-445e-86fe-347c0cf24dcc/services/521d14e3-9187-41c8-a6b3-a4ee05942f36/partitions/f7f8cddc-ca3b-4c9c-898c-74a99268e71c/replicas/132430956095294498p/, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 410, SubStatusCode: 0, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: DNS resolution timed out. (Time: 2020-09-01T08:06:11.8461009Z, activity ID: 8ff21530-10d9-4d8a-a41d-448268c341fa, error code: DnsResolutionTimeout [0x0004], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/, connection:  -> rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/, payload sent: False, CPU history: (2020-09-01T08:05:18.7064249Z 96.456), (2020-09-01T08:05:28.7363304Z 97.707), (2020-09-01T08:05:44.1162586Z 97.891), (2020-09-01T08:05:48.6462226Z 96.816), (2020-09-01T08:05:59.1062371Z 93.360), (2020-09-01T08:06:08.6761135Z 99.130), CPU count: 40), ResourceType: Document, OperationType: Create

We also experience these issues when we don't use the SDK at all and only access the database through the Azure portal.

Both errors (at least in my opinion) indicate an infrastructure issue, but neither the Azure status page (status.azure.com) nor the service health page within the portal shows any sign of a known issue. However, it's hard to believe that we're the only ones who have been facing this with all of our databases since yesterday (unless there is some kind of misconfiguration on our end).

Is anybody else experiencing the same? Or can an MSFT member maybe shed some light on this?

Thanks in advance!

Update #1 (2020-09-04):

We've raised a support ticket and received the hint that "the issue is due to high CPU usage. [...] The error message shared by you in the ticket indicates that the client is running at high CPU. [...] This level of CPU starvation is guaranteed to cause TCP connections to fail at the client, or in this case DNS resolutions failures. Can you please make sure that the client is not running at high CPU during your workloads?"

The first part of this reply makes perfect sense: the high CPU values are visible in the second error message posted above ("CPU history: (2020-09-01T08:05:18.7064249Z 96.456), ..."). However, I don't think these values refer to our client; I think they refer to the database side. The hint is the "CPU count: 40" after the CPU history, as I'm pretty sure our client is not running on that many CPUs.
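
For anyone who wants to check these values programmatically rather than by eye, here is a minimal sketch that pulls the CPU samples and the CPU count out of such a message. The helper names `parseCpuHistory` and `parseCpuCount` are hypothetical and not part of any SDK, and the sketch assumes the message keeps the "CPU history: (timestamp percent), ..., CPU count: N" layout shown in the second error above.

```typescript
// Hypothetical helpers: they only parse the text layout seen in the error message above.
interface CpuSample {
  timestamp: string; // timestamp exactly as it appears in the message
  percent: number;   // CPU load in percent
}

function parseCpuHistory(message: string): CpuSample[] {
  // Isolate the text between "CPU history:" and "CPU count:" (or the end of the string).
  const start = message.indexOf("CPU history:");
  if (start === -1) return [];
  const end = message.indexOf("CPU count:", start);
  const section = message.slice(start, end === -1 ? undefined : end);

  // Each sample looks like "(2020-09-01T08:05:18.7064249Z 96.456)".
  const samplePattern = /\((\S+Z)\s+(\d+(?:\.\d+)?)\)/g;
  const samples: CpuSample[] = [];
  let match: RegExpExecArray | null;
  while ((match = samplePattern.exec(section)) !== null) {
    samples.push({ timestamp: match[1], percent: Number(match[2]) });
  }
  return samples;
}

function parseCpuCount(message: string): number | undefined {
  const match = /CPU count:\s*(\d+)/.exec(message);
  return match ? Number(match[1]) : undefined;
}

// Usage with a shortened excerpt of the message above.
const excerpt =
  "CPU history: (2020-09-01T08:05:18.7064249Z 96.456), " +
  "(2020-09-01T08:06:08.6761135Z 99.130), CPU count: 40)";
const peak = Math.max(...parseCpuHistory(excerpt).map((s) => s.percent));
console.log(`peak CPU: ${peak}%, CPU count: ${parseCpuCount(excerpt)}`);
```

Comparing the reported CPU count with the number of cores on your own client machine is a quick way to judge which machine the values describe.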

Accepted Answer

I can confirm that our CosmosDB was impacted by a service issue (9VVY-VZ0), according to which users in West Europe "may have experienced timeouts and performance issues when attempting to use SQL API queries in Gateway mode". There was an "instance of a backend service [that] encountered a performance issue after reaching an operational threshold, preventing requests from completing".

The engineers "migrated the impacted resources to a healthy instance", which solved this issue in our case. The errors are not appearing anymore.
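
A retry would not have fixed the root cause here, but while such an incident is ongoing a client can at least soften the occasional failures. Below is a minimal sketch of a retry wrapper with exponential backoff that only retries errors whose `code` is "TimeoutError", as in the first error of the question. The names `isTimeoutError` and `withTimeoutRetry` and the default attempt count and delay are assumptions for illustration, not part of the Cosmos DB SDK, and the operation passed in should be safe to repeat.

```typescript
// Hypothetical helpers, not part of the Cosmos DB SDK.

// True when the thrown value carries code "TimeoutError", as in the log above.
function isTimeoutError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "TimeoutError"
  );
}

// Runs `operation`, retrying with exponential backoff on TimeoutError only.
// Any other error, or a timeout on the last attempt, is rethrown unchanged.
async function withTimeoutRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,   // assumed default
  baseDelayMs = 200  // assumed default
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isTimeoutError(err) || attempt >= maxAttempts) throw err;
      // Wait 200 ms, 400 ms, 800 ms, ... before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Any SDK call that returns a promise can be wrapped in the function passed to `withTimeoutRetry`.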
