Hi,
We're using Azure Cosmos DB with the SQL API from the Node.js SDK. Everything worked fine until last weekend, but since yesterday some of our requests have been taking much longer than usual, and occasionally they fail with this error:
2020-09-01T08:06:59.618025438Z: [INFO] [Nest] 35 - 09/01/2020, 8:06:59 AM [ExceptionsHandler] Object:
2020-09-01T08:06:59.618075140Z: [INFO] {
2020-09-01T08:06:59.618083440Z: [INFO] "code": "TimeoutError",
2020-09-01T08:06:59.618090240Z: [INFO] "name": "TimeoutError",
2020-09-01T08:06:59.618096541Z: [INFO] "headers": {
2020-09-01T08:06:59.618102941Z: [INFO] "x-ms-throttle-retry-count": 0,
2020-09-01T08:06:59.618109341Z: [INFO] "x-ms-throttle-retry-wait-time-ms": 0
2020-09-01T08:06:59.618115641Z: [INFO] }
2020-09-01T08:06:59.618121642Z: [INFO] }
In some other cases we see the following error instead, or in combination with the one above:
2020-09-01T08:06:55.612181636Z: [INFO] ResponseTime: 2020-09-01T08:06:11.8461009Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/apps/d060e46a-6fa1-445e-86fe-347c0cf24dcc/services/521d14e3-9187-41c8-a6b3-a4ee05942f36/partitions/f7f8cddc-ca3b-4c9c-898c-74a99268e71c/replicas/132430956095294498p/, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 410, SubStatusCode: 0, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: DNS resolution timed out. (Time: 2020-09-01T08:06:11.8461009Z, activity ID: 8ff21530-10d9-4d8a-a41d-448268c341fa, error code: DnsResolutionTimeout [0x0004], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/, connection: <not connected> -> rntbd://cdb-ms-prod-westeurope1-fd5.documents.azure.com:14320/, payload sent: False, CPU history: (2020-09-01T08:05:18.7064249Z 96.456), (2020-09-01T08:05:28.7363304Z 97.707), (2020-09-01T08:05:44.1162586Z 97.891), (2020-09-01T08:05:48.6462226Z 96.816), (2020-09-01T08:05:59.1062371Z 93.360), (2020-09-01T08:06:08.6761135Z 99.130), CPU count: 40), ResourceType: Document, OperationType: Create
We also experience these issues when not using the SDK at all and only accessing the database through the Azure portal.
Both errors (at least in my opinion) indicate an infrastructure issue, but neither the Azure status page (status.azure.com) nor the service health page within the portal shows any sign of a known issue. However, it's hard to believe that we're the only ones who have been facing this with all of our databases since yesterday (unless there is some kind of misconfiguration on our end).
Is anybody else experiencing the same? Or can an MSFT member maybe shed some light on this?
Thanks in advance!
Update #1 (2020-09-04):
We've raised a support ticket and received a hint "that the issue is due to high CPU usage. [...] The error message shared by you in the ticket indicates that the client is running at high CPU. [...] This level of CPU starvation is guaranteed to cause TCP connections to fail at the client, or in this case DNS resolutions failures. Can you please make sure that the client is not running at high CPU during your workloads?"
The first part of this reply makes perfect sense: the high CPU values can be read in the second error message posted above ("CPU history: (2020-09-01T08:05:18.7064249Z 96.456), ..."). However, I don't think these values refer to the client; they look more like values from the database itself. The hint is the "CPU count: 40" after the CPU history: I'm pretty sure our client is not running on that many CPUs.
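One way to check this assumption is to compare the CPU count reported in the error message with what the client host itself reports. The snippet below is only a rough sketch: `parseCpuDiagnostics` and `describeLocalHost` are hypothetical helper names, the regular expressions assume the message keeps the exact shape shown in the second error above, and `excerpt` is a shortened copy of that message.

```javascript
const os = require("os");

// Extract the "CPU history" samples and the "CPU count" value from a
// diagnostics string shaped like the second error message above.
// Assumption: each sample looks like "(2020-09-01T08:05:18.7064249Z 96.456)".
function parseCpuDiagnostics(message) {
  const samples = [];
  const sampleRe = /\((\d{4}-\d{2}-\d{2}T[\d:.]+Z) (\d+(?:\.\d+)?)\)/g;
  let match;
  while ((match = sampleRe.exec(message)) !== null) {
    samples.push({ time: match[1], percent: Number(match[2]) });
  }
  const countMatch = /CPU count: (\d+)/.exec(message);
  return {
    samples,
    cpuCount: countMatch ? Number(countMatch[1]) : null,
  };
}

// Report what the machine running this Node.js process looks like.
// Note: os.loadavg() always returns [0, 0, 0] on Windows.
function describeLocalHost() {
  return { cpuCount: os.cpus().length, loadAverage: os.loadavg() };
}

// Shortened excerpt of the error message posted above.
const excerpt =
  "CPU history: (2020-09-01T08:05:18.7064249Z 96.456), " +
  "(2020-09-01T08:05:28.7363304Z 97.707), CPU count: 40), " +
  "ResourceType: Document, OperationType: Create";

const reported = parseCpuDiagnostics(excerpt);
const local = describeLocalHost();

console.log("CPU count in error message:", reported.cpuCount);
console.log("CPU count on this host:", local.cpuCount);
console.log("Reported CPU samples (%):", reported.samples.map((s) => s.percent));
```

If the two CPU counts differ, the "CPU history" values most likely were not measured on the machine running our client code. That would support the suspicion above, although it does not by itself tell us which machine they were measured on.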