Summary
Between 10/16/23 19:00 UTC (the first observed issue) and 10/19/23 04:30 UTC (when a fix was deployed), some client requests intermittently timed out and returned 5xx error responses. Overall, 0.59% of requests during this period resulted in a 500-level response status.
Timeline
- 10/16, 19:38 UTC — We were alerted to elevated 5xx response codes and high p95 latency.
- 10/16, 20:08 UTC — We observed rising counts of open file handles on a single application pod, and killed the pod in question. This recurred twice over the next 40 minutes.
- 10/16, 21:20 UTC — We confirmed that the open handles were not database connections. We suspected that newly introduced error-reporting software might be leaking file handles and reverted that commit.
- 10/16, 23:34 UTC — We were again alerted and observed the same symptoms; we again killed the problematic pod. This recurred twice over the next 40 minutes.
- 10/17, 16:17 UTC — We deployed a configuration change to our primary database to guard against transactions stuck in an idle timeout state.
- 10/17, 19:44 and 23:39 UTC — We received additional batches of alerts, and killed the problematic pods.
- 10/17, 19:45 UTC — We deployed additional metrics for connection states in database connection pools.
- 10/17, 20:12 UTC — We merged a change to automatically remove a pod from service when idle connection counts reached a threshold (see the readiness-check sketch after this timeline).
- 10/18, 10:30 UTC — We increased our minimum pod count as a mitigation.
- 10/18, 21:30 UTC — We observed another batch of occurrences, and the problematic pods were automatically removed from service.
- 10/19, 04:08 UTC — We identified the suspected root cause and deployed a fix.
- We continued to monitor and did not observe the issue recur after the above fix was deployed.
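
For illustration, the automated removal merged at 20:12 UTC works along the lines of the readiness check sketched below. This is a minimal sketch rather than our actual implementation: the threshold value, the `connection_states` mapping, and the `is_ready` helper are hypothetical stand-ins for the per-pod pool metrics we added on 10/17. A pod whose count of stuck connections stays above the threshold fails its readiness probe and is removed from service until it recovers.

```python
# Hypothetical readiness check: every name and value here is illustrative,
# not the production implementation.
IDLE_IN_TRANSACTION_THRESHOLD = 5  # hypothetical threshold


def is_ready(connection_states):
    """Return False when too many pool connections appear stuck.

    `connection_states` is assumed to be a mapping of state name -> count
    taken from the per-pod connection pool metrics, for example
    {"active": 3, "idle": 10, "idle_in_transaction": 7}.
    """
    stuck = connection_states.get("idle_in_transaction", 0)
    return stuck < IDLE_IN_TRANSACTION_THRESHOLD


if __name__ == "__main__":
    # A pod reporting 7 stuck connections fails its readiness probe and is
    # taken out of service by the orchestrator until the count recovers.
    print(is_ready({"active": 3, "idle": 10, "idle_in_transaction": 7}))  # False
```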
Root Cause
A bug introduced the possibility of deadlock in our primary database write pools.
During the indexing of new recordings, the service performs many updates within a single transaction. We had added tracking of a site’s “last seen” time via a throttled updater service that is called within that transaction but checks out its own connection from the same write pool. When a pod’s write pool filled up with indexing transactions and the “last seen at” updater attempted its update, no connections were available: each transaction held a connection while waiting on the updater, and the updater was waiting for a connection that would never be released, so no requests on that pod could make progress.
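
The failure mode can be reproduced without a real database. The following is a minimal simulation under assumed names (`WRITE_POOL`, `index_recording`, `touch_last_seen_at`, and the pool size are all hypothetical): a worker that already holds a pool connection inside its transaction blocks forever while trying to check out a second connection from the same exhausted pool.

```python
import queue
import threading

# Toy stand-in for a bounded database write pool (size is hypothetical).
POOL_SIZE = 2
WRITE_POOL = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    WRITE_POOL.put(f"conn-{i}")


def touch_last_seen_at(site_id):
    # The throttled "last seen" updater checks out its own connection from
    # the same pool instead of reusing the caller's transaction connection.
    conn = WRITE_POOL.get()  # blocks forever once the pool is exhausted
    try:
        print(f"UPDATE sites SET last_seen_at = now() WHERE id = {site_id} -- via {conn}")
    finally:
        WRITE_POOL.put(conn)


def index_recording(site_id):
    # Each indexing job holds one pool connection for its entire transaction.
    conn = WRITE_POOL.get()
    try:
        # ... many index updates inside the transaction ...
        touch_last_seen_at(site_id)  # needs a second connection -> deadlock
    finally:
        WRITE_POOL.put(conn)


# With as many concurrent indexing jobs as pool slots, every connection is
# held by a transaction and every transaction waits for a free connection.
workers = [
    threading.Thread(target=index_recording, args=(i,), daemon=True)
    for i in range(POOL_SIZE)
]
for w in workers:
    w.start()
for w in workers:
    w.join(timeout=2)
if any(w.is_alive() for w in workers):
    print("deadlock: all pool connections are held by transactions that are "
          "waiting to check out a second connection")
```

In general, this pattern is avoided by having the nested caller reuse the transaction’s existing connection or draw from a pool of its own rather than checking out a second connection from the shared write pool.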
Steps We Are Taking
In addition to deploying a fix for the identified cause, we have improved our application test suite to identify additional errors of this type. As noted above, we have updated our application metrics, monitoring, and readiness check to look for similar problems in the future.
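
As an illustration of the test-suite improvement, one way to surface this class of error is to exercise the transactional path against a write pool of size one with a deadline, so any code path that tries to check out a second connection while one is already held hangs and fails the test instead of deadlocking in production. The sketch below assumes a pytest-style suite; `configure_write_pool` and `index_recording` are hypothetical stand-ins for the real application entry points, not our actual tests.

```python
import threading

import pytest  # assumes a pytest-style suite; purely illustrative


def assert_completes(fn, timeout_seconds=2.0):
    """Treat a call that never finishes as a likely pool deadlock."""
    worker = threading.Thread(target=fn, daemon=True)
    worker.start()
    worker.join(timeout=timeout_seconds)
    if worker.is_alive():
        pytest.fail("call did not finish: possible nested checkout from the write pool")


def test_indexing_completes_with_a_single_connection_pool():
    # `configure_write_pool` and `index_recording` are hypothetical stand-ins
    # for the real application setup and indexing entry point. With a pool of
    # size one, any path that needs a second connection while holding one
    # will hang and fail the deadline above.
    configure_write_pool(size=1)
    assert_completes(lambda: index_recording(site_id=1))
```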