Summary

Between 10/16/23 19:00 UTC (the first observed issue) and 10/19/23 04:30 UTC (a fix deployed), during intermittent periods of time, some client requests experienced timeouts with a 5xx error status code. Overall, 0.59% of requests resulted in a 500-level response status.

Timeline

Root Cause

A bug introduced the possibility of deadlock in our primary database write pools.

During the indexing of new recordings, the service performs many updates within a single transaction. We added tracking for a site’s “last seen” time via a throttled updater service called within a transaction, but using its own connection. When the write pool on a pod filled up with index processing transactions and the “last seen at” updater attempted its update, there were no available connections left in the pool and no requests on the pod could make progress.

Steps We Are Taking

In addition to deploying a fix for the identified cause, we have improved our application test suite to identify additional errors of this type. As noted above, we have updated our application metrics, monitoring, and readiness check to look for similar problems in the future.