Foxglove experienced a partial outage on September 5, 2023, during which many users were unable to use Foxglove’s online data visualization features or most Foxglove API endpoints.

Summary

From 15:42 to 17:55 UTC, Foxglove experienced a service outage that resulted in HTTP timeouts for most Foxglove API requests. Offline visualization was not affected.

Timeline

15:42 UTC – We began to receive alerts about high CPU usage on our primary database.

15:55 UTC – We observed an elevated rate of 524 (timeout) responses from our load balancer, and our API's ability to serve traffic was degraded.

16:09 UTC – Using slow query logging, we identified a particular query that took multiple seconds to run and was being called 3–5 times per second, exhausting our available database connections. We killed the slow queries, which reduced CPU usage, but only temporarily as more requests came in.
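
As an illustration only (assuming a PostgreSQL database, which the trigram index discussed under Root Cause suggests; these are not our exact commands), long-running statements can be found and terminated like this:

    -- Find the longest-running active statements
    SELECT pid, now() - query_start AS duration, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state = 'active'
    ORDER BY duration DESC;

    -- Terminate active statements that have been running for more than a few seconds
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()
      AND now() - query_start > interval '5 seconds';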

16:15 UTC – We notified customers via foxglovestatus.com of an ongoing incident.

16:24–16:43 UTC – A coincident issue at GitHub prevented us from issuing fixes or running CI to address the underlying problem directly. To mitigate the issue, we increased the number of available database connections, but these were quickly consumed by the scripted requests.

17:00 UTC – GitHub was available again, and we merged the fixes we had prepared to address the problem. The fix required first adding a B-tree index to our recordings column; the index build proceeded slowly because of the already-high load on the database.
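
As a rough sketch of that step (the table and column names are illustrative, not our actual schema), an index like this can be built without blocking writes, though it still competes for CPU and I/O on a busy database:

    -- Build a B-tree index (the default index type) without locking out writes
    CREATE INDEX CONCURRENTLY recordings_path_idx ON recordings (path);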

17:20 UTC – We temporarily disabled the API key of the script triggering the slow query, and notified the customer in question.

17:50 UTC – API service levels were beginning to return to normal.

17:55 UTC – Our 524 error rate had dropped to zero.

18:15 UTC – We merged the API fix that removed the slow query, and at 18:40 we re-enabled the API key in question.

Root Cause

The slow query identified at 16:09 UTC was performing a LIKE comparison with full recording paths as inputs. While the column had an appropriate trigram index, the inputs exceeded 200 characters, and matching them consumed all available CPU.
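
Schematically (with illustrative table and column names, an assumed PostgreSQL database, and a guessed wildcard placement), the setup looked something like this:

    -- Trigram index intended to accelerate pattern matching on recording paths
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX recordings_path_trgm_idx
        ON recordings USING gin (path gin_trgm_ops);

    -- The slow query: a LIKE comparison whose input was a full recording path
    -- (in the real traffic the pattern exceeded 200 characters)
    SELECT *
    FROM recordings
    WHERE path LIKE '%/robots/unit-42/2023-09-05/drive-001.mcap%';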

This behavior was introduced as a convenience for our web console, but our API documentation specified an exact match in this case, which would have sufficed for the requests in question.
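
Because an exact match is what the API documents, the same lookup can be served with a plain equality comparison, which the new B-tree index handles cheaply (again, illustrative names):

    -- Exact match, as documented; served by the B-tree index rather than trigram matching
    SELECT *
    FROM recordings
    WHERE path = '/robots/unit-42/2023-09-05/drive-001.mcap';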

Next Steps

We are implementing better internal procedures to notify customers in multiple places, including community and private support Slack channels.

We are also implementing additional alerting policies to improve visibility into our systems and allow us to take action sooner when an incident occurs.