Summary

From September 27, 2023, 15:30 UTC to September 28, 2023, 06:45 UTC, Foxglove experienced service degradation that resulted in high latencies on many requests to the Foxglove API. Offline visualization was not affected.

Timeline

At 15:30 UTC, P95 and P99 request latencies began to spike.

At 16:44 UTC, we were alerted to an elevated backlog of data imports.

At 17:27 UTC, we began to receive alerts about high CPU usage on our primary database. The immediate cause was a high number of writes to Sites during recording import.

At 19:05 UTC, we deployed a fix for the CPU spike.

At 19:41 UTC, we received a report of timeouts to the stream service. We tracked this down to a query that had become inefficient in some cases, and investigated alternative query plans.

At 20:30 UTC, we began an index build to address the inefficient stream service query. CPU on the database was near 100% during indexing, which took several hours.
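
For context only, the sketch below illustrates how an index can be added without blocking ongoing reads and writes, assuming a PostgreSQL database accessed from TypeScript with node-postgres. The recordings table, columns, and index name are hypothetical stand-ins, not the actual schema or index involved here.

```typescript
import { Pool } from "pg";

// Hypothetical schema: table and column names are illustrative only.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function buildIndex(): Promise<void> {
  // CONCURRENTLY keeps the table available for reads and writes while the
  // index builds, at the cost of a longer build and sustained CPU load on
  // the database, consistent with the multi-hour build described above.
  await pool.query(
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS recordings_device_created_idx
       ON recordings (device_id, created_at)`
  );
}

buildIndex()
  .catch((err) => console.error("index build failed:", err))
  .finally(() => pool.end());
```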

At 21:00 UTC, P99 exceeded 30s for a short time.

By 22:30 UTC, P50, P95, and P99 request latencies were all increasing steadily. We observed an increase in data ingestion rates from a single source, which had scaled up to work through a backlog of recordings.

By 02:00 UTC, P95 latency exceeded 20s. The supporting index was still building, and import volumes remained high.

At 05:29 UTC (01:29 ET), the index build finished, and data ingestion rates improved.

By 06:45 UTC, the platform had worked through ingestion backlogs and latencies returned to normal levels.

Root Cause

Because of high data ingestion rates sustained over an extended period (reaching 70k/hr from a single source), application connection pools became saturated with write requests, and the API’s ability to serve read requests degraded. High CPU usage on the database slowed ingestion rates further, lengthening the duration of this incident.
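
To illustrate the failure mode (not the actual application code): when reads and writes share a single connection pool, a sustained burst of slow writes can hold every connection, leaving read queries queued behind them. The pool size and queries below are assumptions made for the sketch.

```typescript
import { Pool } from "pg";

// Single shared pool; the size of 20 is a hypothetical value.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 20 });

// Import path: each write holds a connection for the duration of the
// statement. Enough of these in flight occupy the entire pool.
export async function handleImportWrite(payload: unknown): Promise<void> {
  await pool.query("INSERT INTO recordings (data) VALUES ($1)", [payload]);
}

// API read path: once all connections are held by writes, this query waits
// in the pool's queue before it even reaches the database, so request
// latency reflects queueing time on top of query time.
export async function handleApiRead(id: string): Promise<unknown> {
  const { rows } = await pool.query(
    "SELECT * FROM recordings WHERE id = $1",
    [id]
  );
  return rows[0];
}
```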

Steps We Are Taking

We announced planned maintenance and upgraded database resources on September 28. We have since separated the application’s read and write pools, and are adding monitoring for pool waiting. We have also added monitoring specifically for P95 request latencies.
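
As a minimal sketch of that direction, assuming node-postgres and hypothetical pool sizes and environment variables: reads and writes get dedicated pools so import traffic cannot exhaust read capacity, and each pool's waitingCount serves as the "pool waiting" signal to alert on.

```typescript
import { Pool } from "pg";

// Hypothetical sizes and connection strings; real values depend on
// database capacity and deployment.
const writePool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });
const readPool = new Pool({
  connectionString: process.env.DATABASE_READ_URL ?? process.env.DATABASE_URL,
  max: 30,
});

// Imports and other writes no longer compete with API reads for connections.
export const runWrite = (text: string, params?: any[]) => writePool.query(text, params);
export const runRead = (text: string, params?: any[]) => readPool.query(text, params);

// node-postgres exposes waitingCount: queries queued for a free connection.
// A sustained non-zero value is an early sign of pool saturation.
setInterval(() => {
  for (const [name, p] of [["write", writePool], ["read", readPool]] as const) {
    console.log(
      `${name} pool: total=${p.totalCount} idle=${p.idleCount} waiting=${p.waitingCount}`
    );
  }
}, 10_000);
```

In production these counts would feed a metrics system rather than logs, but the same three properties carry the signal.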