Summary

On September 28, 2023, from 19:30 UTC to 22:00 UTC, Foxglove experienced service degradation that resulted in high latencies on many requests to the Foxglove API. Offline visualization was not affected.

Timeline

At 19:25 UTC, we received an alert of high CPU usage on the database.

At 19:30 UTC, P95 and P99 latencies had risen to 3s and 7s, respectively. We noted significantly elevated request rates to the streaming service, though still within our rate limits. A review of slow query plans confirmed that no single query was responsible, and that resources on the primary database were insufficient for the load. We had already planned resource upgrades for later in the day.

At 20:00 UTC, we noted that despite high CPU usage, the primary database had capacity to serve more requests, so we manually increased the number of application containers serving requests.

By 22:00 UTC, request latencies returned to normal.

At 01:00 UTC on September 29, 2023, we performed planned upgrades to database instances.

Root Cause

Significantly elevated request rates put pressure on our primary database before our planned maintenance window.

Steps We Are Taking

We are continuing to improve monitoring of database usage and request latencies. We will continue to perform database upgrades as needed through maintenance windows announced at https://foxglovestatus.com.
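As an aside on the P95/P99 figures cited in the timeline, tail-latency percentiles like these are a common basis for latency alerting. The sketch below is purely illustrative (the sample data, thresholds, and function are hypothetical, not Foxglove's actual monitoring), showing a nearest-rank percentile over a window of request durations:

```python
# Hypothetical sketch: computing P95/P99 request latencies from a
# window of observed durations, as a basis for alerting thresholds.
# Sample data and thresholds are illustrative only.

def percentile(samples, p):
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: ceil(n * p / 100), converted to a 0-based index.
    rank = max(1, -(-len(ordered) * p // 100))  # ceiling division
    return ordered[rank - 1]

# Illustrative latencies (seconds) for one monitoring window.
latencies_s = [0.12, 0.30, 0.25, 3.1, 0.18, 7.2, 0.22, 0.40, 0.15, 0.28]
p95 = percentile(latencies_s, 95)
p99 = percentile(latencies_s, 99)

# Alert when tail latency exceeds an agreed budget.
if p95 > 1.0 or p99 > 5.0:
    print(f"ALERT: p95={p95:.2f}s p99={p99:.2f}s")
```

In practice this computation is usually done by a metrics system (e.g., from histogram buckets) rather than over raw samples, but the nearest-rank definition above captures what a "P99 of 7s" measurement means.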