On September 28, 2023, from 19:30 UTC to 22:00 UTC, Foxglove experienced service degradation that resulted in high latencies on many requests to the Foxglove API. Offline visualization was not affected.
At 19:25 UTC, we received an alert of high CPU usage on the database.
At 19:30 UTC, P95 and P99 latencies had risen to 3s and 7s, respectively. We noted significantly elevated request rates to the streaming service, though still within our rate limits. We investigated logs of slow query plans, which confirmed that there was no single cause and that resources on the primary database were insufficient. We had already planned resource upgrades for later in the day.
At 20:00 UTC, we noted that despite high CPU usage, the primary database was able to serve more requests, and we manually increased the number of application containers available to serve requests.
By 22:00 UTC, request latencies returned to normal.
At 01:00 UTC on September 29, 2023, we performed planned upgrades to database instances.
Significantly elevated request rates put pressure on our primary database before our planned maintenance window.
We are continuing to improve monitoring of database usage and request latencies. We will continue to perform database upgrades as needed through maintenance windows announced at https://foxglovestatus.com.