On June 28, 2025, a critical issue caused all of our services to become unavailable for roughly 24 hours.
To be clear: this incident was confined to our own account with our database provider, Neon, and did not affect Neon's other customers or services.
Here's what happened, what we've learned, and how we're making sure it doesn't happen again.
What Caused The Outage
Everything was operating normally until an unexpected spike in internal compute usage pushed us past a key usage threshold with Neon.
The spike wasn't caught early, and once the quota was exceeded, our database access was automatically restricted.
Why Everything Went Down
When the databases under our organization went offline, so did everything that depended on them. Because every critical service we run relies on a database under that same organization, the result was a total shutdown across all services.
Why It Took 24 Hours To Recover
Even after fixing the underlying issue, we weren't able to restore service right away. That's because the quota in question resets only at the end of the current billing cycle, not on a rolling or hourly basis.
We used that waiting period to verify that every system would come back up cleanly once the quota reset.
What's Changing
This incident exposed some key areas we need to improve, and we've already taken several steps to prevent this from happening again:
- Usage monitoring with alerts across all of our infrastructure (a rough sketch of the kind of check we mean follows this list)
- Internal rate limits to catch runaway workloads before they exhaust a quota
- Early work on database fallback paths and backups
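To make the first item concrete, here is a minimal sketch of the kind of threshold check we have in mind. The interface names, the 50/80/95% thresholds, and the notify() stub are illustrative assumptions rather than our production setup; a real alert would feed into our monitoring and paging tools instead of the console.

```ts
// Illustrative sketch only: alert when compute usage crosses a fraction of the
// billing-cycle quota. Names, thresholds, and notify() are hypothetical.

interface UsageSnapshot {
  computeHoursUsed: number;  // usage so far in the current billing cycle
  computeHoursQuota: number; // quota for the current billing cycle
}

// Alert well before 100% so there is time to react before access is cut off.
const ALERT_THRESHOLDS = [0.5, 0.8, 0.95];

function thresholdsCrossed(snapshot: UsageSnapshot): number[] {
  const ratio = snapshot.computeHoursUsed / snapshot.computeHoursQuota;
  return ALERT_THRESHOLDS.filter((t) => ratio >= t);
}

function notify(message: string): void {
  // Placeholder: a real alert would page on-call or post to a team channel.
  console.warn(message);
}

// Example of a scheduled check (e.g. run every few minutes):
const snapshot: UsageSnapshot = { computeHoursUsed: 92, computeHoursQuota: 100 };
for (const threshold of thresholdsCrossed(snapshot)) {
  notify(`Compute usage has reached ${Math.round(threshold * 100)}% of the quota.`);
}
```

The goal is simply to surface the early warning marks with enough lead time to intervene, rather than learning about the problem only after access has been restricted.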
Thanks for Your Patience
Outages are never acceptable, and we appreciate your trust while we work to build a more resilient system. If you were impacted or have questions, you can reach us any time at management@secton.org.
— The Secton Team