William Ting is a longtime FOSS advocate with contributions to various projects (Pelican, autojump, pyramid_swagger, Rust, GNOME). He's currently an infrastructure engineer at Reddit, and was previously on the Yelp Transaction Platform team.
Self-Healing Systems: The Road to 99.99% Uptime
Scalable Python, Intermediate
Stop firefighting and start fireproofing! Many tools make on-call easier and increase availability, but we'll focus mostly on a few principles and design patterns that help make your systems more robust.
Feature velocity is typically a higher priority early in a software project's lifecycle, but as the system matures, the effort shifts toward fireproofing it. On the Yelp Transactions Platform team we've used a combination of circuit breakers, queues, and idempotent operations to minimize downtime and middle-of-the-night wake-ups.
We'll take a look at how these design patterns help us in a distributed system, when they should be used, and the common pitfalls associated with them.
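To make the first of those patterns concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration only, not the implementation used at Yelp: the `CircuitBreaker` and `CircuitOpenError` names, the `max_failures` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for this example. The idea is that after several consecutive failures the breaker "opens" and rejects calls immediately instead of waiting on a struggling dependency, then lets a single trial call through once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast rather than piling load onto a failing dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_order, order_id)` where `fetch_order` is a placeholder for any remote call, and treat `CircuitOpenError` as a signal to fall back or enqueue the work for later. This sketch is not thread-safe; a production version would also need locking and metrics.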