Contributors: Adrian Luff, Felix Kastner, and James Hartshorn
Twitch is the world’s largest live streaming platform for individuals. There are many ways to view Twitch, including desktop browsers, mobile devices, game consoles, and TV apps. The Client Delivery Platform team owns the infrastructure that delivers Twitch clients to users. Last year, we designed next-generation availability defenses for one of our critical microservices that raised availability from 99.9% (3 9s) to 99.99% (4 9s). In this post I’ll share our design, guiding principles, and results.
Whenever you design high availability for a service, you should take into account both the common considerations that apply to any cloud service and the specific opportunities for your service. The basics of cloud high availability, such as using redundancy to avoid single points of failure, are well known. We’ll be talking about what we’ve done beyond the basics.
We developed the following high availability principles for our service:
30 million people visit www.twitch.tv on a typical day. In response, the Twitch web platform, a single page application, is delivered to the browser by a microservice named Sage, in partnership with a CDN. Sage’s job is to serve up index.html. That sounds pretty simple, but Sage also supports canary releases, A/B testing, policy enforcement, and search engine metadata retrieval. If Sage fails to do its job, the user doesn’t reach Twitch and can’t enjoy live video, chat and everything else Twitch offers. Once the web platform is loaded, Sage is out of the picture and other services support the viewing experience. Therefore, we care about high availability for Sage because you only get one chance to make a first impression.
Let’s define some metrics. We’re measuring availability as the error-free rate of the Sage service load balancers, reported weekly. A related metric is deliverability, the error-free rate from CDN to browser. Availability measures the service, while deliverability measures the actual customer experience.
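To make the "9s" shorthand concrete, here is a minimal sketch of how an error-free rate maps to a count of nines. The function names and the request totals are illustrative assumptions, not Twitch tooling or Twitch data.

```python
import math


def error_free_rate(total_requests: int, error_responses: int) -> float:
    """Fraction of requests that did not end in an error (for example, a 5xx)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - error_responses / total_requests


def nines(rate: float) -> float:
    """Express a rate as a count of nines: 0.999 -> 3.0, 0.9999 -> 4.0."""
    if rate >= 1.0:
        return math.inf
    return -math.log10(1.0 - rate)


# Hypothetical weekly totals at the load balancers (assumed numbers).
weekly_requests = 1_000_000
weekly_errors = 100
availability = error_free_rate(weekly_requests, weekly_errors)
availability_nines = nines(availability)  # about 4.0, i.e. "4 9s"
```

The same two functions apply to deliverability. Only the measurement point changes, from the load balancers to the CDN-to-browser responses.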
Sage runs in over a dozen AWS data centers around the world, fronted by hundreds of CDN points-of-presence. It’s always been highly over-provisioned as a safeguard against denial of service attacks and fluctuations in activity. Sage was the one reliable service we never had to worry about, until the day we did. Last year, it came to our attention that Sage’s availability had dropped slightly below 99.9% one day in March. 3 9s of availability is a minimum expectation of all critical Twitch services. When that happened a second time, we took the short-term action of tripling servers in the affected regions while we performed a deep analysis and decided what to do on a long-term basis.
Our analysis revealed several opportunities for improvement, and our engineers proposed a number of innovative ideas for increasing resilience they had either prototyped or were eager to try. We believed we could obtain an order of magnitude improvement, and proceeded to develop next-generation availability defenses for Sage.
The chart below shows what a denial of service attack on Sage looks like in one region: a burst of requests at a multiple of normal load. Surges are usually handled with ease, but if capacity is exceeded, some users will start receiving 5xx errors. Distributed denial-of-service (DDoS) attacks hit multiple regions simultaneously. They are often over within a few minutes, but are sometimes prolonged. Surges due to a popular event can last hours or even days.
Sage Requests during a Denial of Service attack
Not all availability threats are hostile. Traffic can increase for perfectly innocent reasons, like growth, or surge during an unanticipated, wildly popular event. Regardless of intent, a challenge to availability means war!
Medieval castles are a good analogy for visualizing availability defenses, so we’ll be relating our principles to castle defense. While that might seem more appropriate for a security discussion, you can’t separate availability from security, and many tools for security design work equally well for availability. Chief among them is threat modeling: We first work backwards from service disruptions, identifying what can cause or contribute to a failure. Then we design mitigations for those vulnerabilities.
Example Attack Tree
Castles practiced defense-in-depth, the technique of stacking multiple layers of defense to wear down and demoralize attackers. Some of the defenses were obvious, like the high, thick stone castle walls. Others were hidden and kept secret until needed. As attackers made it past one formidable defense, they ran into another one. Penetrating a castle meant overcoming an overwhelming number of defenses that could include natural features, the surrounding town, moat, drawbridge, portcullis, archers, battlements, murder holes, outer walls, inner walls, and the keep tower. Defense-in-depth was effective in medieval times, and it remains a respected defense strategy today.
Layered Defenses for a Medieval Castle
Defense-in-depth is essential for availability, since we can’t rely on any one mechanism to fully protect us. We leverage multiple, redundant layers of defense for superior protection. When an attacker penetrates one layer of defense, there’s another one behind it, wearing down the attacker’s resolve and resources to the point where further attack is not worth it. Each of our defense mechanisms is part of a defense-in-depth mosaic.
Layered Defenses for Sage
Before attackers reached a castle, they had to first get past the outer defenses. Castles took advantage of natural features of the surrounding area such as cliffs and rivers. Castles situated on hills or mountains could more easily detect invaders, and it was harder for the enemy to move soldiers and weapons into position. The surrounding town was an integrated part of the defenses. The townspeople were expected to both defend the castle and sound the alarm about invaders. The water-filled moat surrounding the castle discouraged tunneling.
The CDN is our first opportunity to recognize hostile traffic and do something about it. For Sage, the CDN recognizes and filters known unwanted bots while passing on friendly bots such as search engine crawlers. During a prolonged attack, if we’re able to identify distinguishing characteristics of attacker traffic, we can configure the CDN to intercept it.
When a castle was under siege, the attackers isolated the castle, cutting off its food and water supply. Well-supplied castles had the reserves to wait out a lengthy siege. The 1266 siege of Kenilworth Castle lasted 6 months before the occupants ran out of food and surrendered. Cities could hold off even longer: the 1648-1669 Siege of Candia lasted 21 years!
A service owner must likewise consider what to do when starved of resources. Our senior systems engineer, James Hartshorn, devised a facility we call Sage Origin Backup. Sage and the CDN don’t normally cache index.html because new releases are frequent. However, we do cache the latest release of index.html for failover purposes. When Sage is unavailable and responds to a request with a 5xx error, the CDN steps in and serves a stale index.html in its place. The user is completely unaware of the service disruption and isn’t affected by it. Although bypassing Sage means doing without the service’s business logic, it’s only for a short while and the all-important task of serving up index.html is preserved.
Sage Origin Backup
To measure the effectiveness of this, we track a metric called Deliverability, which tells us how often customers receive 200 responses vs. 5xx errors from the CDN. Most weeks we hit 5 9’s of Deliverability thanks to this remarkably effective mechanism.
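The failover behavior described above can be sketched as a toy model. This is not the CDN's actual configuration, which the post does not show. The class name `EdgeWithOriginBackup`, the `(status, body)` response shape, and the scripted origin are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


class EdgeWithOriginBackup:
    """Toy CDN edge: always asks the origin, keeps a stale copy for failover only."""

    def __init__(self, origin: Callable[[], Response]) -> None:
        self._origin = origin
        self._backup: Optional[str] = None  # last good index.html seen

    def handle(self) -> Response:
        status, body = self._origin()
        if status < 500:
            if status == 200:
                self._backup = body  # refresh the backup copy on every success
            return status, body
        if self._backup is not None:
            # Origin failed with a 5xx: serve the stale copy instead of the error.
            return 200, self._backup
        return status, body  # nothing cached yet, so the error passes through


# Hypothetical origin that succeeds, fails once, then recovers with a new release.
scripted = iter([(200, "<html>v42</html>"), (503, ""), (200, "<html>v43</html>")])
edge = EdgeWithOriginBackup(lambda: next(scripted))
results = [edge.handle() for _ in range(3)]
```

In this sketch the second request is answered with the stale `v42` page rather than a 503, which is the effect a Deliverability metric would capture. Note that the backup path skips whatever business logic the origin would have applied.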
The most obvious layer of castle defense is its thick stone wall, also called a curtain wall. Everyone knows the importance of castle walls. It’s your standing defense, the foundation of what protects you. Visually intimidating, the wall was a constant reminder to would-be attackers of the castle’s integrity.
The service equivalent of a castle wall is our provisioning level, the amount of capacity in each region. When an attack comes, suddenly and without warning, our standing defense needs to be adequate. We allocate capacity not for standard load but for standard attack load.
Sage had traditionally been running on hundreds of servers in over a dozen AWS data centers around the world, all highly over-provisioned to guard against denial of service attacks. Each region had the same capacity. While that arrangement had been working well for years, some things had changed over time. Twitch traffic had grown: were we still adequately provisioned? How much did load vary around the world? Were we even in the right regions?
We sought to better understand our traffic and load. Analysis revealed a wide variance in regional load: our busiest region saw 24 times the traffic of our least-busy region! Moreover, some regions were more prone to denial of service attacks than others. That meant our “thick skin” of protection was not as thick as we thought it was everywhere. It was also uncomfortable to have 45% of all traffic routed through just 2 data centers: while there was no capacity concern, a full regional outage could impact a large portion of our users.
Previous Regional Load Share
We studied our traffic frequency distribution to understand the peak load each region was subjected to. The resultant heat map of requests per hour by region allowed us to profile each region’s load. We also studied requests by minute to understand the intensity of surges.
Requests Per Hour Frequency Distribution By Region
Equipped with these insights, we moved Sage from a fixed provisioning model to a proportional provisioning model. We adjusted our regional allocations based on measured load, times an overprovisioning factor. We also added more regions in Europe to distribute the load. As a result, Sage now has very different regional allocations but an equal amount of protection in each region. Our proportional provisioning approach ensures each region is adequately protected while not being wasteful. Despite adding new regions, monthly costs increased only 1.5% because this redistribution of provisioning reduced allocations in many regions.
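As a rough illustration of proportional provisioning, the sketch below sizes each region from its measured load times an overprovisioning factor. The region names, load figures, per-instance capacity, factor, and minimum instance count are hypothetical. The post does not publish the real values.

```python
import math
from typing import Dict


def proportional_allocation(
    peak_load: Dict[str, float],
    per_instance_capacity: float,
    overprovision_factor: float,
    min_instances: int = 2,
) -> Dict[str, int]:
    """Instances per region = ceil(measured load * factor / capacity per instance)."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return {
        region: max(
            min_instances,
            math.ceil(load * overprovision_factor / per_instance_capacity),
        )
        for region, load in peak_load.items()
    }


# Hypothetical peak requests per second; the busiest region carries 24x the quietest.
peak_load = {"region-a": 24000.0, "region-b": 6000.0, "region-c": 1000.0}
allocation = proportional_allocation(
    peak_load, per_instance_capacity=500.0, overprovision_factor=3.0
)
# allocation == {"region-a": 144, "region-b": 36, "region-c": 6}
```

Under a fixed model every region would receive the same instance count, so the busiest region would have far less headroom than the quietest. Here each region ends up with the same ratio of capacity to measured load.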
After the 1300s, a well-defended castle had two walls. If invaders breached the outer wall, there was another, higher one to contend with. Defenders would cascade arrows and hot tar on attackers, who were caught in the narrow area between the walls.
If standing provisioning is your outer wall, added capacity is your inner wall. No matter how well you’re provisioned, all it takes to exceed that is a sufficient level of traffic. Your added capacity can come from auto-scaling, from manually increasing capacity, or from scheduled capacity increases when you anticipate a change in load. It’s tempting to think auto-scaling is all you need here, but that has its limitations. Auto-scaling can bring up new instances in minutes, but a DDoS attack could be completely over in that time and have disrupted your availability. Auto-scaling is still plenty useful for longer-term load challenges, such as an unexpected increase in traffic or seasonal changes in load.
A popular event on Twitch can attract a million or more viewers. We’ll often receive an advance scaling advisory about such events and can increase capacity if we think it necessary. However, some traffic increases are surprises and we have to be ready for them. For Sage, we’re confident in our provisioning levels and service efficiency to weather short-term load surges. During prolonged load challenges, we’ll temporarily increase capacity once a certain threshold of service errors is reached. Any service errors that transpire before new capacity comes online are covered by the CDN’s Sage Origin Backup feature and don’t reach the customer.
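One way to express "increase capacity once a certain threshold of service errors is reached" is a sliding-window check like the sketch below. It is an assumption about how such a trigger could look, not Sage's actual alerting or scaling logic. The threshold, window length, and sample values are made up.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateTrigger:
    """Signals a capacity increase when the windowed error rate exceeds a threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        # Each sample is (requests, errors) for one interval, e.g. one minute.
        self.samples: Deque[Tuple[int, int]] = deque(maxlen=window)

    def record(self, requests: int, errors: int) -> bool:
        """Add one interval; return True if capacity should be increased."""
        self.samples.append((requests, errors))
        if len(self.samples) < self.samples.maxlen:
            return False  # wait for a full window so brief blips are ignored
        total = sum(r for r, _ in self.samples)
        failed = sum(e for _, e in self.samples)
        return total > 0 and failed / total > self.threshold


# Hypothetical settings: act when more than 0.1% of requests fail over 3 intervals.
trigger = ErrorRateTrigger(threshold=0.001, window=3)
decisions = [trigger.record(1000, errors) for errors in [0, 5, 5, 0, 0, 0]]
# decisions == [False, False, True, True, True, False]
```

Requiring a full window before acting reflects the distinction drawn above: short surges are left to standing provisioning and the CDN failover, while sustained error rates prompt added capacity.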
Efficiency played a major role in medieval warfare. The arrival of the longbow made a huge difference in the Hundred Years’ War, because it was faster to shoot and penetrated further, while also being easier and less costly to make.
Sage, a C# .NET Core service that runs on Linux, has always been regarded as an efficient service. Our senior software engineer, Felix Kastner, saw opportunities to gain further efficiency by eliminating redundant software layers and applying response compression in chunks. He theorized these optimizations would improve performance by at least 100%. The actual performance improvement was 5000%. This stellar 50-fold improvement in performance meant we could improve Sage’s capacity while reducing cost: with each instance able to carry more load, fewer are needed.
Medieval warfare was more sophisticated than is popularly depicted, and it required ongoing innovation to keep up with changing times. New weapons like the longbow and the cannon were highly disruptive, and had to be met with new defenses and strategies. City-states hired the best to devise weapons and defenses. None other than Leonardo da Vinci was employed as a military architect and engineer, where he invented war machines and defenses for Venice and Cesare Borgia.
Service availability requires the same commitment to innovation. Attackers don’t stand still, and neither can your defenses. The Sage Origin Backup innovation described earlier made 5 9’s of Deliverability possible. The Sage optimization innovation allowed us to increase capacity while reducing cost. We review operations activity weekly, asking ourselves what needs to be better. When a disruptive incident occurs, we fully analyze the cause of error and what needs to happen to avoid a recurrence.
When Sage was originally provisioned, there were far fewer AWS regions to choose from. During our analysis and proportional provisioning work, we had the opportunity to add regions in Europe. Our main motivation in adding new regions was reducing the burden on our busiest region, but since we had a choice of new regions we selected them strategically, hoping to see latency improvements. We added Sage presence in AWS Paris in order to improve the viewer experience in France, known to have good connectivity internally but poorer connectivity to other countries. We added presence in AWS Stockholm to be nearer Russia, a country with a large Twitch viewership. The latency impact of these changes was significant: latency in Europe decreased by 27%, and Russian latency decreased by 52%.
This illustrates the value of jointly considering areas like availability, latency, and security rather than treating them as isolated topics.
Treating high availability like security allows you to use threat modeling and other security tools to identify vulnerabilities and design defenses. Practice defense-in-depth to attain superior availability coverage.
Take full advantage of your CDN to filter malevolent traffic and build failover mechanisms. Have an alternative CDN ready as well. Test your failover mechanisms so you know they work.
Overprovision as a standing defense, and be ready to add more capacity when you need it. Regularly profile load and tune your allocation. Use proportional provisioning to balance allocation levels against cost. Optimize your service for added capacity at lower cost.
Measure availability metrics and invest in alerting. Tune your alerting to avoid both false alarms and unawareness. Don’t rely on your standing defenses: have automated and manual responses prepared.
Even with good protections in place, don’t become complacent about availability. Commit to ongoing innovation and improvement. Review operations, invent improvements after incidents, and repeat.
Upon reaching a consistent 3 9’s of availability, we gained the confidence to pursue 3.5 9’s, and after that 4 9’s. That might seem at odds with the conventional wisdom that each additional 9 carries a 10X increase in cost, but it was incremental work because our availability foundation was so effective. Much of the credit for that goes to the use of efficient microservices, minimal dependencies, and failover behaviors. Not compromising the availability provided by our underlying AWS services is more than half the battle in maintaining high availability.
The Sage availability incident that triggered all of this could have been met by a shrug of the shoulders; we could have increased server allocation in a couple of AWS regions and called it a day. Instead, we decided to see if we could raise the bar. This took us on an adventure that led to treating availability with the same rigor as security, better alerting, a powerful CDN failover mechanism, a 50X increase in service capacity, lowered cost, and latency improvements. That’s a win-win in our book.