We work on the Video Ingest team, a part of the Video Platform organization at Twitch. Our team develops the distributed systems, services, media formats, and protocols that (a) acquire live streams from the community of creators (“Contribution”), (b) perform real-time processing, such as video transcoding, on these streams (“Processing”), and (c) provide a high-throughput control plane for making them available for worldwide distribution at scale (“Playback Edge”), offering an end-to-end low latency video experience.
Twitch runs a large managed network and compute infrastructure to serve our creators and viewers all over the world. As our infrastructure has scaled up significantly over the years to keep pace with our community's growth, we encountered a challenging question: how can we use our compute and networking resources as efficiently as possible to meet future demand at a global scale while reducing operational costs? We discuss our approach to this question in this article.
To provide streaming services that are both high quality and low latency for creators and viewers, Twitch maintains nearly a hundred points of presence (PoPs) in different geographic regions around the world (learn more about how our Interactive Video Service works in another blog post). Each PoP is connected to our private backbone network, and live video streams received at the PoPs are then sent to the origin data centers for processing and distribution.
Origin data centers host the resources that are needed to provide computationally intensive media transformations, like video transcoding. Twitch has a global community of creators and viewers, so we have set up origin data centers all over the world to process our creators’ live streams and then distribute them to viewers via our backbone network. Using our backbone network to deliver live video traffic allows us to achieve a high availability, low latency, and high Quality of Service (QoS) experience for our users. This lets us overcome the instability of the public Internet, especially when delivering time-sensitive data like live video. A high-level summary of our infrastructure is shown in the following diagram.
Twitch started out with a single origin data center in Silicon Valley, which processed live video streams from all over the globe. All the PoPs ran HAProxy, free and open-source load-balancing software, which we used to terminate live video streams at the PoPs and then forward them to the origin data center for processing via our backbone network. This architecture worked well for us at this early stage of our evolution.
Since then, our creator and viewer populations have grown significantly, so we have built origins across the globe to increase our proximity to users, our processing capacity, and our system availability. As a result of this expansion, we had more flexibility to send live video streams to different origins for processing, but this also created a set of new challenges that did not exist in the single origin environment. We are going to discuss the challenges in the next section.
As we expanded our origin and backbone network footprint, we encountered a number of issues that led to suboptimal usage of our infrastructure.
1. Each PoP still ran HAProxy and statically sent live video streams to only one of the origins for processing. Since broadcast activity is highly cyclical (with a daily pattern), this led to inefficient utilization of our compute and network resources. For example, an origin ran at full load during the busy hours of its geographic region but sat largely idle outside that period. The following figure illustrates how different geographic regions can impact the utilization of our infrastructure.
The broadcasters in regions A and B went live at different times, and in each region the PoPs sent live video streams to only one of the origins. This led to inefficient utilization of our infrastructure throughout the day. In a nutshell, each of the origins had to be sized for its “regional” traffic peak, instead of our total capacity across all origins being sized for the “global” traffic peak; the latter is much smaller than the sum of all the regional peaks.
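The sizing argument above can be made concrete with a small sketch. The hourly demand numbers below are made up for illustration (they are not Twitch data); the point is only that peaks in different regions occur at different times, so the sum of regional peaks overstates the pooled capacity needed.

```python
def regional_peak_sum(curves):
    """Capacity needed if each origin is sized for its own regional peak."""
    return sum(max(curve) for curve in curves)

def global_peak(curves):
    """Capacity needed if origins are pooled and sized for the global peak."""
    return max(sum(samples) for samples in zip(*curves))

# Hypothetical hourly demand (in transcode "units") for two regions whose
# busy hours do not overlap.
region_a = [10, 40, 90, 100, 60, 20]   # busy mid-day
region_b = [80, 100, 50, 20, 10, 60]   # busy in a different window

print(regional_peak_sum([region_a, region_b]))  # 200
print(global_peak([region_a, region_b]))        # 140
```

With these illustrative curves, pooling origins cuts the required capacity from 200 units to 140.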
2. Given the relatively static nature of the HAProxy configuration, it was difficult to handle an unexpected surge of live video traffic (for example, due to a popular event) or react to the system fluctuations (like loss of capacity in an origin).
3. Origins have heterogeneous amounts of compute and network resources, which made it difficult to configure HAProxy so that the PoPs would send the “right” amount of live video traffic to each origin. Additionally, ingest traffic demand can be very volatile due to popular broadcast events.
4. The PoPs needed a way to detect a faulty origin and respond by sending new live video traffic to other healthy origins. This capability is critical to achieving high availability for our streaming services. Since an origin can become faulty for a variety of reasons, such as an outage in a control plane or a loss of backbone capacity, HAProxy was not able to detect and handle this kind of complex scenario.
5. To pave the way for our future growth and use cases, we would like to run custom software in our PoPs that enables us to support multiple media transport protocols from broadcasters to our edge.
To overcome the aforementioned challenges, better support the growing demand of live video traffic, and enable new use cases and services for our broadcasters, we decided to revamp the software architecture in our PoPs and completely retire HAProxy. We developed Intelligest, a proprietary ingest routing system, to intelligently distribute live video ingest traffic from the PoPs to the origins.
The Intelligest architecture consists of two components: the Intelligest media proxy, which runs in each PoP, and the Intelligest Routing Service (IRS), which runs in AWS. The following diagram shows a high-level summary of the Intelligest architecture.
The Intelligest media proxy is a data-plane component and serves as the “ingest gateway” to the Twitch infrastructure. The media proxy runs in all our PoPs, collectively called the “Intelligest edge.” The media proxy terminates live video streams from broadcasters, extracts relevant properties of these streams, and uses IRS to determine which origins the streams should be sent to. In general, the media proxy can terminate any number of media transport protocols (e.g., RTMP or WebRTC), allowing us to use a “canonical” protocol of our choice within the Twitch backbone infrastructure.
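The media proxy's control flow can be sketched as follows. This is an illustrative Python sketch, not Twitch's implementation: the `StreamProperties` fields, the `irs_client.route` call, and the origin names are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class StreamProperties:
    channel_id: str
    protocol: str      # e.g. "rtmp" or "webrtc"
    pop: str           # PoP where the stream terminated

def handle_ingest(stream: StreamProperties, irs_client) -> str:
    """Sketch of the proxy's flow: terminate the stream, extract its
    properties, ask IRS for a routing decision, then forward the
    (re-packaged, canonical-protocol) stream to the chosen origin."""
    origin = irs_client.route(stream)
    # ...forwarding of media bytes to `origin` would happen here...
    return origin

class FakeIRS:
    """Stand-in for the real IRS client, for illustration only."""
    def route(self, stream: StreamProperties) -> str:
        return "origin-us-west" if stream.pop.startswith("us") else "origin-eu-central"

print(handle_ingest(StreamProperties("chan42", "rtmp", "us-sea"), FakeIRS()))
```

Because IRS decides routing per stream rather than per static configuration, the proxy itself stays protocol-focused and simple.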
The Intelligest Routing Service (IRS) is the ingest routing controller and is responsible for making routing decisions for live video streams. Please note that the Intelligest architecture does not control or manage the L3 routing of the backbone network. IRS is a stateful service and can be configured to support rule-based routing to satisfy different use cases. For example, we can configure IRS so that live streams associated with certain video channels are always forwarded to a particular origin, no matter which PoPs the streams arrive at. This is useful for channels with special processing needs that can only be serviced in that origin. As another example, IRS can route multiple related streams to the same origin for processing; such a capability is useful in a premium broadcast scenario where there is a primary as well as a backup feed.
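The two rule types described above (channel pinning and related-stream affinity) could be sketched like this. The rule representation, field names, and first-match-wins ordering are assumptions for the example, not IRS internals.

```python
def route(stream, pinned, default_origin, affinity):
    """Toy rule-based router: pinned channels go to their designated
    origin; streams sharing a group (e.g. primary + backup feeds of one
    premium broadcast) are kept on the same origin."""
    if stream["channel"] in pinned:           # channel-pinning rule
        return pinned[stream["channel"]]
    group = stream.get("group")               # related-stream affinity rule
    if group is not None:
        if group not in affinity:
            affinity[group] = default_origin  # first stream fixes the origin
        return affinity[group]
    return default_origin

pinned = {"special-channel": "origin-a"}      # needs processing only origin-a offers
affinity = {}
print(route({"channel": "special-channel"}, pinned, "origin-b", affinity))   # origin-a
print(route({"channel": "c1", "group": "event-7"}, pinned, "origin-b", affinity))  # origin-b
print(route({"channel": "c2", "group": "event-7"}, pinned, "origin-b", affinity))  # origin-b
```

Both streams of `event-7` land on the same origin, which is what makes primary/backup failover within one origin possible.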
The above diagram illustrates a simple example. Broadcasters send their live video streams to a PoP where the Intelligest media proxy is running. The Intelligest media proxy queries the IRS service to get the routing decisions for the streams and then forwards them to the corresponding origins. The goal of the IRS service is to compute routing decisions in order to optimize a system objective. We will formulate our objective in the next section.
One of the goals of the Intelligest architecture is to better utilize the compute and network resources in our infrastructure. This translates to better system scalability and reduced operational costs.
We can formulate this objective as an optimization problem: maximize the ingest traffic demand served across all origins, subject to the compute capacity of each origin and the bandwidth capacity of our backbone links.
Because historical ingest traffic demand shows a strong cyclical pattern, solving this optimization offline is practical. We solve the problem offline using an optimization solver and then update the routing configuration in IRS. As a result, IRS can instruct the media proxy in each PoP to route ingest streams to appropriate origins so as to maximize the objective.
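To give a feel for the shape of the offline computation, here is a deliberately simplified allocation in Python. The production system uses a proper optimization solver; this greedy fill, and all the demand and capacity numbers, are illustrative assumptions only.

```python
def allocate(demand, capacity):
    """Toy offline plan: split each PoP's forecast demand across origins
    without exceeding any origin's capacity, filling the emptiest
    origins first. A real solver would optimize this globally."""
    remaining = dict(capacity)
    plan = {}
    for pop, need in demand.items():
        plan[pop] = {}
        # Prefer the origin with the most spare capacity at each step.
        for origin in sorted(remaining, key=remaining.get, reverse=True):
            if need == 0:
                break
            take = min(need, remaining[origin])
            if take > 0:
                plan[pop][origin] = take
                remaining[origin] -= take
                need -= take
    return plan

demand = {"pop-sea": 60, "pop-fra": 50}        # forecast peak units per PoP
capacity = {"origin-us": 70, "origin-eu": 50}  # compute units per origin
print(allocate(demand, capacity))
```

The resulting plan (which PoP sends how much to which origin) is what gets pushed into IRS as routing configuration.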
The following figure demonstrates the efficacy of Intelligest in the production setting. It shows the global compute utilization across all our origins over time. Intelligest is able to utilize 100% of the compute resources available to Twitch at the peak times.
Although this approach works reasonably well, it still has limitations: the routing configuration is computed offline from historical demand, so it cannot react in real time to capacity fluctuations in the origins or the backbone, or to unexpected traffic surges. This leads to a natural question: what can we do to overcome these limitations? This is the focus of the next section.
To address the limitations discussed previously, we developed two subsystems that track the utilization of compute and networking resources across the entire Twitch platform. Specifically, IRS is now able to query two services, Capacitor and The Well, both implemented in AWS. Capacitor monitors the compute resources in every origin and can detect capacity fluctuations (for example, compute resources going offline for maintenance), so we know how much processing capacity is available in each origin at any given time. The Well monitors the backbone network and provides information about the status of network links, such as their utilization and availability, at any given time.
With the information from both services, IRS has a real-time view of our infrastructure. IRS uses a randomized greedy algorithm to compute routing decisions to solve the resource utilization problem as formulated in the previous section.
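A randomized greedy decision of this kind might look like the sketch below. The data model is an assumption for the example: `free_compute` stands in for what Capacitor would report and `link_free` for what The Well would report, and the tie-breaking among the top candidates is one plausible way to add randomization.

```python
import random

def choose_origin(stream_load, origins, rng, top_k=2):
    """Randomized greedy sketch: keep only healthy origins with enough
    compute and backbone headroom, rank them by free capacity, then pick
    randomly among the top few so concurrent decisions spread out."""
    candidates = [
        o for o in origins
        if o["healthy"]
        and o["free_compute"] >= stream_load
        and o["link_free"] >= stream_load
    ]
    if not candidates:
        return None
    candidates.sort(key=lambda o: o["free_compute"], reverse=True)
    choice = rng.choice(candidates[:top_k])
    choice["free_compute"] -= stream_load   # account for the new stream
    choice["link_free"] -= stream_load
    return choice["name"]

origins = [
    {"name": "origin-us", "healthy": True,  "free_compute": 80,  "link_free": 90},
    {"name": "origin-eu", "healthy": True,  "free_compute": 60,  "link_free": 70},
    {"name": "origin-ap", "healthy": False, "free_compute": 100, "link_free": 100},
]
print(choose_origin(5, origins, random.Random(7)))  # never "origin-ap"
```

Note how the unhealthy origin is excluded up front; this is exactly the behavior that makes automatic failover to unaffected origins fall out of the routing decision itself.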
One situation where this has been very beneficial to us is origin failure scenarios. Typically, these involve a loss of compute or network capacity in an origin and are reflected in Capacitor or The Well. As a result, IRS can automatically direct new live video streams to an unaffected origin to achieve high availability for our broadcasters. Our architecture is also extensible to monitoring other failure scenarios, like an outage in a regional control plane, so that IRS will be able to detect and respond to a variety of important failure situations.
In this article we presented Intelligest, our ingest edge architecture to intelligently distribute live video streams from edge PoPs to origins. The Intelligest architecture is helping us with resource utilization and cost efficiencies as well as operational and availability improvements at large scale.
This architecture provides several other benefits as well. For example, given that Intelligest media proxy uses a modular software architecture in its implementation, it is easily extensible to provide a new streaming capability to our broadcasters. In such a case, a developer need only write a new “module” and plug it into the media proxy. This modular framework accelerates time to market for new streaming capabilities for our community and provides a unified production environment for running them. Our current roadmap involves developing new and exciting live media capabilities that are based on the Intelligest architecture.
Additionally, the Intelligest architecture also provides the capability to enable traffic prioritization at our network edges. Intelligest edge can mark live video streams with an appropriate priority, based on their properties. Then, our network backbone can utilize the L3 Differentiated Services capability to deliver video streams based on their set priorities. This makes Twitch’s infrastructure even more flexible in providing different services with the appropriate QoS characteristics, and positively impacts the viewer experience.
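As a small illustration of Diffserv-style marking, the sketch below sets the DSCP bits on an outbound socket. The priority classes and their mapping to DSCP values are hypothetical; the DSCP code points themselves (EF = 46, AF41 = 34) are standard Diffserv values, and `IP_TOS` is the standard socket option for setting them on IPv4 sockets.

```python
import socket

# Hypothetical priority classes mapped to standard DSCP code points.
PRIORITY_TO_DSCP = {"premium": 46, "standard": 34, "best_effort": 0}

def tos_for(priority: str) -> int:
    """DSCP occupies the upper six bits of the (former) IPv4 TOS byte."""
    return PRIORITY_TO_DSCP[priority] << 2

def mark_socket(sock: socket.socket, priority: str) -> None:
    """Mark outbound packets so backbone routers can apply Diffserv QoS."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_for(priority))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mark_socket(sock, "premium")
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 184 on Linux (EF << 2)
sock.close()
```

Once streams are marked at the edge, the backbone routers need no per-stream state: they simply honor the DSCP field when queuing and forwarding packets.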
This is just one project on our quest to improve Twitch's cutting-edge live video streaming technology. With our ever-increasing scale and new live media capabilities on the horizon, our team is always finding new challenges to tackle to continue improving the video experience for our community.
Want to join our quest to empower live communities on the internet? Find out more about what it is like to work at Twitch on our Career Site, LinkedIn, and Instagram, or check out our Job Openings and apply.