Twitch State of Engineering 2023
A Note From Our CTO Christine Weber
From day one, Twitch’s mission has been to empower live interactive communities. Whether they’re gaming, just chatting, writing music, or cooking, creators around the world have found audiences that support their passions and share their unique blend of interests. And anyone who’s tuned into a stream knows that the real magic happens in the moment-to-moment, never-to-be-repeated interactions between streamers and their communities. Behind the scenes, our teams work hard to make sure streamers and viewers have a high-quality experience. This blog will focus on many (but not all!) of the engineering teams that create the tools and services that help streamers and communities connect 24/7, 365 days a year. We’ll cover things like the chat experience that actually flows like a conversation, and our video distribution systems, which let you watch from anywhere at any time.
Many of these tools have grown and scaled to match the substantial growth of our community:
- Over 2.5 million viewers active on average
- Over 1.3 trillion minutes watched per year
- Over 7 million individual streamers going live each month
There are many tools, services, and people that come together to ensure Twitch works seamlessly, and we’re going to talk about a lot of them. We’ve organized this blog by team, and when possible, linearly, so you can get a better understanding of how Twitch works from the moment a stream starts, to how it arrives on your device. In many cases, multiple tools and services overlap or work asynchronously, so we’ve included those teams where it made the most sense. Feel free to read in order or jump to the team you’re most interested in reading about.
The orgs we’ll be highlighting today are:
- Video: One of the largest, low-latency, live video distribution systems in the world.
- Edge: Daily processing of billions of API requests to enable our community-building products
- Community: The tools and services that help our streamers keep their communities engaged
- Web and Client Applications: The foundation on which all of Twitch is built
- Community Discovery Machine Learning (ML): The tools that let us personalize experiences across Twitch
- Monetization: How we help our streamers thrive and grow their business
- Data Platform: Providing data products and infrastructure to support analytics and engineering
- Security: Maintaining community trust by ensuring our data is safe and our systems are secure
So if you want to see how our engineering teams come together to create the tech that makes a real impact on our streamers and viewers’ lives, read on!
Video
What We Do:
We build and improve Twitch’s comprehensive low-latency streaming solution, so our streamers can focus on creating engaging interactive experiences for their viewers. We also develop the tools and services that let other companies leverage Twitch’s live-streaming tech. In this section you’ll learn how streamers and viewers connect to our servers, how we distribute live streams around the globe, and how we save Videos on Demand (VODs).
How do you take hundreds of thousands of simultaneous live video streams and distribute them worldwide? The Video Ingest team builds the systems that receive all of our creators’ streams and prepare them for viewing. To do this, Twitch runs data centers around the globe–called edges–that connect to local Internet Service Providers (ISPs). Edges let streamers connect to our network quickly and ensure that they can share the best possible video quality. Once a streamer connects and begins sending video, the edge chooses one of our larger data centers, called Origins. These contain hundreds to thousands of servers dedicated to transcoding streams into different bitrates and formats for the Twitch Content Delivery Network (CDN). This ensures viewers anywhere and on any network can watch their favorite streams. Learn more about how this works in our blog post Ingesting Live Video Streams at Global Scale.
The next step is transcoding, which takes our creators’ video streams and converts them into formats that allow for a great playback experience, regardless of the viewer’s device capabilities and network conditions. So whether you’re on the bus and tuning in from a phone, or you’re at home on your laptop, you get the best quality stream. To do this, our transcoding system takes an incoming Real Time Messaging Protocol (RTMP) stream from the creator, and turns it into an HLS-compliant stream. This process creates several variations of the stream with a range of video resolutions and bitrates, allowing viewers to watch at the highest possible quality. Recently, we prototyped a new server that uses ASICs to transcode streams at over 10x the scale of our previous solution (and with better video quality!), which will help support our growing community of streamers and viewers.
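To make the multi-rendition idea concrete, here is a small sketch of what advertising several transcoded variants in one HLS master playlist looks like. The rendition ladder, bandwidth numbers, and URI layout are invented for illustration, not Twitch’s actual configuration.

```python
# Illustrative sketch: building an HLS master playlist that advertises
# several transcoded renditions of one stream. The rendition ladder and
# URI scheme here are hypothetical, not Twitch's actual configuration.

def master_playlist(channel: str, renditions: list) -> str:
    """Render an HLS master playlist (RFC 8216 syntax) for a channel."""
    lines = ["#EXTM3U"]
    for r in renditions:
        lines.append(
            f'#EXT-X-STREAM-INF:BANDWIDTH={r["bandwidth"]},'
            f'RESOLUTION={r["width"]}x{r["height"]}'
        )
        lines.append(f'{channel}/{r["name"]}/index.m3u8')
    return "\n".join(lines) + "\n"

# A made-up bitrate ladder: one entry per transcoded variant.
LADDER = [
    {"name": "1080p60", "width": 1920, "height": 1080, "bandwidth": 6000000},
    {"name": "720p30",  "width": 1280, "height": 720,  "bandwidth": 3000000},
    {"name": "480p30",  "width": 854,  "height": 480,  "bandwidth": 1500000},
]

print(master_playlist("examplestreamer", LADDER))
```

The player reads this playlist, then picks (and switches between) variants based on the viewer’s measured bandwidth.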
Once a stream arrives at an Origin and is transcoded, Video Distribution is in charge of ensuring it gets delivered to viewers as quickly as possible. We maintain a network of data centers distributed across the world, which are organized in a directed graph hierarchy we call a Replication Tree. When a viewer loads a live video page on Twitch, their request is routed to the nearest available datacenter, which forwards the stream request up the Replication Tree. This eventually reaches the Origin, which then returns the stream data.
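The request flow up the Replication Tree can be sketched as a chain of caching nodes: the edge asks its parent only when it doesn’t already have the stream, and caches the answer on the way back. Node names and the caching policy below are hypothetical simplifications.

```python
# Minimal sketch of request routing up a replication tree: a viewer's
# request hits the nearest edge, which forwards it toward the origin
# until some node already has the stream cached. Names are invented.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.cache = {}  # stream_id -> segment data

    def request(self, stream_id):
        if stream_id in self.cache:
            return self.cache[stream_id], self.name
        if self.parent is None:
            raise KeyError(f"stream {stream_id} not found at origin")
        data, served_by = self.parent.request(stream_id)
        self.cache[stream_id] = data  # cache on the way back down
        return data, served_by

origin = Node("origin-us-west")
mid = Node("mid-tier-sfo", parent=origin)
edge = Node("edge-sea", parent=mid)

origin.cache["examplestream"] = b"segment-bytes"
data, served_by = edge.request("examplestream")    # walks up to the origin
data2, served_by2 = edge.request("examplestream")  # now served from the edge
```

After the first viewer in a region requests a stream, later viewers nearby are served from the edge’s cache without touching the Origin again.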
VOD (Video on Demand)
Streamers know that their viewers can’t always watch greatness happen live, so they’ll record a Video on Demand (VOD) of their stream to save and distribute their content. If a streamer has enabled recording on their channel, our system stores a copy of all generated segments, thumbnails, and playlists in one of our Simple Storage Service (S3) buckets. Then, streamers can edit their VODs into smaller segments called “Highlights.”
VODs should load quickly and look good. We create playlists that feature multiple versions of every VOD, each at a different level of quality, and enable low-latency global playback.
The Metadata and Post Processing System keeps track of all VOD types and performs post-processing jobs, like users requesting to download their videos as mp4 files or uploading their videos.
Edge
What We Do:
We build the tools that bring our community-building products to life. We’ll share how our service allows developers to stay up-to-date with new changes, and how third-party applications receive notifications from our systems.
The Event Bus is an internal platform that simplifies the workflow required for developers at Twitch to publish and subscribe to real-time events quickly, reliably, and securely. The Event Bus introduces event schemas and tooling so subscribers don’t have to deal with breaking changes. Additionally, the Event Bus offers a client library that handles event publishing, encrypts and decrypts payloads for authorized field access, and multiplexes ingested events. Developers are able to create Amazon Simple Queue Service (SQS) queues that can be used to receive events. The Event Bus is currently responsible for publishing ~1660 events per second to Amazon Simple Notification Service (SNS) and is used by developers to create both internal and external features.
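The publish/subscribe-with-schemas pattern described above can be sketched in a few lines. This toy in-memory bus is only an illustration of the shape of the system: topic names and schemas are made up, and plain Python queues stand in for SQS.

```python
# A toy in-memory event bus: publishers emit schema-checked events on a
# topic, and each subscriber receives them on its own queue (standing in
# for SQS). Topic names and schemas are invented for the example.

from collections import defaultdict, deque

class EventBus:
    def __init__(self):
        self.schemas = {}                # topic -> required field names
        self.queues = defaultdict(list)  # topic -> subscriber queues

    def register_schema(self, topic, required_fields):
        self.schemas[topic] = set(required_fields)

    def subscribe(self, topic):
        q = deque()
        self.queues[topic].append(q)
        return q

    def publish(self, topic, event):
        missing = self.schemas[topic] - event.keys()
        if missing:  # schema check protects subscribers from bad events
            raise ValueError(f"event missing fields: {missing}")
        for q in self.queues[topic]:
            q.append(event)

bus = EventBus()
bus.register_schema("stream.online", {"channel_id", "started_at"})
inbox = bus.subscribe("stream.online")
bus.publish("stream.online",
            {"channel_id": "123", "started_at": "2023-01-01T00:00:00Z"})
```

Enforcing the schema at publish time is what lets subscribers trust that events won’t silently change shape underneath them.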
EventSub allows third-party applications to receive notifications for events in real time. The primary delivery method for these notifications is currently webhooks. These applications interact with EventSub through Twitch’s API, where they can create, get, or delete their subscriptions to particular Twitch events. Successfully-created subscriptions will contain callback information which instructs EventSub on where to send notification payloads; this information, along with other metadata, is stored in Amazon DynamoDB. Furthermore, EventSub hooks into the internal Event Bus to ingest events. If an event has a subscription that relies on it, EventSub builds a notification to be sent asynchronously to a Dispatcher via an Amazon SQS queue. This Dispatcher is responsible for fanning out these notifications to the proper recipients.
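Because webhook notifications arrive over the open internet, receivers need a way to check that a payload really came from the event system. The usual pattern (which EventSub also uses, with exact header details in Twitch’s docs) is an HMAC over the message id, timestamp, and body, computed with a shared secret. The sketch below simplifies the wire format.

```python
# Sketch of webhook signature verification: the sender signs
# id + timestamp + body with a shared secret; the receiver recomputes
# the HMAC and compares in constant time. Header/format details are
# simplified relative to the real EventSub protocol.

import hashlib
import hmac

def sign(secret: bytes, message_id: str, timestamp: str, body: bytes) -> str:
    mac = hmac.new(secret,
                   message_id.encode() + timestamp.encode() + body,
                   hashlib.sha256)
    return "sha256=" + mac.hexdigest()

def verify(secret: bytes, message_id: str, timestamp: str,
           body: bytes, signature: str) -> bool:
    expected = sign(secret, message_id, timestamp, body)
    return hmac.compare_digest(expected, signature)

secret = b"s3cre7-shared-with-subscriber"
body = b'{"subscription": {"type": "channel.follow"}}'
sig = sign(secret, "msg-1", "2023-01-01T00:00:00Z", body)
```

Including the timestamp in the signed material also lets receivers reject replayed notifications that are too old.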
Helix powers Twitch’s Application Programming Interface (API), which is used by over 25K third-party applications. Helix’s API endpoints enable these applications to integrate with Twitch features that have a huge impact on the streamer and viewer experience, including channels, streams, subscriptions, Bits, Hype Trains, raids, channel points, polls, predictions, schedules, Clips, Chat, moderation, Drops, and Extensions.
Community
What We Do:
We build the tools that help streamers attract, entertain, and recognize their viewers so they can keep their communities engaged. We will share how Twitch chat works, how we tailor recommendations based on location and time, and how we help smaller streamers get noticed.
Scheduling Relevant Streams around the World
Among the first things you see when visiting Twitch are featured and recommended streams and creators. We want scheduled streams to be relevant, engaging, and tailored to the interests of each of our viewers, regardless of when or where they’re watching Twitch. This means accounting for each user’s local time zone, UTC offset, and daylight saving time. We have to take special care to make sure that no matter where in the world a user is, they see the same number of upcoming streams as everyone else.
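The time-zone concern above largely reduces to a standard discipline: store scheduled times in UTC and convert at display time using each viewer’s IANA time zone, which already encodes both the UTC offset and daylight saving rules. A minimal sketch (the stream and zones are just examples):

```python
# Store schedule entries in UTC; convert per-viewer at display time.
# The zoneinfo database handles UTC offsets and daylight saving rules.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A hypothetical stream scheduled for 18:00 UTC on July 1, 2023.
stream_start_utc = datetime(2023, 7, 1, 18, 0, tzinfo=timezone.utc)

def local_start(viewer_tz: str) -> str:
    """Render the scheduled start in the viewer's local time."""
    local = stream_start_utc.astimezone(ZoneInfo(viewer_tz))
    return local.strftime("%Y-%m-%d %H:%M %Z")

tokyo = local_start("Asia/Tokyo")           # early morning the next day
new_york = local_start("America/New_York")  # afternoon, with DST applied
```

Note that the New York result reflects daylight saving time automatically because July falls inside EDT; no manual offset bookkeeping is needed.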
Discovering a new artist streaming or a special event can be challenging, especially if the creator doesn’t stream frequently. While stream schedules are available on each streamer’s page, there originally wasn’t a centralized place to find schedules by category and genre. Upcoming provides a platform for discovering some of the coolest upcoming music streams - from amazing songwriters and music producers to concerts and music talks. We’ve also begun experimenting with this feature in other categories, like esports.
Supporting Small Music Streamers
As a part of music discovery and creator growth, our team’s mission is to help smaller musicians on Twitch grow their channels. Getting started as a musician on Twitch can be challenging, so we’ve created the Support Small Music Streamers shelf to give extra promotion to the smaller streamers who are just beginning their journeys. This shelf provides a dedicated place for them to build their communities, get discovered, and grow. Under the hood, we combine live music category streamer data with data collected from Amazon Redshift to curate the streamers who appear on the shelf.
Community Insights builds and maintains services powering the Channel Analytics and Stream Summary pages, which provide the analytics that tell streamers how they’re doing on Twitch. Whether they’re looking at their most recent Stream Summary, comparing viewership and engagement month-by-month, or learning which emotes their viewers love, Insights empowers streamers to be the best they can be. We ingest over 10 million events per minute, providing real-time and historical metrics like average viewers, new follower counts, top referrers, and emote usage. Read our blog post for more on stream summaries.
Chat & Pubsub
Chat is a core part of the Twitch experience. It is what makes our streamers’ communities come to life, and helps create legendary moments. But you can’t have a conversation when it takes 30 seconds to see someone’s message. Chat consists of several highly scalable, real-time distributed systems which deliver hundreds of billions of messages per day to users chatting with their communities. We support Internet Relay Chat (IRC) as a communication protocol, which allows developers to create bots with custom chat features. On our end, here’s what happens when a user sends a message on Twitch:
- IRC Edge & Pubsub handle accepting and delivering chat messages. Edge accepts messages sent from clients and forwards them to Clue for processing; it also accepts messages for delivery from Pubsub and sends them to clients for reading.
- Clue & Zuma are services that evaluate viewer messages against existing rules, like the channel’s automod settings, to determine if a message should be delivered or rejected. Clue also processes requests for user notices that appear in chat, such as resubscription messages.
- Room & Chatters are services that let users see which chatters are connected to each channel.
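Since chat speaks IRC, the first job of any chat bot is turning a raw IRC line into a channel, user, and message. The parser below handles the basic PRIVMSG shape, including the optional IRCv3 tag block that Twitch lines carry; the example line and tag values are fabricated.

```python
# Minimal parser for an IRC PRIVMSG line of the shape used by Twitch
# chat. Real lines carry many more IRCv3 tags; this sketch only shows
# the mechanics. The example message below is made up.

def parse_privmsg(line: str) -> dict:
    tags = {}
    if line.startswith("@"):  # optional IRCv3 tag block, "@k=v;k2=v2 "
        raw_tags, line = line[1:].split(" ", 1)
        for pair in raw_tags.split(";"):
            key, _, value = pair.partition("=")
            tags[key] = value
    prefix, command, rest = line.split(" ", 2)
    user = prefix[1:].split("!", 1)[0]  # ":nick!user@host" -> "nick"
    channel, text = rest.split(" :", 1)
    return {"tags": tags, "user": user, "command": command,
            "channel": channel, "text": text}

msg = parse_privmsg(
    "@display-name=ExampleUser;mod=0 "
    ":exampleuser!exampleuser@exampleuser.tmi.twitch.tv "
    "PRIVMSG #examplechannel :Hello chat!"
)
```

A bot would feed each line it reads off the socket through a parser like this before deciding how to respond.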
The Pubsub systems connect back-end events to front-end clients to enable dynamic experiences without needing to poll for changes. Both the event-driven pattern and the connection from back-end services to front-end clients are key parts of building new client features while reducing scalability concerns.
- Pubsub Edge accepts clients’ requests to subscribe to message topics and delivers messages on those topics back to subscribed clients.
- Pubsub Control performs authorization checks when a client attempts to subscribe to a non-public topic.
- Pubsub Broker accepts requests from Twitch services to publish events to clients via the edge. When the broker receives a message, it fans the message out to all edges with a subscription to the message’s topic. Example events include: updates to viewer counts, hype train progress, and poll voting status.
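The three Pubsub roles above can be tied together in a small sketch: a Control-style check authorizes subscriptions to non-public topics, an Edge tracks which clients want which topics, and a Broker fans published events out to subscribed edges. Topic names and the authorization rule are invented for the example.

```python
# Toy version of the Pubsub pipeline: Control authorizes private-topic
# subscriptions, Edge tracks client subscriptions, Broker fans events
# out. All names and the auth rule are illustrative.

class PubsubEdge:
    def __init__(self):
        self.subs = {}       # topic -> set of client ids
        self.delivered = []  # (client, topic, payload) records

    def subscribe(self, client, topic, authorize):
        # non-public topics go through a Control-style authorization check
        if topic.startswith("private.") and not authorize(client, topic):
            raise PermissionError(f"{client} may not read {topic}")
        self.subs.setdefault(topic, set()).add(client)

    def deliver(self, topic, payload):
        for client in self.subs.get(topic, ()):
            self.delivered.append((client, topic, payload))

class PubsubBroker:
    def __init__(self, edges):
        self.edges = edges

    def publish(self, topic, payload):
        for edge in self.edges:  # fan out to every edge
            edge.deliver(topic, payload)

def control_authorize(client, topic):
    # stand-in rule: a client may only read its own private topic
    return topic == f"private.{client}"

edge = PubsubEdge()
edge.subscribe("viewer-1", "viewer-count.channel-42", control_authorize)
broker = PubsubBroker([edge])
broker.publish("viewer-count.channel-42", {"viewers": 1234})
```

Because the broker only needs to know about edges (not individual clients), adding viewers scales the edges rather than the publish path.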
Web and Client Applications
What We Do:
We help develop the platform that lets viewers actually experience Twitch! We’ll share how our website, Android, and iOS apps work.
Twilight is the desktop web application served from our website. It also encompasses the set of tools that enable us to efficiently and safely make changes to the site. Twilight can see as many as 140 developers contributing code in a single month, so it’s important to encode as many best practices into tooling as possible.
On a typical day, Twilight reports around 230 million page load events across the 80 or so pages that it comprises–mostly from the channel page.
Tachyon & the Living Room
Many of our web applications are part of a large monorepo called Tachyon, where they share over 70 packages to create a variety of unique experiences. Similar to Twilight, these applications are based on TypeScript and React, while also incorporating libraries like Next.js and Relay. The flexible Tachyon framework enables us to build speedy server-side-rendered applications like our mobile web site or tightly-scoped special-purpose ones like our new appeals site.
Tachyon is also home to our Starshot living room platform, which adds React Native into the mix to create a universal application suite. This platform combines web technology and native launcher applications to unlock rapid and consistent feature development across a wide range of targets. Starshot powered our Samsung, LG, and Switch launches, and we’re in the process of converting our existing applications to Starshot (PlayStation, Xbox, Android/Fire TV, and Apple TV).
On an average day, Twitch’s Android and iOS apps are opened millions of times from millions of unique devices. Both apps are native, and the majority of the client code is written by our engineers.
Our Android codebase is mostly written in Kotlin and our iOS codebase is written almost entirely in Swift. Both are maintained with an in-house framework. We have also been developing our own native UI libraries which adhere to the Twitch Design language system.
Community Discovery Machine Learning (ML)
What We Do:
We build the machine learning tools that make personalized recommendations, searches, and experiences for our users. Whether you’ve been on Twitch for 10 seconds or 10 years, these tools play a big role in your daily experience. This section covers how our Search function, recommendations, and underlying ML Infrastructure work.
The search team’s goal is to help viewers find content they want to watch and communities they want to join. With that goal in mind, in 2019, we set out to rebuild the entire search experience from the ground up. Prior to that we relied almost entirely on a third party to provide search capabilities to Twitch users.
Today, the search experience is powered by four primary components: an API, a retrieval system, a ranking system, and a content ingestion pipeline. The API orchestrates the flow of each request. It handles experimentation, event tracking, and calls to the retrieval and ranking systems. The retrieval system is built on Amazon OpenSearch Service and returns the best lexical matches for different types of content (live streams, channels, categories, and videos). The API then passes those results to the ranking system, which uses machine learning to re-rank the top results. The search team also maintains machine learning training pipelines to train and deploy the models used for ranking.
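The retrieval-then-ranking flow can be shown in miniature: a cheap lexical pass narrows the candidate set, then a ranker reorders the survivors. Here a toy scoring function stands in for the ML model, and the channel corpus is fabricated.

```python
# Miniature retrieval + re-ranking pipeline. The corpus, query, and
# "ranker" are all invented; a real system would call OpenSearch and an
# ML model instead.

CHANNELS = [
    {"name": "speedrun_central", "live": True,  "viewers": 1800},
    {"name": "speedrunner99",    "live": False, "viewers": 0},
    {"name": "cooking_with_kai", "live": True,  "viewers": 950},
]

def retrieve(query, corpus):
    """Lexical retrieval: keep channels whose name contains the query."""
    return [c for c in corpus if query.lower() in c["name"].lower()]

def rerank(candidates):
    """Stand-in for the ML ranker: prefer live channels, then viewers."""
    return sorted(candidates,
                  key=lambda c: (c["live"], c["viewers"]),
                  reverse=True)

results = rerank(retrieve("speedrun", CHANNELS))
```

Splitting the work this way keeps the expensive ranking model off the full corpus: it only ever scores the handful of candidates retrieval returns.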
Whenever you visit Twitch, you’ll see multiple live channels that we think you’ll enjoy. This is our recommendations system at work. Our goal is to personalize Twitch based on what you watch and generate the highest quality recommendations, so you can find your next favorite community. Any place where personalized recommendations show up on Twitch sees billions of requests a day!
None of this is possible without model training, deployment, and the generation of features at scale. To orchestrate the training and deployment of models, we utilize our own framework which leverages Apache Airflow. The trained model metadata is stored in Model Registry, a service which is responsible for registering and rolling out models. This enables us to serve the freshest recommendations. Additionally, Twitch created Machine Learning Feature Store (MLFS), a framework that enables the central storing and sharing of features by different teams across Twitch. Through datastores such as MLFS, our service TwitchModelBlender can access transformed data at request time.
Machine Learning Infrastructure (MLI)
How do you provide relevant and personalized recommendations to viewers, when their habits are constantly changing? That is where our Machine Learning (ML) algorithms come in! Our team builds scalable infrastructure and tooling that solves common problems across product teams. As part of the Community organization, we work closely with Recommendations, Search, Safety and other ML product teams.
Currently, the team owns a few core ML infrastructure components: Machine Learning Feature Store (MLFS), Real-Time Feature, Model Registry and Model Deployment System.
- MLFS is a feature managing and sharing framework. It is built in a federated fashion such that every team can operate their own MLFS instance and can also be extended to support other data backends.
- Real-Time Feature is a set of tools that helps developers easily build streaming features on top of AWS Kinesis data streams and make them available in MLFS within seconds.
- Model Registry stores the source of truth for metadata of all trained model instances and provides isolation between model training pipelines and the production services which depend on the trained models.
- Model Deployment System is a model CI/CD pipeline for real-time inference models, and it provides capabilities of integration testing, canary environment and automated rollout/rollback for safe and streamlined model deployment.
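The Model Registry idea from the list above can be sketched briefly: training pipelines register versions, deployment promotes them through stages, and serving asks the registry which version is in production rather than talking to pipelines directly. The stage names and model name here are illustrative.

```python
# Toy Model Registry: the source of truth for which trained model
# version serves production traffic. Stage names ("canary",
# "production") and the model name are made up for the sketch.

class ModelRegistry:
    def __init__(self):
        self.models = {}  # name -> {version: stage}

    def register(self, name, version):
        self.models.setdefault(name, {})[version] = "registered"

    def promote(self, name, version, stage):
        assert stage in ("canary", "production")
        self.models[name][version] = stage

    def production_version(self, name):
        """Newest version marked production; serving reads only this."""
        versions = self.models.get(name, {})
        for version, stage in sorted(versions.items(), reverse=True):
            if stage == "production":
                return version
        return None

registry = ModelRegistry()
registry.register("recs-ranker", 1)
registry.promote("recs-ranker", 1, "production")
registry.register("recs-ranker", 2)
registry.promote("recs-ranker", 2, "canary")  # not yet serving everyone
```

The isolation matters: training can register version 2 and canary it without production serving ever noticing until the promotion lands.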
We continue to innovate on our machine learning infrastructure so that we are able to carefully orchestrate a continuous cycle of learning, inference, and observation for ML teams at Twitch.
Monetization
What We Do:
We build, develop, and scale the tools that let viewers support their favorite streamers, so that they can earn a living doing what they do best. This section covers how Affiliates and Partners get paid and how Ads on Twitch work.
The Commerce organization has over 40 microservices that combine to enable purchases on Twitch.
The process starts when a creator is onboarded to the Affiliate & Partner program: we take creator information and integrate it with our Amazon systems and third-party vendors for bookkeeping and verification.
There are many locations on Twitch where users can make purchases, and when a purchasable Product and its Benefits are shown to the user, we check to ensure purchase eligibility. The Purchase Orchestration Service integrates with multiple payment vendors to provide a convenient checkout experience with a variety of payment options.
After a purchase has been made, the Purchase Orchestration Service finds the service responsible for the product line (we call them Tenant services) and issues a fulfillment request.
Our work isn’t complete until streamers are paid out! Purchase data attributing revenue to our streamers is collected from different revenue streams and services, using a combination of pull and push patterns, to calculate creator earnings. This data is exposed on the creator revenue dashboard page and passed on to our finance and accounting teams for review. Finally, after the review concludes, our third-party vendor handles transferring money to streamers’ bank accounts.
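Boiled down, the earnings step is an attribution-and-aggregation job: purchase events from several revenue streams are grouped by creator and summed. The sketch below invents the purchase records, and the flat 50% revenue share is an example number, not Twitch’s actual split for any product.

```python
# Illustrative earnings aggregation: attribute each purchase to a
# creator and sum their share. The purchases and the revenue_share
# value are fabricated examples.

from collections import defaultdict

PURCHASES = [
    {"creator": "streamer_a", "product": "sub_tier1", "gross_usd": 4.99},
    {"creator": "streamer_a", "product": "bits",      "gross_usd": 10.00},
    {"creator": "streamer_b", "product": "sub_tier1", "gross_usd": 4.99},
]

def creator_earnings(purchases, revenue_share=0.5):
    """Sum each creator's share of gross revenue (example split)."""
    totals = defaultdict(float)
    for p in purchases:
        totals[p["creator"]] += p["gross_usd"] * revenue_share
    return dict(totals)

earnings = creator_earnings(PURCHASES)
```

A production system would track money in integer minor units and per-product splits, but the pull-events-attribute-sum shape is the same.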
Personalization & Fraud Detection
Twitch generates billions of data events every day and our teams have specialized in building data pipelines and machine learning models to leverage that information. These teams empower a more intelligent, personalized, and secure environment for our users. Their work enables Twitch to select gift subscription recipients who are more likely to be interested, configure the most engaging Hype Train settings, and identify instances of fraud.
The Ad Product Engineering team builds and maintains the technology that powers the advertising experience on Twitch. We bring viewers, streamers, and advertisers together in order to give viewers a way to support their favorite streamers, and to provide streamers with additional revenue sources that enable them to earn a living doing what they love. We obsess over how to do this in a way that is authentic to the streamers’ communities and minimizes disruption to the viewer experience. To do this, we own a broad range of technologies including:
- Client implementations for rendering ads: We build many of the components that are needed to render the ad experience. This includes video player components on web and mobile devices, counters, transition elements, and our Player by Player solution that keeps the stream visible while an ad plays.
- Service Layers: We employ a robust set of services that evaluate opportunities to show ads and identify the best set and types of ads to show. We leverage many AWS solutions including: Amazon EC2, AWS Lambda, Amazon DynamoDB, Amazon Redshift, Amazon S3, and Amazon Kinesis.
- Building tools and integrations with third parties: We build brand safety, Ad fraud prevention, and measurement tools, and integrate with industry leaders to ensure every dollar an advertiser spends goes to legitimate streamers who are working hard building their communities.
- Ad Server Tech: We partner closely with Amazon and the Amazon Ad Technology teams to implement some of the most complex and advanced advertising technology solutions in the industry.
Data Platform
What We Do:
The Data Platform organization develops and operates the platform that powers Twitch’s data systems and decision-making. We also make sure that each team at Twitch has the tools to apply and transform this data and make the best decisions and products possible. We own the main data lake and decentralized data warehousing infrastructure at Twitch, which, collectively, are known as Tahoe, after one of the largest lakes in California.
- Tahoe consists of a central batch-ingestion API, which transforms and stores data in Amazon S3, and over 100 self-serve Amazon Redshift clusters that teams can spin up to query this data via AWS Glue Data Catalog and Amazon Redshift Spectrum. This decentralization ensures that teams aren’t competing for resources like you might see on a more centralized platform. Tahoe currently contains over 100 petabytes, despite daily compaction and deletion of unnecessary data.
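To make the self-serve model concrete, here is what the query path looks like from a product team’s side: data lands in S3 under date partitions, gets registered in the Glue Data Catalog, and any team’s Redshift cluster can scan it through a Spectrum external schema. The bucket, table, schema, and column names below are all invented.

```python
# Sketch of the Tahoe access pattern. The S3 layout, external schema
# name ("spectrum_lake"), and table/columns are hypothetical examples.

def partition_path(bucket: str, table: str, day: str) -> str:
    """S3 key a batch-ingestion job might write (Hive-style partition)."""
    return f"s3://{bucket}/{table}/dt={day}/part-000.parquet"

def spectrum_query(table: str, day: str) -> str:
    """A query a team could run from its own self-serve cluster."""
    return (
        f"SELECT channel_id, COUNT(*) AS views "
        f"FROM spectrum_lake.{table} WHERE dt = '{day}' "
        f"GROUP BY channel_id;"
    )

path = partition_path("example-tahoe-bucket", "minute_watched", "2023-01-01")
sql = spectrum_query("minute_watched", "2023-01-01")
```

Because each team queries through its own cluster against shared S3 data, a heavy scan by one team doesn’t starve another of compute.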
Most of Data Platform’s other offerings are built on top of this core service, like our near-real-time streaming ETL:
- Spade ingests an average of almost 3 million events per second into our data lake. Spade can also publish to user-provided Amazon Kinesis streams to allow time-critical applications to stream data directly, instead of waiting for it to be available in Tahoe.
We also own a self-serve aggregate reporting service, Sheik, which enables some of the most crucial reports at Twitch.
- Sheik provides an easy way for data staff to run aggregate reporting by managing and computing a complex directed acyclic dependency graph of over 1000 queries on a daily basis. It also manages the backfilling process to facilitate historical insights.
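Sheik’s core job can be shown in miniature: given a dependency graph of queries, run them in an order where every query’s inputs are computed first. Here Python’s `graphlib` does the topological ordering and stand-in “queries” just record that they ran; the report names are invented.

```python
# Toy version of a Sheik-style run: topologically order a DAG of
# dependent queries and execute them so inputs always exist before the
# queries that need them. Report names are fabricated.

from graphlib import TopologicalSorter

# query -> set of queries it depends on
DAG = {
    "daily_viewers":   set(),
    "daily_revenue":   set(),
    "creator_summary": {"daily_viewers", "daily_revenue"},
    "exec_report":     {"creator_summary"},
}

def run_reports(dag):
    order = list(TopologicalSorter(dag).static_order())
    results = {}
    for query in order:
        # every dependency must already be computed at this point
        assert all(dep in results for dep in dag[query])
        results[query] = f"computed:{query}"
    return order, results

order, results = run_reports(DAG)
```

Backfilling fits the same machinery: rerunning one upstream query for a past day re-triggers exactly its downstream subgraph.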
Finally, we have a reporting service built on Apache Druid, schema management services, an experimentation service, and a number of tools for effectively interfacing with our data ecosystem.
Security
What We Do:
We build the tools that keep our systems secure and our data safe, so that our community’s information stays protected.
AWS Security Compliance
Twitch develops features fast, and Security needs to detect insecure AWS configurations just as quickly. As of 2022, Twitch has over 2000 AWS accounts containing production services, legacy services, features in development, and abandoned features. To help with compliance in this complex environment, Twitch Security has developed its own in-house data lake. Any security compliance campaign is just an SQL query and a day’s setup away.
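What “a compliance campaign is just an SQL query away” means in practice: scan an inventory of resource configurations and flag the non-compliant ones. In this sketch a plain Python filter stands in for the SQL over the security data lake, and the account numbers and resources are fabricated.

```python
# Stand-in for a compliance query over a security data lake: find
# resources violating a policy. The inventory rows are made up, and a
# Python list comprehension replaces the real SQL engine.

INVENTORY = [
    {"account": "111111111111", "resource": "bucket-a",
     "public_access_blocked": True},
    {"account": "222222222222", "resource": "bucket-b",
     "public_access_blocked": False},
    {"account": "222222222222", "resource": "bucket-c",
     "public_access_blocked": True},
]

def noncompliant_buckets(inventory):
    """Equivalent of: SELECT * FROM inventory WHERE NOT public_access_blocked."""
    return [r for r in inventory if not r["public_access_blocked"]]

findings = noncompliant_buckets(INVENTORY)
```

With thousands of accounts feeding the lake, the same one-line predicate scales from a spot check to a fleet-wide campaign.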
For security campaigns that affect over 1000 resources, Twitch Security has developed two major capabilities to reduce toil. We heavily utilize AWS CloudFormation StackSets, deployed across our AWS Organization, to roll out roles that allow our InfoSec team to centrally patch insecure configurations automatically. For configuration that lives in Infrastructure as Code, we develop automatic changes to our service teams’ libraries.
This post was put together by Avery Gnolek and Asa Witt. While it was impossible to include every team at Twitch, we’d like to thank all of the engineers, managers, product managers, and directors who helped author and review this post: Benjamin Hirsch, Ryan Bickhart, Samantha Briasco-Stewart, Gowdish Kumaraswamy, Igor Pogorelskiy, Jordan Potter, Chloe Shiau, Ishneet Kaur, David Wixon, Matt Walsh, Cameron Martin, Risto Keravuori, Sylvia Han, Andrew Millman, Jacques-Olivier Bernier, Farhan Zaidi, Nanki Grewal, Marinette Chen, Chen Liu, Srinivasa Kolluri, Leo Sun, Robert Ricci, Geoffrey Chan, Efoe Adjakly, and Angelo Paxinos.