As front-end developers, we don’t often hear about Chaos testing. A project related to Twitch front-end availability led us to research the exciting field of Chaos engineering, pioneered by Netflix. Chaos engineering is the science of optimizing the resiliency of a software system by simulating failures and measuring the impact of these failures on the system. These simulations help anticipate real world issues before they happen, and ensure our systems can degrade gracefully. The practice is commonly leveraged for backend and distributed systems, and more and more tooling is being developed in this space.
On the front-end however, we haven’t observed much discussion or activity from web or mobile communities to ensure our clients are as resilient as they should be.
The ultimate goal for us was to be able to answer this question: “If this part of our overall system (e.g. backend service, third-party API) fails, how does our front-end behave and what do our end users see?”
At Twitch, our data is served by an ever growing collection of microservices (hundreds at the time of writing) that depend on each other. These services are abstracted away from our front-end clients (website, mobile apps, console apps, desktop) via a single GraphQL API.
The most common failure that happens in a system like ours is one of our microservice errors out and fails to serve its portion of the data. In this case, GraphQL does a great job of forwarding partial data to the client when configured properly. From there, it’s the client’s job to handle this partial data gracefully and provide the best degraded experience as possible.
With that common use case in mind, the goal became to find a scalable way to test these scenarios and observe the behavior of each client.
Let the Chaos engineering begin! First, we need a way to simulate failures reliably.
The first option considered was intercepting and altering the network responses received. A chaos interceptor would add the ability to inject errors in the GraphQL responses. These errors could be predetermined, or randomly injected.
While exploring this option, we realized that since the vast majority of our API calls go through GraphQL, we could potentially pass a special instruction with our GraphQL calls to simulate certain failures at the GraphQL resolver side. Luckily, it turns out that this capability was already implemented during one of our Hackweeks, and aptly named Chaos Mode.
Chaos Mode adds the ability to pass an extra header to our staging GraphQL calls. Within that header, we can pass the name of one or more services that we wish to simulate failures for. Our staging GraphQL resolver reads this header and short circuits any internal call to those services.
Once we had a way to simulate failures, the next step was to automate the process. At Twitch we make heavy use of automation to test our front end user flows. Those are scripted tests that simulate users manipulating our website or our apps, verifying that everything shows up as expected. We realized that it would be fairly easy to automate our failure simulations by running our core tests with the special Chaos Mode header set. This setup can tell us if a particular service failure breaks those core flows or not, which is getting very close to our goal!
After having a quick prototype running on Android, we found that there were a couple of missing things to make this scalable:
In order to address the first point, we needed a way to “trace” the GraphQL calls from the client and record which internal services get hit for a certain user flow. We solved this by using another debug header in our GraphQL calls, which enables tracing at the GraphQL resolver layer. The resolvers then record any method call done to its internal service dependencies, and send the information back to the client in the same GraphQL call. From there, the client can extract the service names that were involved in a set of GraphQL calls, and use that as the input for our Chaos Testing suite.
We now have a two step process:
With this setup, we now have an automated way to discover services and then force them to fail in a consistent, reproducible manner. We can then scale this up to work with any number of automated tests.
With this system in place, we went on to extract and record interesting information that will describe the user experience during these forced failures.
We use the test results for a single user flow (ie. navigating to a screen/page, watching a stream, sending a chat message, etc) that runs n times - once for each service failure - and use them to calculate a resilience score for that particular test. A simple percentage number that represents how many times the user flow was successful despite the ongoing service failure. A global resilience score is then calculated for the client, giving a high level overview of how sensitive each client is to failures.
We can use these scores to track the resilience of our user flows over time, measure the improvements we make to our code and also make sure we don’t accidentally regress. We package these scores and other useful information and upload it to a server so we can visualize it all in a web dashboard.
During the test, we capture API calls, extract useful information like which particular GraphQL had errors, what field inside of that query caused the error and we also capture screenshots of the expected state and error state. All this information is helpful to diagnose what exactly happens to the clients for a given failure.
Using this dashboard, we identified a big gap between the Android and iOS resilience scores. Android was showing an overall score of 82% whereas iOS was only at 64%. Using the debug information provided, the iOS team identified the root cause: an over-defensive handling of GraphQL errors deep in the network stack. The team was able to test their fix by re-running the automated chaos tests, and verified that the fix significantly increased resilience, boosting the iOS resilience score to 84%.
Right now we run this tool every night on our Android, iOS, and Web clients. This gives us insight into our current frontend resilience to backend service failures. We can now track how our resilience scores change over time, and easily test failure scenarios that were difficult to test in the past.
We plan to expand the coverage by adding more user flows to our automated tests, beyond our core flows. Adding a new test in the system is as easy as writing any automation test for the platform, and we’re currently working on adding mobile web as one of the target platforms.
We also plan to improve Chaos Mode and the ability to trace service calls to address the current limitations. The highest priority is the ability to trace and fail secondary service calls (service to service interactions). Other features we’re considering is to be able to fail individual service calls and the ability to fail multiple services at once.
The discipline of Chaos engineering might be established for backend distributed systems, but it feels like there’s still a lot to explore for the front end! It’s been amazing to see these tools bringing engineers from different fields together, collaborating and giving them much deeper visibility into our system end-to-end.