Squash Flaky Tests Like The Bugs They Are
As a software engineer working with continuous integration (CI) systems at multiple companies, I experienced first-hand how frustrating flaky tests are. Just when you’re ready to merge a cool new feature, a test unrelated to your change fails, causing the entire CI build to fail.
Flaky test failures happen all too often, forcing developers to retry their entire CI pipeline, sometimes more than once! Retries can delay code merges by hours. If the delay leads to a merge conflict, the process begins all over again.
Over time, spurious test failures lead developers down a dark path, working around the issue by reflexively retrying CI failures or even preemptively running multiple CI attempts in parallel. The end result is plummeting productivity, skyrocketing CI costs, and frustrated developers considering other employment opportunities.
Flaky tests, like bugs, are inevitable
Regression tests are an invaluable tool to ensure that new features, code refactoring, or other changes don't break existing functionality. The more comprehensive a test suite, the more confident developers can be that a code change is safe.
Unfortunately, the more tests we write, the greater the odds of introducing some that are flaky (i.e., that unexpectedly fail sometimes without any related code changes). While it's tempting to think that flakiness can be completely avoided by following certain best practices, large suites of ever-evolving tests are destined to contain at least a few flaky tests. Google, for example, found that almost 16% of their tests exhibited flakiness. Spotify reported 6%.
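To make the failure mode concrete, here is a contrived Python sketch (not drawn from any real codebase) of the classic timing-dependent flake: an assertion on wall-clock duration passes or fails depending on machine load, even though the code under test is correct. All names and thresholds here are illustrative.

```python
import random
import time

# A timing-dependent test: it passes only when the simulated work
# happens to finish within an arbitrary 10ms deadline, so it fails
# intermittently on a loaded CI machine even though nothing is broken.
def flaky_duration_test(work=lambda: time.sleep(random.uniform(0, 0.02))):
    start = time.monotonic()
    work()
    elapsed = time.monotonic() - start
    assert elapsed < 0.01, f"operation took {elapsed:.3f}s"

# A deterministic alternative: assert on observable behavior rather
# than on wall-clock timing, which the test does not control.
def stable_result_test(work=lambda: sum(range(100))):
    assert work() == 4950
```

The deterministic version will pass on every run; the timing-based version will pass on most runs and fail on the rest, which is exactly the behavior that erodes trust in a CI pipeline.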
Experienced developers understand that despite their best efforts, any complex system they build will contain bugs, and that fixing them all immediately does not scale. This realization doesn't mean caring any less about code quality. Rather, it unlocks a range of solutions that help to mitigate the impact of those bugs.
For example, developers learn to code defensively so that unexpected errors in one part of a system don't cascade into a widespread outage or data corruption. We use modern observability tools to understand how production systems behave and, crucially, misbehave. Perhaps most importantly, we use bug trackers to systematically prioritize and assign bugs.
Flaky tests are bugs too, and we should treat them accordingly. This means tracking flaky tests so that they can be fixed and mitigating their impact in the interim.
The test that cried wolf
Continuous integration systems exist to help developers move quickly and safely. Unfortunately, a few flaky tests can easily destroy confidence in a team's CI pipeline. I've experienced this first-hand, and perhaps you have too.
When a test causes a CI failure, it takes time to examine the log output and determine the root cause. If a developer finds that the cause is unrelated to their code change and is instead due to an unreliable test, they're likely to be frustrated. The frustration is compounded if the failing test lives in a part of the codebase they're unfamiliar with, perhaps owned by another team, that they're not even empowered to fix.
This situation leaves developers with two options:
- Retry the entire CI pipeline
- Ignore the failure and merge the code change
Developers are justifiably wary of merging broken code, so we typically retry the CI build, hoping that the flaky test will pass the next time around. Unfortunately, most CI systems don't let you retry individual tests, so this retry probably includes a full build and every other test as well. The larger the codebase, the longer the retry will take. Time to grab a coffee!
Over a period of weeks or months, the process above can repeat dozens of times. Eventually, developers get so burned by the time they spend examining test logs that many stop bothering. Instead of looking into why a test failed, they reflexively retry any failed CI build. Only after a build fails multiple times is it worth their time to look into the cause. This is one of the first signs that flaky tests are having a serious impact on a team.
At this point, it may be tempting to disagree and assert that developers should always look at test failures, shouldn't blindly retry CI builds, and should fix broken tests immediately. However, developers are human, and we're affected by a variety of organizational pressures such as tight product deadlines. A more pragmatic approach than scolding developers is to understand why they're taking certain actions, and to address those causes.
Eventually, if flaky tests continue to cause CI failures, exasperated developers may choose the riskier option above: merging code despite test failures. However, some of these test failures will inevitably be caused by the code change itself, and the merged code will either break the CI build for the whole team or introduce bugs into the product. Once this happens, a CI pipeline has failed to accomplish its purpose: ensuring that only safe code can be merged or deployed to production.
If the arc of this story sounds familiar, perhaps that's because it so closely resembles Aesop's Fable The Boy Who Cried Wolf. The fable teaches us that organizations would be wise to avoid CI pipelines that routinely issue false alarms about the safety of a code change.
Mitigating test flakiness
The most common approach I've seen for managing flaky tests is to do nothing. Sometimes flaky tests are easy to fix, but proper fixes often require significant time investments that compete with tight product deadlines and bugs that affect end-users. Developers are hesitant to disable flaky tests for fear that they'll lose test coverage. They're also concerned that disabling a flaky test means that no one will ever fix it.
However, a flaky test has already reduced test coverage, since developers ignore its failures, and the loss of confidence can easily splash onto other, non-flaky tests. In other words, flaky tests add negative value to a CI pipeline. For this reason, it's critical to fix flaky tests, and to eliminate their negative value until they can be fixed.
This realization has led many industry-leading software companies to build automated systems for tracking and quarantining (i.e., disabling) their flaky tests at scale.
All of these systems involve both tracking flaky tests as bugs and quarantining them to protect the CI pipeline in the interim. Prioritizing fixes for these tests requires organizational incentives and a strong company culture, rather than relying on developer frustration as a forcing function for test improvements.
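As a minimal illustration of the quarantine idea, here is a hypothetical Python sketch of a test runner in which failures from quarantined tests are recorded for follow-up but no longer fail the build. The test names and the hand-maintained quarantine list are invented for this example; real systems track flakiness and manage the quarantine automatically.

```python
# Hypothetical quarantine list; a real system would maintain this
# automatically based on observed flakiness.
QUARANTINED = {"test_payment_timeout"}

def run_suite(tests):
    """Run tests (a mapping of test name -> zero-arg callable).

    Failures from quarantined tests are tracked separately so they
    can still be triaged and fixed, but they don't break the build.
    """
    hard_failures, quarantined_failures = [], []
    for name, fn in tests.items():
        try:
            fn()
        except AssertionError:
            if name in QUARANTINED:
                quarantined_failures.append(name)
            else:
                hard_failures.append(name)
    return {
        "build_passed": not hard_failures,
        "hard_failures": hard_failures,
        "quarantined_failures": quarantined_failures,
    }
```

In practice, a similar effect can be achieved with a test framework's built-in mechanisms, such as marking known-flaky tests as expected failures while they await a proper fix; the essential point is that quarantined tests remain visible and tracked rather than silently deleted.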
Unflakable, in a nutshell
While large software companies can afford to have dedicated developer tools teams, smaller companies and startups typically can't. This gap is what inspired me to build Unflakable—so that every team of developers can access purpose-built tooling to solve their flaky test problem.
We'll be adding lots of exciting new features in the months to come, including integrations with common bug trackers and team chat applications. We'll also be adding support for other languages and test frameworks.
Need support for a particular test framework? Let us know!