Why The Tests Are Randomly Failing

First it passes, no code is changed, the suite is rerun, and it fails.

Also called “flaky tests” or “flickery tests”, these randomly failing tests can plague projects for years and sometimes never get fixed. Their frustrating existence leads developers to feel helpless and apathetic when they want to deliver quality software quickly. Yet the question remains: why are the tests still broken?

The Reasons Why

Fixing it is not a priority

This isn’t a bad thing in every business case. Resources are limited, businesses have specific goals, and if fixing the tests won’t improve the business, there’s little reason to prioritize it. Yes, not having working tests can be the root cause of many issues, but is that fact clear to the decision makers, or only to a few contributors close to the issue?

Spending time fixing tests means time not spent on features. If the decision makers aren’t presented with a case (and a plan for what to do about it), don’t expect a sudden change of heart. If they have no information about the cost to fix, the cost of not fixing, or any future benefit, how can an intelligent decision be made?

Random was a feature, now it’s a bug

At first it seems like an entirely reasonable idea and a great way to get coverage on many use-cases: use randomly generated fixtures or random walks through the code. There are three general reasons that “random” is causing tests to fail: (i) the randomly generated data is exposing real errors in the code, (ii) the random generators are producing invalid data, or (iii) the test setup silently assumes all random data will be unique, but it isn’t (for example, you didn’t expect the random data builder to use the same email address for two different users).
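
To make reason (iii) concrete, here is a hypothetical sketch (the names and the tiny value pool are made up for illustration) of how “random” data that isn’t guaranteed unique can collide and break an otherwise correct test:

// Hypothetical example: random data drawn from a small pool is not unique.
const names = ['alice', 'bob', 'carol'];

const randomEmail = function() {
  const pick = names[Math.floor(Math.random() * names.length)];
  return pick + '@example.com';
};

const userA = { email: randomEmail() };
const userB = { email: randomEmail() };
// Roughly one run in three, userA and userB share an email, and any
// "email must be unique" rule in the system under test will reject userB.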

Sequencing issues

Things simply aren’t happening in the right order every time. It’s likely there’s a sequential dependency somewhere that causes a race condition. Technical choices can lure teams into a false sense of security: I’ve seen race conditions in PHP tests (a threadless language) because they ran through an Apache server with a pool of workers, which made RPC calls to another Apache pool of PHP workers. Simply changing the Apache worker pools could change the test results.
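
As a hedged illustration (not the PHP/Apache setup described above, just a minimal JavaScript sketch), here is a sequencing bug in its simplest form: the assertion races against an asynchronous write, so the outcome depends on which one finishes first:

// Illustrative only: the check races against the asynchronous save.
const db = [];

const saveUser = function(user) {
  // Finishes "eventually", somewhere between 0 and 10 milliseconds later.
  setTimeout(function() { db.push(user); }, Math.random() * 10);
};

saveUser({ name: 'alice' });

setTimeout(function() {
  // Sometimes the save has completed by now, sometimes it has not.
  console.log(db.length === 1 ? 'PASS' : 'FAIL');
}, 5);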

Flaky Infrastructure

When the test infrastructure itself is a problem, the tests running on it inherit those issues. For example, your tests may call a third-party “API test server” that fails occasionally. Sometimes it’s a database running expensive queries for a test. It can even be the test runner itself failing to install a VM image (I’ve seen Travis-CI do this, for instance).

Knowing the issues is only enough to help pick a direction, so let’s look at some techniques to really fix these problems.

How To Fix Priority

Tests can’t be fixed unless all members of leadership buy in to the need. Yes, a lone-wolf developer can sneak in a day of rapid fixes that might do some good, but that hard work is quickly undone “when push comes to shove” a month or two later by other teams who need to release software. We need to fix the issue globally in the organization and for future development, not in isolated patches. We have to fix the source of problematic states, not an individual state.

Bad tests don’t stay in projects “just because”. The tests are broken because the existing process of writing software permits broken tests. The people who approve code accept broken tests.

Typical internal stakeholders of quality are Product Managers, Developers, and Quality Assurance. Your organization might combine some of these roles into one person, but what’s important to remember is that apathy towards quality testing can come from any one of these roles.

In one organization I witnessed a QA team of over 20 people that obscured how many failing tests a release had. Anything above 75% of automated tests passing was considered releasable. This process ran for a few years before developers understood the release criteria. As a thought experiment: Think about what kind of culture and pressure from management could lead to this happening in your own organization.

This is not a technical challenge, it’s a soft-skill challenge. The biggest traps here are spending your time on these tasks: (i) preparing a list of broken tests, (ii) finding specific examples of work that was slowed down by broken tests, or (iii) writing up tasks that detail how to fix the tests. Don’t spend time on any of these three if management isn’t already interested in fixing tests; they are all either too specific or too disconnected from the context that decision makers operate in every day. These tasks may become useful later, but not until everyone is aligned on the goal. And I really do mean aligned: don’t assume the team is united after talking to one or two people and ignoring half the stakeholders. Even if you get approval from “the boss”, you need to bring the rest up to speed too. Truly everyone has to sit at the same table and want this to be fixed. If you miss this point, you may have tests working again for a few months, but you will fall back into old habits because you didn’t change how work gets done in your organization.

This is a textbook case of a “Change Management” problem, and there are many processes available to work through it. The first step in most of them involves building a vision of the future that will win over stakeholders. A vision can’t simply be a lame user story like “as a developer I want the tests to work”; it needs to be tied to a business value your stakeholders also care about, like “10% shorter development cycle time” or “20% fewer critical issues released to customers”. It helps if this is backed up with data, but don’t get deep into fine-grained details of what is wrong; stay at the decision-maker level. If it turns out failing tests really don’t impact the business, perhaps the status quo is just fine, and that is what the stakeholders will conclude too.

Only after stakeholders are convinced that “the future must be free from failing tests” can action plans be presented.

How To Fix The Tests

The technical challenges of fixing flaky tests can be divided into two closely related issues: Predictability and Isolation. If these two are solved, tests cannot randomly fail.

Predictability

In a nutshell: every time you run something with the same input, the same result should happen in the exact same way, every time. Unpredictable input might be intentional, but it becomes flickery when it is implemented in a way that puts it outside the test’s control.

One source of “intentional randomness” is test data building itself. Several popular languages have a library called “Faker” that generates random fake data. For example, if you want a new user, it will generate the name, email, password, etc., all based on a random input value. The issue is these fakers don’t understand the nuances of how your software works. They might create two different users with the same email, or you may need 200 users with unique ages while ages are constrained from 0 to 100. If you’re stuck with a faker library, look for a setting to adjust the “random seed” initialization, and hope that your Faker is not using a cryptographically secure random number generator, but some predictable, seedable random source instead.
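
As a minimal sketch, assuming the @faker-js/faker package and its seed() setting (your faker library and its API may differ), seeding makes the “random” data identical on every run:

// Assumes @faker-js/faker; a seeded faker produces the same data every run.
const { faker } = require('@faker-js/faker');

faker.seed(42); // same seed => same sequence of generated values

const user = {
  name: faker.person.fullName(),
  email: faker.internet.email(),
};

// Running this twice prints the same user both times.
console.log(user);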

Another source of “intentional randomness” comes from calls to datetime functions to discover the time “now”. This works well at runtime, but you can’t predict at what time in the future a test might run. Calls to a random number generator have the same issue. Randomness can also hide behind the implementation of another concept, for example a browser test covering a page that has A/B testing on it. There are two ways out: (i) use a testing library that “freezes” time so tests always run at the same exact point in time, or (ii) use Dependency Injection so your implementation can be overridden and injected with a value the test is aware of. Almost every language has some way to do one of these. You must be able to replace the actual date or random values with ones a test can control.
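
Here is a small sketch of option (ii), Dependency Injection of a clock; createInvoice and fixedClock are illustrative names, not from any particular framework:

// Production code accepts a clock instead of calling new Date() directly.
const createInvoice = function(items, clock) {
  clock = clock || function() { return new Date(); };
  return {
    items: items,
    createdAt: clock(), // the test controls this value
  };
};

// In a test, inject a frozen clock so the result is predictable.
const fixedClock = function() { return new Date('2020-01-01T00:00:00Z'); };
const invoice = createInvoice(['book'], fixedClock);
// invoice.createdAt is always 2020-01-01, no matter when the test runs.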

Sometimes randomness is unintentional too. All too often have I seen tests fail or start working as performance of a website changes. Usually the code looks like this:

// Using wait like this is bad, don't do it.
const loginHelper = function() {
  browser.get('loginPage');
  wait(1000); // Wait for the page to load
  const clickable = element(by.id('login'));
  wait(1000); // Wait for the element to be rendered
  clickable.click();
  wait(1000); // Wait for the next page to load
};

The problem is that the test execution is not waiting for the browser, the browser is not waiting for the test, and the system under test isn’t waiting either. The three of them are in a race, and the test only tended to pass because 1000ms was usually long enough for the page state to settle. If the browser gets slower, or the test gets faster, this will break. Worse, when code like this spreads through a big project it wastes a lot of time in your CI pipeline. Debug logs and screen captures will be full of blank browser pages, or the previous page, or sometimes entirely different pages.

Instead, your code should inspect the page state and wait for the page to load, wait for the element to appear, then click it, then wait for the page to change to some new state, then check if the state is what was expected.

const loginHelper = async () => {
  await browser.get('loginPage');
  // Wait until the element actually exists before interacting with it.
  await waitForElement(element(by.id('login')));
  await element(by.id('login')).click();
};

Of course it’s possible to write this with promises too, or with other features your language might provide. Your browser testing framework might even provide this as part of its standard API.

Isolation

On the surface it might seem that if all the individual components of a larger system are predictable, the system itself will also be predictable. The trap is that each time one complex system is combined with another, a new level of complex interaction is created. On top of that, if the larger system doesn’t control all the fine details of a subsystem, the test loses some of the control it can enforce. This is analogous to how a human is more complex than their individual cells, yet lacks control over what individual cells do.

The complexity of fully testing a large system is proportional to the complexity of the system under test, and controlling flaky tests only gets harder as extra levels of complexity are added. To be clear, here I mean complexity in terms of “moving parts”, not the mental models people use to reason about their code; adding a browser test is easy to understand as a mental model, but involves many more moving parts than a simple unit test would.

One of the best ways to reduce complexity is to isolate scope. Imagine your next project has the concepts of a “Customer”, “Products for Sale”, “Product Recommendations”, and “Purchasing”. Testing the purchasing system or the recommendation engine requires having a customer and products. In theory a developer could re-use the test setup for customers and products to share logic and make development faster, but those features now have common dependencies, and data will be written to each in different ways. If your development team built “Purchasing” before “Product Recommendations”, it is more likely that a change to purchasing will break recommendation tests, even if the features aren’t really broken! One way to deal with this is to reset the database after each group of tests, which is unfortunately quite slow, and a side effect of using the “moving parts” of a database in this hypothetical test. In practice, re-use of test setup is only safe when a setup has no more than one descendant. In one project I witnessed this issue when over 30 complex tests would fail without obvious cause: many levels of setup dependency away, the code that generated a customer wasn’t generating unique IDs, but the surface error would show up as a failure in the payments system, or the state machine, or the registration process. That entity was central to tests in over 30 different parts of the code, making the cost of replacing it quite high.
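
One way to avoid that trap is a tiny fixture factory that guarantees uniqueness by construction; the sketch below is illustrative (buildCustomer is not from the project described above):

// Illustrative fixture factory: every customer gets a unique id and email.
let nextId = 0;

const buildCustomer = function(overrides) {
  nextId += 1;
  return Object.assign({
    id: nextId,
    email: 'customer' + nextId + '@example.com',
  }, overrides); // each test can still pin the fields it cares about
};

const a = buildCustomer();
const b = buildCustomer({ email: 'fixed@example.com' });
// a.id !== b.id on every run, so tests sharing this setup stay isolated.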

Lastly, a dependency on externally controlled state isn’t always easy to detect either. “Temporal Coupling” is the idea that state is bound to execution sequence, and changing the sequence can break code. Simply failing to reset all input state can make fixing flaky tests even harder. Remember, values in a database are input: they are data put into the execution of code. Each extra level of state added to the database is an extra level of input.
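
In practice this usually means resetting that input before every test; a minimal sketch, assuming a Jest/Mocha-style runner and a hypothetical db.truncate helper:

// Hypothetical helper: wipe every table a test can write to before each test,
// so no test depends on what an earlier test happened to leave behind.
beforeEach(async function() {
  await db.truncate(['customers', 'products', 'purchases']);
});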

Similarly, depending on out-of-process calls (perhaps to an API server), or on calls to another service in your architecture, can also create flaky tests. Sometimes this is difficult to overcome. As an experiment, let’s assume there is a backend system designed in a way that it must make HTTP calls to some data provider in order to work. That service needs to run on a web server, which means the “moving parts complexity” of your tests just gained an entire web server. If your test runner only controls the web browser, what is controlling the web service far away? What happens when your browser opens a page with an iframe (hacked in to meet a deadline) that also makes an HTTP request? Which backend-to-backend request completes first? What is the state of your software? One technique to overcome multiple backend workers is to reduce the test infrastructure down to one web worker, forcing everything to queue up and wait for the oldest request to be processed first.
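
A complementary option (not spelled out above, but aimed at the same problem) is to replace the remote data provider with a tiny local stub during tests, so the only web server involved is one the test fully controls; the port and response shape below are made up:

// A deterministic stand-in for the external data provider, using Node's http module.
const http = require('http');

const stubProvider = http.createServer(function(req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ recommendations: ['book-1', 'book-2'] })); // canned data
});

// Point the system under test at http://localhost:4010 during the test run,
// and close the stub when the suite finishes.
stubProvider.listen(4010);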

How to confirm a flaky test is fixed

The hardest part of fixing a flaky test is that it isn’t always obvious when it is really fixed. For example, it might fail only once every 10 builds. The solution is to run it 20 times, and I recommend doing it in parallel by configuring your CI server to spawn 20 workers. Repeating this lets you check whether the issue shows up 50% of the time, 10% of the time, or 0.01% of the time, and after each fix you get a concrete view of how your changes impacted the “flakiness” of your test suite.
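
If you can’t spawn parallel CI workers, a rough sequential approximation (assuming “npm test” runs your suite; adjust the command to your project) looks like this:

// Run the suite N times and report how often it fails.
const { execSync } = require('child_process');

const runs = 20;
let failures = 0;

for (let i = 0; i < runs; i++) {
  try {
    execSync('npm test', { stdio: 'ignore' });
  } catch (e) {
    failures++;
  }
}

console.log(failures + '/' + runs + ' runs failed');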

For more reading, I recommend learning about good test design in Chapter 4 of The Tao of Testing.

Brian Graham