Flaky Tests: Causes, Detection, and How to Eliminate Them

A flaky test is one that produces inconsistent pass/fail results across multiple runs against the same codebase without any change to the application or the test. Flaky tests are one of the most damaging reliability problems in automated test suites because they create noise that engineers learn to ignore — which means genuine failures get ignored alongside false positives. The primary causes of flakiness are timing dependencies, shared or polluted state between tests, non-deterministic test data, and external service instability.

Why Flaky Tests Are a Serious Problem

A failing test produces two possible responses from an engineer: investigate the failure or re-run the test. When a test suite has a high flakiness rate, re-running becomes the default response. Over time, this trains engineers to distrust test results, so a real regression may be dismissed with a re-run rather than investigated. At that point the test suite has failed in its primary purpose: providing reliable early warning of regressions.

Flaky tests also impose direct cost. Each flaky run requires manual investigation to determine whether the failure is real. In a CI pipeline that gates deployments on test passage, a flaky test can block a release until someone confirms the failure is non-deterministic and re-runs the pipeline. At scale, the aggregate engineering time spent on this pattern is significant.

Even a small per-test flakiness rate compounds quickly across a large suite and a large number of daily CI runs, because a pipeline run fails if any single test in it fails. For teams at all scales, flakiness is a maintenance priority, not a cosmetic issue.
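To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes every test flakes independently and at the same rate, which real suites only approximate, and the figures are hypothetical rather than measured.

```python
def spurious_failure_probability(per_test_flake_rate: float, test_count: int) -> float:
    """Chance that at least one test fails spuriously in one suite run.

    Assumes tests flake independently and at the same rate (a simplification).
    """
    return 1.0 - (1.0 - per_test_flake_rate) ** test_count


# Hypothetical suite: 1,000 tests, each flaking once in every 1,000 runs.
p = spurious_failure_probability(0.001, 1000)
print(f"{p:.0%} of suite runs would show at least one spurious failure")  # prints 63%
```

Under these assumptions, a per-test rate that sounds negligible still makes most pipeline runs fail for no real reason.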

For background on building reliable test suites, see Astaqc's test automation services and the complete software testing guide.

Root Causes of Flaky Tests

Most flakiness falls into a small number of root cause categories. Understanding which category affects a specific test is the prerequisite for fixing it.

Timing and asynchrony. This is the most common cause of flakiness in browser tests. A test clicks a button before the resulting network request completes, asserts on text before a component re-renders, or checks for a modal that takes 200ms to animate in. The test passes when the test runner happens to lag far enough behind the application; it fails when the runner gets ahead of it, which can happen either because the test machine is faster or because the application under test is slower. Playwright's built-in element auto-waiting and Selenium's explicit WebDriverWait address this, but only when applied to every asynchronous operation: a single missing wait is enough to make a test non-deterministic.

Shared state between tests. Tests that modify shared database records, browser cookies, local storage, or cache without resetting after each run leave the environment in an unpredictable state for subsequent tests. The failure often appears in a different test than the one that caused it, making it particularly hard to diagnose. Each test should create its own state and clean it up, or run in a completely isolated environment.

Non-deterministic test data. Tests that depend on data that changes — current time, randomly assigned IDs, records created by other users in a shared staging environment — will fail when the data does not match expectations. A test that asserts the first item in a list is a specific record will fail whenever someone adds a new record that appears first. Tests should control their own data rather than depending on ambient state.
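One common way to remove a dependency on the current time is to pass the clock in as a parameter. The sketch below is illustrative only: `is_trial_expired`, `utc_now`, and the 14-day trial are hypothetical names and values, not taken from any particular codebase.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def is_trial_expired(
    started_at: datetime,
    now: Callable[[], datetime] = utc_now,  # production code uses the real clock
    trial_days: int = 14,
) -> bool:
    return now() - started_at > timedelta(days=trial_days)


def test_trial_expires_after_fourteen_days() -> None:
    started = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # The test supplies its own clock, so the result never depends on the run date.
    assert is_trial_expired(started, now=lambda: datetime(2024, 1, 16, tzinfo=timezone.utc))
    assert not is_trial_expired(started, now=lambda: datetime(2024, 1, 15, tzinfo=timezone.utc))
```

The same idea applies to random IDs and list ordering: the test creates the value it asserts on instead of reading whatever happens to be there.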

External service instability. Tests that call real third-party APIs (payment processors, email services, authentication providers) are subject to those services' availability and response time variation. A test that takes 30 seconds when a service is slow will fail a 25-second timeout. External service calls in tests should either be mocked or have explicit retry and timeout handling for known variability ranges.
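Where a real call has to stay in the test, explicit retry handling might look like the sketch below. The helper name `call_with_retry`, the default attempt count, and the choice of exceptions to retry on are all assumptions for illustration. The sleep function and random source are injectable so the helper itself can be tested without waiting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Retry `operation` on timeout or connection errors with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # Full jitter: wait a random time up to an exponentially growing cap.
            sleep(rng.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")
```

Retrying only on transport-level errors keeps assertion failures visible; a wrong response should still fail the test on the first attempt.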

Resource contention and environment variability. Tests that depend on available CPU, memory, or network bandwidth can fail on under-resourced CI runners. An absolute time wait that is sufficient on a developer laptop may be insufficient on a shared CI runner under heavy load. Element-condition waits are the solution; absolute time waits are the symptom.

Test order dependence. A test that assumes it will always run after a specific other test relies on ordering rather than isolation. Test suites that shuffle execution order expose this immediately; suites that always run in the same order hide it until parallelization is introduced.

How to Detect Flaky Tests

Detection requires running the same test suite multiple times against an unchanged codebase and tracking which tests produce different results across runs. A test that passes 90 of 100 runs has a 10% flakiness rate; a test that fails once in 20 runs may go undetected for months.
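As a rough sketch of that measurement, the helper below runs one test callable repeatedly and reports the fraction of runs that failed. `measure_flakiness` is a hypothetical name, and the "flaky" test is a deterministic stand-in that fails on every tenth call so the example is repeatable.

```python
import itertools
from typing import Callable


def measure_flakiness(test: Callable[[], None], runs: int = 100) -> float:
    """Run `test` repeatedly on unchanged code; return the fraction of failing runs."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            failures += 1
    return failures / runs


_calls = itertools.count(1)


def sometimes_fails() -> None:
    # Stand-in for a flaky test: fails on every tenth call.
    assert next(_calls) % 10 != 0


print(measure_flakiness(sometimes_fails, runs=100))  # prints 0.1
```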

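Re-run failed tests automatically. Configure your test runner or CI system to re-run failing tests once before reporting a failure. A test that fails on the first run and passes on the second against the same commit is almost certainly flaky. CI systems such as GitHub Actions, CircleCI, and Buildkite each offer some form of re-run, though whether it is automatic and whether it applies per test or per job varies by platform; some test runners also have a built-in retry setting. The re-run data, accumulated over weeks, produces a ranked list of your flakiest tests.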
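Turning re-run data into a ranked list could look like the following sketch. The record shape (test name, first-attempt result, retry result) is an assumption; real CI exports will differ and need adapting.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record shape: (test_name, first_attempt_passed, retry_passed)
Record = Tuple[str, bool, bool]


def classify(first_passed: bool, retry_passed: bool) -> str:
    if first_passed:
        return "pass"
    return "flaky" if retry_passed else "fail"


def rank_flaky(records: Iterable[Record]) -> List[Tuple[str, int]]:
    """Count fail-then-pass outcomes per test, most frequent first."""
    counts = Counter(
        name for name, first, retry in records if classify(first, retry) == "flaky"
    )
    return counts.most_common()
```

Tests that fail on both attempts are left out of the ranking on purpose: a consistent failure is more likely a real regression than flakiness.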

Run tests multiple times in parallel. Run the full suite several times concurrently and compare results. Tests that produce inconsistent outcomes across instances are flaky. This is the most direct detection method but requires CI budget for parallel execution.

Track failure rates over time. Instrumented CI pipelines that store test results with timestamps and run IDs reveal flakiness patterns. A test with a 15% failure rate over 100 CI runs is a known flaky test. Most test reporting tools (Allure, TestRail, Buildkite Insights, Cypress Cloud) provide this visibility.

Monitor for test order sensitivity. Run the test suite in random execution order and compare results to deterministic-order runs. Test order sensitivity is invisible in single-run or fixed-order pipelines and only manifests when execution order varies.
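A randomized run is only useful if a failing order can be reproduced, so the shuffle should be driven by a seed that gets logged. The sketch below shows the idea with plain callables; `run_in_random_order` is a hypothetical helper, and in practice a test runner plugin would normally do this job.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


def run_in_random_order(
    tests: Dict[str, Callable[[], None]],
    seed: Optional[int] = None,
) -> Tuple[int, List[str], List[str]]:
    """Run tests in a seeded random order; return (seed, order, failed test names)."""
    if seed is None:
        seed = random.randrange(2**32)
    order = sorted(tests)  # fixed starting point so the seed fully determines the order
    random.Random(seed).shuffle(order)
    failures = []
    for name in order:
        try:
            tests[name]()
        except Exception:
            failures.append(name)
    return seed, order, failures
```

Logging the returned seed with each CI run means an order-dependent failure can be replayed locally by passing the same seed back in.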

Isolate the test environment. Tests that pass locally and fail in CI, or pass in sequential execution and fail in parallel execution, are usually flaky due to environment differences or shared state. Reproducing CI conditions locally narrows the cause. See the manual vs automated testing guide and Astaqc's software testing services for structured approaches to test environment design.

Quarantining Flaky Tests

Before eliminating flakiness, quarantine it. A quarantined test runs in CI but does not block deployment on failure. This removes the immediate pain — engineers stop losing time re-running pipelines — while preserving visibility into the flaky test's behavior for analysis.

Most test frameworks support a skip or pending mechanism, but a skipped test no longer runs, so it stops producing the failure data needed for analysis. For that reason some teams instead maintain a known-flaky list that is excluded from the blocking gate but included in a separate reporting job. Quarantined tests should have a deadline: a test that is quarantined indefinitely is a test that is never fixed. A two-week quarantine window with an assigned owner is a workable policy.
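A known-flaky list with owners and deadlines can be as simple as a small data structure checked by a scheduled job. The sketch below assumes the two-week window described above; `QuarantineEntry` and `overdue` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class QuarantineEntry:
    test_name: str
    owner: str
    quarantined_on: date
    window_days: int = 14  # the two-week policy; adjust to your own

    @property
    def deadline(self) -> date:
        return self.quarantined_on + timedelta(days=self.window_days)


def overdue(entries: Iterable[QuarantineEntry], today: date) -> List[QuarantineEntry]:
    """Entries whose quarantine window has passed without a fix."""
    return [entry for entry in entries if today > entry.deadline]
```

A reporting job could fail, or notify the owner, whenever `overdue` returns a non-empty list.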

Flaky Test Root Causes and Fixes: Comparison

Root Cause	Typical Symptom	Fix
Timing and async	Fails intermittently in CI, passes locally	Replace sleep/fixed waits with element-condition waits
Shared state	Fails after a specific other test runs	Reset database/storage after each test; use test isolation
Non-deterministic data	Fails when ambient data changes	Control test data creation; avoid ambient state assertions
External service calls	Fails intermittently on API timeout	Mock external services or add explicit retry logic with jitter
Resource contention	Fails only on loaded CI runners	Increase timeout budgets; move to dedicated CI runners
Test order dependence	Fails only in specific execution order	Enforce test isolation; run suite in randomized order
Selector instability	Fails after UI changes with no change to the test	Use stable selectors (data-testid, ARIA roles, visible text)

Strategies for Eliminating Flaky Tests

Replace all absolute waits with condition waits. Every sleep or fixed-duration wait in a browser test is a red flag. The correct pattern is waiting for a specific element state: visible, enabled, containing expected text, or detached. Playwright's locator actions wait for the target element to be actionable, and its web-first assertions retry until the condition holds or a timeout expires. Selenium requires explicit WebDriverWait conditions. Time-based waits are almost always too short on slow machines and wasteful on fast ones.
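The difference between the two kinds of wait is easiest to see in a framework-free sketch. The `wait_until` helper below is a hypothetical stand-in for what Playwright and Selenium provide natively. It polls a condition and returns as soon as it holds, so the timeout is an upper bound rather than a fixed cost.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 5.0, interval: float = 0.05) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


start = time.monotonic()
# Stand-in for "the modal became visible": true about 0.2s after start.
wait_until(lambda: time.monotonic() - start >= 0.2, timeout=2.0)
```

On a fast machine this returns after roughly 0.2 seconds; on a slow one it can use up to the full two-second budget before failing with a clear error.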

Enforce test isolation at the data layer. Each test should set up the data it needs, run against that data, and tear it down. Database transactions rolled back after each test, or fresh database instances per test run, provide complete isolation. Tests that share a staging database with real users or other test runs are inherently non-deterministic unless data access is scoped to identifiers controlled by the test.
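A minimal sketch of rollback-based isolation, using an in-memory SQLite database and the standard library's unittest; the `users` table and test names are hypothetical. Each test inserts the same row and expects to see exactly one, which only holds if the previous test's write was rolled back.

```python
import sqlite3
import unittest


class UserTableTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        cls.conn = sqlite3.connect(":memory:")
        cls.conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        cls.conn.commit()

    @classmethod
    def tearDownClass(cls) -> None:
        cls.conn.close()

    def tearDown(self) -> None:
        # Discard whatever the test wrote so the next test starts clean.
        self.conn.rollback()

    def _count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_insert_first(self) -> None:
        self.conn.execute("INSERT INTO users (name) VALUES ('ada')")
        self.assertEqual(self._count(), 1)

    def test_insert_second(self) -> None:
        self.conn.execute("INSERT INTO users (name) VALUES ('ada')")
        self.assertEqual(self._count(), 1)
```

Saved to a file, this can be run with `python -m unittest`. The approach relies on the tests never committing; code under test that commits its own transactions needs a fresh database per test instead.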

Mock external dependencies. Any test that calls a real external API depends on that service's availability and response time. Mock these calls at the HTTP layer using tools like nock, WireMock, or MSW (Mock Service Worker). The test validates the application's behavior given a specific response; the service's behavior is tested separately in integration tests that run less frequently.
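As an illustration in Python, the sketch below replaces the network call with a canned response using the standard library's unittest.mock. Note that this patches the client function rather than intercepting at the HTTP layer as nock, WireMock, or MSW do. The function `fetch_exchange_rate` and its URL are hypothetical.

```python
import io
import json
from unittest import mock
from urllib import request


def fetch_exchange_rate(currency: str) -> float:
    # Hypothetical endpoint, for illustration only.
    url = f"https://rates.example.invalid/latest?to={currency}"
    with request.urlopen(url, timeout=5) as response:
        return json.load(response)["rate"]


def test_fetch_exchange_rate_parses_response() -> None:
    canned = io.BytesIO(b'{"rate": 0.92}')  # stands in for the HTTP response body
    with mock.patch("urllib.request.urlopen", return_value=canned) as urlopen:
        assert fetch_exchange_rate("EUR") == 0.92
    # The real network was never touched; the patched function received the call.
    urlopen.assert_called_once()
```

The test now checks only how the application handles a known response, so its outcome no longer depends on the remote service being up or fast.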

Use stable selectors. CSS selectors that reference generated class names, DOM position, or framework-internal attributes break whenever the underlying implementation changes. Prefer data-testid or data-cy attributes placed explicitly for testing, ARIA roles and accessible names, or visible text. These attributes are stable across refactors because they express the element's purpose, not its implementation. See Astaqc's QA team service for expert support on selector strategy.

Delete tests that cannot be fixed. A test that has been flaky for six months with no path to a fix is not providing value. It is consuming CI time and developer attention, and it is blocking deployments. Delete it and replace it with a deterministic test that covers the same behavior through a different mechanism. The goal of a test suite is reliable signal, not maximum test count. See Astaqc's testing documentation services for help defining test standards.

Preventing New Flaky Tests

Code review for test quality. Review test code with the same rigor as production code. Flag absolute waits, missing isolation, and external service calls without mocking. A checklist in pull request templates for test-relevant changes reduces the rate of known-bad patterns entering the suite.

Run new tests multiple times before merging. Require that new tests pass several consecutive CI runs before the PR is merged. A test that passes 5 of 5 runs has a lower probability of being flaky than one that passed once. This catches timing-sensitive tests before they enter the main suite.
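How many consecutive runs is enough depends on the flakiness rate you want to catch. The sketch below assumes independent runs with a constant failure probability, which is a simplification; the function names are hypothetical.

```python
import math


def chance_of_passing_every_run(flake_rate: float, runs: int) -> float:
    """Probability that a test with the given flake rate passes all `runs` runs."""
    return (1.0 - flake_rate) ** runs


def runs_needed_to_catch(flake_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs that shows at least one failure with `confidence`."""
    if not 0.0 < flake_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("flake_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - flake_rate))


print(round(chance_of_passing_every_run(0.10, 5), 2))  # prints 0.59
print(runs_needed_to_catch(0.10))                      # prints 29
```

Under these assumptions, a test that flakes 10% of the time still passes five runs in a row more often than not, so a 5-of-5 gate reduces risk but does not rule flakiness out.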

Monitor flakiness rate as a team metric. Track flakiness rate (flaky failures as a percentage of total test runs) over time and set a threshold. Treat breaches as a maintenance priority. The software testing cost guide provides a framework for quantifying the cost of test suite unreliability.

Invest in test infrastructure. Dedicated CI runners with consistent resource allocation, containerized test environments with pinned dependencies, and database snapshots for fast isolation all reduce the environmental variability that enables timing and state-related flakiness. For guidance on test infrastructure strategy, see Astaqc's test automation services.

Frequently Asked Questions

How do I find which tests are flaky in my existing suite?

The most reliable method is to configure CI to re-run failed tests once and log both the first-run and second-run result. Tests that fail on the first run and pass on the second are flaky. After a few weeks of data, sort your test results by failure-then-pass rate to produce a ranked flakiness list. Many test runners and CI platforms support some form of automatic re-run on failure, either natively or through a plugin.

Is it ever acceptable to use sleep or fixed waits in browser tests?

In rare cases, yes — when a specific delay is a business requirement (a deliberate rate limit, an animation that must complete before the next interaction is valid, or a third-party script that injects content after a fixed delay). In all other cases, a fixed wait is a sign of missing element-condition waiting and should be replaced. Many teams enforce a lint rule that flags time-based waits in test code.
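Such a lint rule can be quite small. The sketch below uses Python's ast module to flag calls to `time.sleep(...)`, bare `sleep(...)`, and any `.wait_for_timeout(...)` method in test source. Which call names to flag is an assumption and should be adapted to the frameworks in use.

```python
import ast
from typing import List


def find_fixed_waits(source: str) -> List[int]:
    """Return the line numbers of fixed-duration wait calls in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        is_bare_sleep = isinstance(func, ast.Name) and func.id == "sleep"
        is_timeout_wait = isinstance(func, ast.Attribute) and func.attr == "wait_for_timeout"
        if is_time_sleep or is_bare_sleep or is_timeout_wait:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''import time

def test_modal_opens(page):
    page.click("#open")
    time.sleep(2)
    page.wait_for_timeout(500)
    assert page.is_visible("#modal")
'''

print(find_fixed_waits(SAMPLE))  # prints [5, 6]
```

Legitimate fixed waits, such as the rate-limit case above, can be allowed through an explicit suppression comment so that each exception is visible in review.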

Can I use retry logic to solve flakiness instead of fixing the root cause?

Retry logic is a useful quarantine strategy, not a fix. Automatically retrying a flaky test masks the symptom — the pipeline passes — but does not eliminate the non-determinism. In high-volume pipelines, a flaky test with automatic retry costs more CI compute than a reliable test. Fix the root cause; use retry as a short-term quarantine while you do.

How should I handle flaky tests that I cannot reproduce locally?

CI-only failures usually indicate environment differences: different browser versions, different resource constraints, different network latency, or different database state. Start by comparing the CI environment specification against your local environment. Add verbose logging to the test to capture element state and page content at the point of failure. Then run the test locally in a Docker container that matches the CI image to reproduce the constraint.

What is the difference between a flaky test and a brittle test?

A flaky test fails non-deterministically on the same codebase. A brittle test fails deterministically when the application changes in ways unrelated to the test's intended coverage — a renamed CSS class, a restructured DOM, a moved UI element. Flakiness is a timing or state isolation problem; brittleness is a selector stability and abstraction problem. Both degrade suite value, but they have different root causes and different fixes. See Astaqc's manual testing services and performance testing services.

How do I convince my team to invest time in fixing flaky tests?

Frame it in concrete cost. Count the number of CI pipeline re-runs caused by flakiness over the past month and multiply by the time each one costs: the average pipeline runtime plus the engineer-minutes spent investigating the failure. A pipeline that re-runs 50 times per month with a 20-minute runtime and 10 minutes of investigation time per failure represents 50 × 30 minutes, or 25 hours, of lost engineering time monthly, which is more than enough to justify a focused maintenance sprint. The software testing cost guide provides a framework for quantifying QA investment returns.
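The same estimate as a small function, so the inputs can be swapped for your own numbers. This simple model counts pipeline wait time and investigation time equally as lost time; the function name is hypothetical.

```python
def monthly_flakiness_cost_hours(
    reruns_per_month: int,
    pipeline_minutes: float,
    investigation_minutes: float,
) -> float:
    """Hours lost per month: each re-run costs one pipeline run plus one investigation."""
    return reruns_per_month * (pipeline_minutes + investigation_minutes) / 60


print(monthly_flakiness_cost_hours(50, 20, 10))  # prints 25.0
```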

Avanish Pandey
