
June 14, 2026

A flaky test is one that produces inconsistent pass/fail results across multiple runs against the same codebase without any change to the application or the test. Flaky tests are one of the most damaging reliability problems in automated test suites because they create noise that engineers learn to ignore — which means genuine failures get ignored alongside false positives. The primary causes of flakiness are timing dependencies, shared or polluted state between tests, non-deterministic test data, and external service instability.
A failing test produces two possible responses from an engineer: investigate the failure or re-run the test. When a test suite has a high flakiness rate, re-running becomes the default response. Over time, this trains engineers to distrust test results, which means that a real regression may be dismissed with a re-run rather than investigated. At that point the test suite has failed in its primary purpose: providing reliable early warning of regressions.
Flaky tests also impose direct cost. Each flaky run requires manual investigation to determine whether the failure is real. In a CI pipeline that gates deployments on test passage, a flaky test can block a release until someone confirms the failure is non-deterministic and re-runs the pipeline. At scale, the aggregate engineering time spent on this pattern is significant.
Even a small percentage of flaky tests compounds quickly across a large number of daily CI runs, because every run is another chance for at least one of them to fail. For teams at all scales, flakiness is a maintenance priority, not a cosmetic issue.
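The compounding effect can be made concrete with a little arithmetic. The sketch below assumes every test fails spuriously with the same probability and independently of the others, which is a simplification; the 1,000-test suite and the 0.1% rate are illustrative numbers, not measurements.

```python
def prob_run_hits_flake(num_tests: int, per_test_flake_rate: float) -> float:
    """Probability that at least one test fails spuriously in a single run.

    Assumes failures are independent and equally likely for every test.
    """
    return 1.0 - (1.0 - per_test_flake_rate) ** num_tests


# Illustrative numbers: 1,000 tests, each flaking 0.1% of the time.
print(f"{prob_run_hits_flake(1000, 0.001):.0%}")  # 63%
```

Under those assumptions, roughly two out of three pipeline runs would contain at least one spurious failure, even though each individual test looks almost perfectly reliable.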
For background on building reliable test suites, see Astaqc's test automation services and the complete software testing guide.
Most flakiness falls into a small number of root cause categories. Understanding which category affects a specific test is the prerequisite for fixing it.
Timing and asynchrony. The most common cause of flakiness in browser tests. A test clicks a button before the resulting network request completes, asserts on text before a component re-renders, or checks for a modal that takes 200ms to animate in. The test passes when the application happens to finish before the test's next step runs and fails when the relative timing shifts, for example on a CI runner that is faster or slower than the machine the test was written on. Playwright's built-in element auto-waiting and Selenium's explicit WebDriverWait address this, but only when applied to every asynchronous operation; a single missing wait is enough to make a test non-deterministic.
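The idea behind condition-based waiting can be sketched without any browser library. The `wait_until` helper below is hypothetical and is not Playwright's or Selenium's API; it only illustrates the principle of polling for a state instead of pausing for a fixed duration. A background timer stands in for the asynchronous application.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # short polling pause, not a guess at how long the app needs


# Simulated asynchronous work: the state becomes ready after about 200 ms.
state = {"ready": False}
threading.Timer(0.2, lambda: state.update(ready=True)).start()

# The wait returns as soon as the state is ready, however long that takes (up to the timeout).
wait_until(lambda: state["ready"], timeout=2.0)
```

The timeout acts as an upper bound rather than an expected duration, so the same test stays fast on a quick machine and still passes on a slow one.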
Shared state between tests. Tests that modify shared database records, browser cookies, local storage, or cache without resetting after each run leave the environment in an unpredictable state for subsequent tests. The failure often appears in a different test than the one that caused it, making it particularly hard to diagnose. Each test should create its own state and clean it up, or run in a completely isolated environment.
Non-deterministic test data. Tests that depend on data that changes — current time, randomly assigned IDs, records created by other users in a shared staging environment — will fail when the data does not match expectations. A test that asserts the first item in a list is a specific record will fail whenever someone adds a new record that appears first. Tests should control their own data rather than depending on ambient state.
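For time-dependent logic, one way to control the data is to pass the current time in as a parameter. The following is a minimal sketch; `is_trial_expired` and its 14-day rule are hypothetical application code invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_trial_expired(started_at: datetime, now: datetime, trial_days: int = 14) -> bool:
    """Hypothetical business rule: a trial expires `trial_days` after it starts."""
    return now >= started_at + timedelta(days=trial_days)


def test_trial_expiry_is_deterministic() -> None:
    # The test supplies both timestamps, so the result never depends on the wall clock.
    started = datetime(2026, 1, 1, tzinfo=timezone.utc)
    assert not is_trial_expired(started, now=started + timedelta(days=13))
    assert is_trial_expired(started, now=started + timedelta(days=14))


test_trial_expiry_is_deterministic()
```

Because the test never reads the system clock, it gives the same result on any day and in any time zone.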
External service instability. Tests that call real third-party APIs (payment processors, email services, authentication providers) are subject to those services' availability and response time variation. A test that takes 30 seconds when a service is slow will fail a 25-second timeout. External service calls in tests should either be mocked or have explicit retry and timeout handling for known variability ranges.
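Where a real service call has to stay in the test, explicit retry handling might look like the sketch below. `call_with_retry` and `flaky_service` are hypothetical names; the backoff values are placeholders that would need tuning to the service's known variability.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call `operation`, retrying transient errors with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real failure
            # Randomized delay so parallel jobs do not retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Simulated service that times out twice and then responds.
calls = {"count": 0}


def flaky_service() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated slow response")
    return "ok"


result = call_with_retry(flaky_service, base_delay=0.01)
```

Only errors listed in `retry_on` are retried, so an assertion failure or an unexpected exception still fails the test immediately.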
Resource contention and environment variability. Tests that depend on available CPU, memory, or network bandwidth can fail on under-resourced CI runners. An absolute time wait that is sufficient on a developer laptop may be insufficient on a shared CI runner under heavy load. Element-condition waits are the solution; absolute time waits are the symptom.
Test order dependence. A test that assumes it will always run after a specific other test relies on ordering rather than isolation. Test suites that shuffle execution order expose this immediately; suites that always run in the same order hide it until parallelization is introduced.
Detection requires running the same test suite multiple times against an unchanged codebase and tracking which tests produce different results across runs. A test that passes 90 of 100 runs has a 10% flakiness rate; a test that fails once in 20 runs may go undetected for months.
Re-run failed tests automatically. Configure your CI system to re-run failing tests once before reporting a failure. A test that fails on the first run and passes on the second is almost certainly flaky. Many CI systems (GitHub Actions, CircleCI, Buildkite) support this natively. The re-run data, accumulated over weeks, produces a ranked list of your flakiest tests.
Run tests multiple times in parallel. Run the full suite several times concurrently and compare results. Tests that produce inconsistent outcomes across instances are flaky. This is the most direct detection method but requires CI budget for parallel execution.
Track failure rates over time. Instrumented CI pipelines that store test results with timestamps and run IDs reveal flakiness patterns. A test with a 15% failure rate over 100 CI runs is a known flaky test. Most test reporting tools (Allure, TestRail, Buildkite Insights, Cypress Cloud) provide this visibility.
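If no reporting tool is in place, the underlying calculation is simple enough to run on exported results. This sketch assumes a hypothetical record shape (`ResultRecord`); real CI exports will differ. Note that it counts every failure, so genuine regressions need to be filtered out separately.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ResultRecord(NamedTuple):
    run_id: str
    test_name: str
    passed: bool


def failure_rates(results: Iterable[ResultRecord]) -> dict[str, float]:
    """Return the fraction of recorded runs in which each test failed."""
    runs = Counter()
    failures = Counter()
    for result in results:
        runs[result.test_name] += 1
        if not result.passed:
            failures[result.test_name] += 1
    return {name: failures[name] / runs[name] for name in runs}


# 100 recorded runs of one test, 15 of which failed.
history = [ResultRecord(f"run-{i}", "test_checkout", passed=i >= 15) for i in range(100)]
print(failure_rates(history))  # {'test_checkout': 0.15}
```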
Monitor for test order sensitivity. Run the test suite in random execution order and compare results to deterministic-order runs. Test order sensitivity is invisible in single-run or fixed-order pipelines and only manifests when execution order varies.
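Randomized ordering is only useful if a failing order can be reproduced, which is why shuffles are usually driven by a logged seed. The runner below is a toy sketch, not a real framework's API; the two tests are hypothetical and deliberately order-dependent.

```python
import random
from typing import Callable


def run_in_random_order(tests: dict[str, Callable[[], None]], seed: int) -> list[str]:
    """Run tests in a seeded random order and return the names of those that failed."""
    names = sorted(tests)  # stable starting order, so the seed alone determines the shuffle
    random.Random(seed).shuffle(names)
    failures = []
    for name in names:
        try:
            tests[name]()
        except AssertionError:
            failures.append(name)
    return failures


shared: list[str] = []


def test_creates_user() -> None:
    shared.append("alice")


def test_lists_user() -> None:
    assert "alice" in shared  # silently depends on test_creates_user running first


suite = {"test_creates_user": test_creates_user, "test_lists_user": test_lists_user}


def run_once(seed: int) -> list[str]:
    shared.clear()  # start each suite run from a clean state
    return run_in_random_order(suite, seed)
```

Running `run_once` with different seeds exposes the hidden dependency whenever `test_lists_user` happens to run first, and re-running with the same seed reproduces that exact order.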
Isolate the test environment. Tests that pass locally and fail in CI, or pass in sequential execution and fail in parallel execution, are usually flaky due to environment differences or shared state. Reproducing CI conditions locally narrows the cause. See the manual vs automated testing guide and Astaqc's software testing services for structured approaches to test environment design.
Before eliminating flakiness, quarantine it. A quarantined test runs in CI but does not block deployment on failure. This removes the immediate pain — engineers stop losing time re-running pipelines — while preserving visibility into the flaky test's behavior for analysis.
Most test frameworks support a skip or pending mechanism, but a skipped test does not run at all, so it gives up the visibility that quarantine is meant to preserve. Some teams instead maintain a known-flaky list excluded from the blocking gate but included in a separate reporting job. Quarantined tests should have a deadline: a test that is quarantined indefinitely is a test that is never fixed. A two-week quarantine window with an assigned owner is a workable policy.
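A known-flaky list with a deadline can be expressed as a small gate function. The sketch below is one possible policy, not a standard mechanism: the registry, the owner names, and the choice to make a test blocking again once its window expires are all assumptions.

```python
from datetime import date, timedelta

QUARANTINE_WINDOW = timedelta(days=14)  # two-week window

# Hypothetical registry: test name -> (owner, date quarantined).
QUARANTINED = {
    "test_checkout_total": ("alice", date(2026, 6, 1)),
}


def blocking_failures(failed: list[str], today: date) -> list[str]:
    """Return the failures that should block deployment.

    A quarantined test is non-blocking only while its window is open.
    After the deadline it blocks again, forcing a fix-or-delete decision.
    """
    blocking = []
    for name in failed:
        entry = QUARANTINED.get(name)
        if entry is None or today > entry[1] + QUARANTINE_WINDOW:
            blocking.append(name)
    return blocking


failed_today = ["test_checkout_total", "test_login"]
print(blocking_failures(failed_today, date(2026, 6, 10)))  # ['test_login']
print(blocking_failures(failed_today, date(2026, 6, 20)))  # ['test_checkout_total', 'test_login']
```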
| Root Cause | Typical Symptom | Fix |
|---|---|---|
| Timing and async | Fails intermittently in CI, passes locally | Replace sleep/fixed waits with element-condition waits |
| Shared state | Fails after a specific other test runs | Reset database/storage after each test; use test isolation |
| Non-deterministic data | Fails when ambient data changes | Control test data creation; avoid ambient state assertions |
| External service calls | Fails intermittently on API timeout | Mock external services or add explicit retry logic with jitter |
| Resource contention | Fails only on loaded CI runners | Increase timeout budgets; move to dedicated CI runners |
| Test order dependence | Fails only in specific execution order | Enforce test isolation; run suite in randomized order |
| Selector instability | Fails after UI changes with no code change | Use stable selectors (data-testid, ARIA roles, visible text) |
Replace all absolute waits with condition waits. Every sleep or fixed-duration wait in a browser test is a red flag. The correct pattern is waiting for a specific element state: visible, enabled, containing expected text, or detached. Playwright's locator API auto-waits on all interaction and assertion calls. Selenium requires explicit WebDriverWait conditions. Time-based waits are almost always too short on slow machines and wasteful on fast ones.
Enforce test isolation at the data layer. Each test should set up the data it needs, run against that data, and tear it down. Database transactions rolled back after each test, or fresh database instances per test run, provide complete isolation. Tests that share a staging database with real users or other test runs are inherently non-deterministic unless data access is scoped to identifiers controlled by the test.
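The rollback approach can be sketched with Python's built-in `sqlite3` module. The `rolled_back` helper and the `users` table are hypothetical; a real suite would typically wire this into its framework's fixture mechanism and its own database driver.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def rolled_back(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Run a test body inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.rollback()


# isolation_level=None disables implicit transactions, so BEGIN above is the only one.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (name TEXT)")

with rolled_back(conn) as db:
    db.execute("INSERT INTO users VALUES ('alice')")
    assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

# The next test starts from an empty table again.
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The rollback runs in a `finally` clause, so the data is discarded even when the test body raises.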
Mock external dependencies. Any test that calls a real external API depends on that service's availability and response time. Mock these calls at the HTTP layer using tools like nock, WireMock, or MSW (Mock Service Worker). The test validates the application's behavior given a specific response; the service's behavior is tested separately in integration tests that run less frequently.
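The same idea can be shown in Python with the standard library's `unittest.mock`, used here in place of the tools named above. It patches the HTTP client function rather than intercepting traffic at the network layer. `fetch_exchange_rate` and its URL are hypothetical.

```python
import json
import urllib.request
from unittest import mock


def fetch_exchange_rate(currency: str) -> float:
    """Hypothetical application code that calls an external rates API."""
    url = f"https://rates.example.com/latest?currency={currency}"  # placeholder URL
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)["rate"]


def test_fetch_exchange_rate_parses_response() -> None:
    fake_response = mock.MagicMock()
    fake_response.__enter__.return_value = fake_response
    fake_response.read.return_value = b'{"rate": 1.25}'
    # Replace the network call so the test never leaves the process.
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        assert fetch_exchange_rate("EUR") == 1.25
    fake_urlopen.assert_called_once()


test_fetch_exchange_rate_parses_response()
```

The test pins the response body, so it checks only how the application parses a known payload and cannot fail because the remote service is slow or unavailable.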
Use stable selectors. CSS selectors that reference generated class names, DOM position, or framework-internal attributes break whenever the underlying implementation changes. Prefer data-testid or data-cy attributes placed explicitly for testing, ARIA roles and accessible names, or visible text. These attributes are stable across refactors because they express the element's purpose, not its implementation. See Astaqc's QA team service for expert support on selector strategy.
Delete tests that cannot be fixed. A test that has been flaky for six months with no path to a fix is not providing value. It is consuming CI time, developer attention, and blocking deployments. Delete it and replace it with a deterministic test that covers the same behavior through a different mechanism. The goal of a test suite is reliable signal, not maximum test count. See Astaqc's testing documentation services for help defining test standards.
Code review for test quality. Review test code with the same rigor as production code. Flag absolute waits, missing isolation, and external service calls without mocking. A checklist in pull request templates for test-relevant changes reduces the rate of known-bad patterns entering the suite.
Run new tests multiple times before merging. Require that new tests pass several consecutive CI runs before the PR is merged. A test that passes 5 of 5 runs has a lower probability of being flaky than one that passed once. This catches timing-sensitive tests before they enter the main suite.
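How much protection repeated runs provide depends on how flaky the test is. A rough sketch, assuming each run fails independently with a fixed probability (the 10% rate is illustrative):

```python
def prob_flaky_test_slips_through(flake_rate: float, required_passes: int) -> float:
    """Probability that a test with the given failure rate passes every required run."""
    return (1.0 - flake_rate) ** required_passes


# A test that fails 10% of the time:
print(f"{prob_flaky_test_slips_through(0.10, 1):.0%}")   # 90%
print(f"{prob_flaky_test_slips_through(0.10, 5):.0%}")   # 59%
print(f"{prob_flaky_test_slips_through(0.10, 20):.0%}")  # 12%
```

Under that assumption, five consecutive passes reduce the risk but still let a 10%-flaky test through more often than not, so the required run count is worth choosing against the flakiness rate the team wants to catch.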
Monitor flakiness rate as a team metric. Track flakiness rate (flaky failures as a percentage of total test runs) over time and set a threshold. Treat breaches as a maintenance priority. The software testing cost guide provides a framework for quantifying the cost of test suite unreliability.
Invest in test infrastructure. Dedicated CI runners with consistent resource allocation, containerized test environments with pinned dependencies, and database snapshots for fast isolation all reduce the environmental variability that enables timing and state-related flakiness. For guidance on test infrastructure strategy, see Astaqc's test automation services.
The most reliable method is to configure CI to re-run failed tests once and log both the first-run and second-run result. Tests that fail on the first run and pass on the second are flaky. After a few weeks of data, sort your test results by failure-then-pass rate to produce a ranked flakiness list. Most CI platforms support automatic re-run on failure natively.
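Given logged first-run and re-run outcomes, the ranking step is a small aggregation. The tuple layout below is an assumed log format chosen for illustration; adapt it to whatever your CI system actually exports.

```python
from collections import defaultdict
from typing import Iterable, Optional

# (test name, first run passed, re-run passed or None when no re-run happened)
Attempt = tuple[str, bool, Optional[bool]]


def rank_flaky_tests(attempts: Iterable[Attempt]) -> list[tuple[str, float]]:
    """Rank tests by how often they failed first and then passed on re-run."""
    total = defaultdict(int)
    fail_then_pass = defaultdict(int)
    for name, first_passed, rerun_passed in attempts:
        total[name] += 1
        if not first_passed and rerun_passed:
            fail_then_pass[name] += 1
    rates = [(name, fail_then_pass[name] / total[name]) for name in total]
    return sorted((r for r in rates if r[1] > 0), key=lambda r: r[1], reverse=True)


log = [
    ("test_search", False, True),
    ("test_search", True, None),
    ("test_login", False, False),  # failed twice: likely a real failure, not counted as flaky
    ("test_cart", False, True),
    ("test_cart", False, True),
]
ranking = rank_flaky_tests(log)
print(ranking)  # [('test_cart', 1.0), ('test_search', 0.5)]
```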
A fixed wait is justified only in rare cases: when a specific delay is a business requirement (a deliberate rate limit, an animation that must complete before the next interaction is valid, or a third-party script that injects content after a fixed delay). In all other cases, a fixed wait is a sign of missing element-condition waiting and should be replaced. Many teams enforce a lint rule that flags time-based waits in test code.
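For Python test code, such a rule can be approximated with the standard `ast` module. This is a minimal sketch rather than a complete linter: it only recognizes `time.sleep(...)` and bare `sleep(...)` calls, and the sample test (including its `page` object) is hypothetical.

```python
import ast


def find_sleep_calls(source: str) -> list[int]:
    """Return the line numbers of `time.sleep(...)` and bare `sleep(...)` calls."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        is_bare_sleep = isinstance(func, ast.Name) and func.id == "sleep"
        if is_time_sleep or is_bare_sleep:
            lines.append(node.lineno)
    return sorted(lines)


sample = '''import time

def test_modal_opens(page):
    page.click("#open")
    time.sleep(2)
    assert page.is_visible("#modal")
'''
print(find_sleep_calls(sample))  # [5]
```

A check like this can run in CI against the test directory, with an explicit allowlist for the rare delays that are genuine requirements.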
Retry logic is a useful quarantine strategy, not a fix. Automatically retrying a flaky test masks the symptom — the pipeline passes — but does not eliminate the non-determinism. In high-volume pipelines, a flaky test with automatic retry costs more CI compute than a reliable test. Fix the root cause; use retry as a short-term quarantine while you do.
CI-only failures usually indicate environment differences: different browser versions, different resource constraints, different network latency, or different database state. Start by comparing the CI environment specification against your local environment. Add verbose logging to the test to capture element state and page content at the point of failure. Run the test in a Docker container matching the CI image locally to reproduce the constraint.
A flaky test fails non-deterministically on the same codebase. A brittle test fails deterministically when the application changes in ways unrelated to the test's intended coverage — a renamed CSS class, a restructured DOM, a moved UI element. Flakiness is a timing or state isolation problem; brittleness is a selector stability and abstraction problem. Both degrade suite value, but they have different root causes and different fixes. See Astaqc's manual testing services and performance testing services.
Frame it in concrete cost. Count the number of CI pipeline re-runs caused by flakiness over the past month and multiply by the sum of the average pipeline runtime and the engineer-minutes consumed investigating each failure. A pipeline that re-runs 50 times per month with a 20-minute runtime and 10 minutes of investigation time per failure represents 50 × 30 minutes, or 25 hours, of lost engineering time monthly, which is more than enough to justify a focused maintenance sprint. The software testing cost guide provides a framework for quantifying QA investment returns.
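The calculation is easy to rerun with your own numbers. A small sketch follows, in which the function name is illustrative and the model deliberately ignores secondary costs such as context switching:

```python
def monthly_hours_lost(reruns: int, pipeline_minutes: float, investigation_minutes: float) -> float:
    """Engineer-hours lost per month to flaky re-runs (pipeline wait plus investigation)."""
    return reruns * (pipeline_minutes + investigation_minutes) / 60


# The figures from the example above: 50 re-runs, 20-minute pipeline, 10 minutes of investigation.
print(monthly_hours_lost(50, 20, 10))  # 25.0
```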
