TL;DR: A flaky test is an automated test that passes and fails intermittently on the same code without any changes. Google’s testing research found that 84% of test transitions from pass to fail in their CI system were caused by flaky tests, not actual code regressions. Atlassian reports wasting over 150,000 developer hours per year investigating flaky failures. This guide explains why flaky tests happen, how to detect them, and practical fixes that actually work, including how AI self healing and auto waiting address the most common root causes.

Definition: Flaky Test

An automated test that produces inconsistent results (sometimes passes, sometimes fails) when executed against the same code revision under the same conditions. Flaky tests are caused by non deterministic factors such as timing dependencies, race conditions, test order coupling, shared state, network variability, or brittle element selectors. The key characteristic: the failure does not correlate with a code change. The same commit can produce a pass on one run and a fail on the next.

Quick Answers:

What causes flaky tests? The five most common causes are: async timing issues (test does not wait long enough for an operation to complete), race conditions in concurrent code, shared state between tests (one test modifies data another test depends on), environment differences (test passes locally but fails in CI), and brittle UI selectors (element IDs or class names change between builds). Academic research across 201 Apache project fixes found 45% of flakiness comes from async wait issues alone.

How much do flaky tests cost? Google found that at a 1.5% per run flake rate with 1,000 tests, approximately 15 tests fail in every release cycle, each requiring investigation. Atlassian documented 150,000+ developer hours per year lost to flakiness. Slack Engineering reported their main branch had only a 20% pass rate before automated flaky test handling, with 57% of build failures caused by flaky test jobs.

Can AI fix flaky tests? Yes, for the most common causes. AI self healing (like ContextQA’s AI based self healing) eliminates selector based flakiness by automatically identifying elements through multiple strategies when the primary locator breaks. AI failure classification (like ContextQA’s root cause analysis) separates flaky failures from real bugs, so developers stop wasting time investigating false alarms.

The Scale of the Flaky Test Problem

I need to start with numbers because most teams underestimate how much flakiness actually costs them.

Google’s testing blog published the most widely cited data on test flakiness. Almost 16% of all tests at Google exhibit some level of flakiness. When a test transitions from passing to failing in their post submit CI system, 84% of the time it is a flaky test, not a real regression. Read that again. 84% of failures are noise, not signal.

The math scales badly. At a modest 1.5% per run flake rate across a project with 1,000 tests, roughly 15 tests will show red in any given release cycle. Each one demands investigation: is this a real bug or a false alarm? At 30 to 90 minutes per investigation, that is 7 to 22 hours of developer time per release cycle spent chasing ghosts.

Atlassian published data in December 2025 showing flakiness wastes over 150,000 developer hours per year across their engineering organization. In the Jira Frontend repository specifically, flaky tests were responsible for up to 21% of master build failures.

Slack Engineering revealed even more dramatic numbers. Before they built automated flaky test detection, their main branch had a 20% pass rate. Not 20% failure rate. 20% pass rate. Of failing builds, 57% failed due to test job failures (not compilation, not deployment, tests). After implementing automated detection and suppression, false failures dropped to under 4%.

And this problem is getting worse, not better. An SD Times analysis of Bitrise data across 10 million+ mobile builds from January 2022 to June 2025 found the proportion of teams experiencing flaky tests grew from 10% to 26%. That is a 160% increase in three years.

A February 2025 academic study on GitHub Actions analyzing 1,960 open source Java projects found 51.28% of all projects are affected by flaky builds, and 67.73% of rerun builds exhibit flaky behavior.

ContextQA was built to address exactly this problem. The IBM ContextQA case study explicitly documents flakiness elimination after migrating to the platform’s AI engine. When your tests stop lying to you, your CI pipeline becomes a reliable quality signal instead of a noise generator.

The Five Root Causes of Flaky Tests (With Fixes)

I am going to be specific about each cause because “your tests are flaky” is not actionable. Knowing exactly why they are flaky is.

Cause 1: Async Wait Issues (45% of All Flakiness)

Academic research analyzing 201 fixes across 51 Apache projects (Luo et al., FSE 2014) found that 45% of flaky test fixes addressed async timing issues. This is by far the most common cause.

What happens: Your test clicks a button, then immediately checks if the result appeared. Sometimes the server responds in 50ms and the test passes. Sometimes it takes 200ms and the test fails because the element has not rendered yet.

The fix that does not work: Adding sleep(2000). Hardcoded waits make tests slow without making them reliable. If the response takes 2.5 seconds, the test still fails. If it takes 50ms, you wasted 1,950ms.

The fix that works: Explicit waits with conditions. Wait until the specific element is visible, the network request completes, or the loading spinner disappears. Modern frameworks support this natively.
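The condition based wait pattern can be sketched in plain Python. This is a toy illustration of what frameworks like Selenium’s WebDriverWait or Playwright’s auto waiting do internally; wait_until and the simulated element_visible check are illustrative names, not a real framework API:

```python
import time

def wait_until(condition, timeout=5.0, poll_interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Unlike a fixed sleep, this proceeds as soon as the condition holds,
    so the test is fast on quick responses and tolerant of slow ones.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll_interval)

# Simulate an element that "renders" roughly 120 ms after the click.
rendered_at = time.monotonic() + 0.12
element_visible = lambda: time.monotonic() >= rendered_at

wait_until(element_visible, timeout=2.0)  # proceeds as soon as the element appears
```

The same test with sleep(2000) would wait the full two seconds on a 50ms response and still fail on a 2.5 second one; the conditional wait does neither.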

ContextQA’s web automation handles this automatically through AI powered auto waiting that detects when the application is ready for the next interaction, not based on arbitrary time delays but on actual DOM state and network activity.

Cause 2: Brittle UI Selectors (30% of Maintenance Flakiness)

This one drives me crazy because it is entirely preventable.

What happens: Your test locates a button using a CSS class like .btn-primary-v2-updated or an auto generated ID like #ember1847. A developer refactors the CSS, and suddenly every test that depends on that selector fails. The button still works perfectly. The test just cannot find it.

The fix: Use stable locators. Data test attributes (data-testid="checkout-button"), accessible roles (role="button" name="Checkout"), or text content ("Checkout"). These survive CSS refactors because they are tied to functionality, not styling.
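A toy illustration of why functional attributes survive refactors while styling selectors do not. The dict based element snapshots and finder functions are hypothetical stand-ins for real DOM queries:

```python
# Two snapshots of the same button: before and after a CSS refactor.
before = {"class": "btn-primary-v2", "data-testid": "checkout-button", "text": "Checkout"}
after  = {"class": "btn-cta-refactored", "data-testid": "checkout-button", "text": "Checkout"}

def find_by_class(el, cls):
    # Locating by styling class: breaks whenever the CSS is refactored.
    return el if el["class"] == cls else None

def find_by_testid(el, tid):
    # Locating by a functional test attribute: stable across refactors.
    return el if el["data-testid"] == tid else None

assert find_by_class(before, "btn-primary-v2") is not None
assert find_by_class(after, "btn-primary-v2") is None        # styling selector broke
assert find_by_testid(after, "checkout-button") is not None  # functional attribute survived
```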

But even with the best locator strategy, selectors break. That is why ContextQA’s AI based self healing maintains a multi layered element fingerprint. When the primary selector fails, the AI tries visual matching, accessibility ID, text content, relative DOM position, and surrounding context. It finds the element and updates the test automatically. No human intervention. No broken build.

Cause 3: Shared State Between Tests (15% of Flakiness)

What happens: Test A creates a user record. Test B assumes that record exists. When Test A runs first, Test B passes. When the test runner executes them in a different order (or in parallel), Test B fails because the user record does not exist yet.

The fix: Each test creates its own data and cleans up after itself. No test should depend on another test’s output. If you need shared test data, use a dedicated test data setup phase that runs before the entire suite, not within individual tests. ContextQA’s AI data validation ensures test data consistency and completeness across environments.

Cause 4: Environment Differences (CI vs Local)

What happens: Tests pass on a developer’s laptop but fail in the CI pipeline. The causes are usually: different browser versions, different screen resolutions affecting visual tests, different timezone settings, different database states, or resource constraints (CI containers typically have less CPU and memory than developer machines).

The fix: Run tests in containers that match your CI environment. Lock browser versions. Use headless mode consistently. Set explicit timezone and locale configurations. ContextQA’s digital AI continuous testing runs tests in consistent, production like environments through integrations with Jenkins, GitHub Actions, GitLab CI, and CircleCI.

Cause 5: Race Conditions and Concurrency (10% of Flakiness)

What happens: Two operations execute simultaneously and occasionally interfere with each other. A database write and a database read happening at the same time. Two UI events firing in unpredictable order. A web socket message arriving between the click and the assertion.

The fix: This is the hardest category to fix because race conditions are inherently non deterministic. The most effective approach is to isolate test environments (each test gets its own database state), serialize operations that have ordering dependencies, and use eventual consistency patterns (retry assertions for a bounded period). ContextQA’s root cause analysis specifically classifies concurrency related failures, so teams can identify patterns rather than chasing individual incidents.
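The bounded retry assertion mentioned above might look like the following sketch. assert_eventually is an illustrative helper, not a specific library function; libraries in this space expose similar primitives:

```python
import time

def assert_eventually(predicate, timeout=2.0, interval=0.05):
    """Retry an assertion for a bounded period instead of asserting once.

    Useful under eventual consistency: a write may take a moment to
    become visible to a subsequent read, but must do so within `timeout`.
    """
    deadline = time.monotonic() + timeout
    last_err = None
    while time.monotonic() < deadline:
        try:
            assert predicate()
            return
        except AssertionError as err:
            last_err = err
            time.sleep(interval)
    raise last_err or AssertionError("condition never became true")

# Simulated eventually consistent store: reads lag the write by ~100 ms.
written_at = time.monotonic()
read_visible = lambda: time.monotonic() - written_at > 0.1

assert_eventually(read_visible)  # passes once the read catches up
```

The bounded timeout is what separates this from papering over a bug: if the condition never holds within the budget, the test still fails loudly.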

The Flaky Test Detection and Response Framework

Here is the practical framework that works at scale, based on approaches validated at Google, Atlassian, and Slack.

| Step | Action | Tool | Time |
| --- | --- | --- | --- |
| 1. Detect | Track test results across runs. A test that fails intermittently on the same commit is flaky. | CI dashboard, custom analysis | Automated |
| 2. Quarantine | Move confirmed flaky tests to a separate suite that runs but does not block the pipeline. | CI pipeline configuration | 5 min per test |
| 3. Classify | Determine root cause: async, selector, shared state, environment, or concurrency. | ContextQA root cause analysis | Automated |
| 4. Fix | Apply the targeted fix for the specific root cause category. | Development work | 30 min to 4 hrs |
| 5. Monitor | Track flakiness rate as a metric. Set a budget: max 2% flake rate. Alert when exceeded. | ContextQA AI insights | Automated |

The critical principle: Do not ignore flaky tests. Do not delete them. Quarantine them, fix them, and return them to the active suite. A deleted flaky test is a coverage gap. A quarantined flaky test is a known issue with a fix timeline.
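The detection step can be sketched as a script that compares outcomes across runs of the same commit; any test with both a pass and a fail on one commit is flaky by definition. find_flaky_tests and the sample data are illustrative; a real pipeline would pull these records from your CI system:

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Flag tests whose outcome differs across runs of the SAME commit.

    `runs` is a list of (commit, {test_name: "pass" | "fail"}) tuples.
    A test is flaky if, for some single commit, it both passed and failed.
    """
    outcomes = defaultdict(set)  # (commit, test) -> set of observed outcomes
    for commit, results in runs:
        for test, outcome in results.items():
            outcomes[(commit, test)].add(outcome)
    return sorted({test for (commit, test), seen in outcomes.items() if len(seen) > 1})

runs = [
    ("abc123", {"test_login": "pass", "test_checkout": "pass"}),
    ("abc123", {"test_login": "pass", "test_checkout": "fail"}),  # same commit, different result
    ("def456", {"test_login": "fail", "test_checkout": "pass"}),  # fails on a NEW commit: possibly a real bug
]
print(find_flaky_tests(runs))  # ['test_checkout']
```

Note that test_login is not flagged: it failed only after a new commit, which is exactly the signal (a possible real regression) that the flaky noise was drowning out.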

ContextQA’s Approach to Flakiness Elimination

ContextQA addresses flakiness at three levels simultaneously.

Level 1: Prevention. AI based self healing prevents the most common source of flakiness (brittle selectors) by maintaining multiple element identification strategies that update automatically when UI changes occur. The IBM ContextQA case study documents flakiness elimination after migration.

Level 2: Detection. Root cause analysis classifies every failure automatically: real bug, selector issue, timing issue, environment problem, or transient flake. QA teams see classified failures, not undifferentiated red builds.

Level 3: Measurement. AI insights and analytics tracks flakiness metrics over time: flake rate per test, per module, per environment. Teams can set budgets (max 2% flake rate) and get alerts when flakiness spikes.

G2 verified reviews show the combined impact: 50% regression time reduction and 80% automation rates. A significant portion of that reduction comes from eliminating false failures that previously consumed investigation time.

Deep Barot, CEO and Founder of ContextQA, described the platform’s philosophy on testing reliability in a DevOps.com interview: running the right test at the right time. A flaky test is never the “right test” because its results cannot be trusted. Eliminating flakiness is a prerequisite for intelligent test selection.

Limitations and What AI Cannot Fix

Race conditions in production code are not a testing problem. If your application has a race condition, fixing the test is the wrong response. Fix the application. AI can detect that the failure is concurrency related, but the fix requires understanding the business logic.

Some flakiness is signal, not noise. A test that fails intermittently due to a timeout might be revealing a real performance problem: the server is occasionally slow under load. Before quarantining a flaky test, verify that it is not catching a real (intermittent) bug.

Zero flakiness is not a realistic goal. Even Google, with massive investment in test infrastructure, reports 16% of tests exhibiting some flakiness. The goal is managing flakiness to a budget (under 2% per run), not eliminating it entirely. The difference between 20% flake rate and 2% flake rate is the difference between a useless CI pipeline and a reliable one.

Team culture matters more than tooling. The biggest companies fighting flakiness (Google, Atlassian, Slack) all report that cultural norms are as important as technical fixes. If your team’s response to a flaky test is “just re run it,” you are building a re run culture that will compound over time. Flaky tests need owners. They need SLAs for resolution. They need visibility in sprint retrospectives. Without organizational accountability, even the best AI detection tools will produce reports that nobody acts on. The teams that successfully manage flakiness treat the flake rate as a first class engineering metric alongside build time, deployment frequency, and change failure rate. ContextQA’s AI insights dashboard makes this metric visible to the entire team, not buried in CI logs.

Do This Now Checklist

  1. Measure your current flake rate (10 min). Check your last 10 CI runs. How many tests failed that were not real bugs? That percentage is your flake rate. If it is over 5%, flakiness is actively degrading your team’s productivity.
  2. Identify your top 5 flakiest tests (10 min). Which specific tests fail most often without code changes? Those are your highest ROI fixes.
  3. Replace hardcoded waits with explicit waits (20 min). Search your test code for sleep(), wait(), or hardcoded timeout values. Replace each with a condition based wait.
  4. Add data test attributes to critical UI elements (15 min). Add data-testid to your top 10 most tested elements. This stabilizes selectors immediately.
  5. Enable self healing for UI tests (15 min). ContextQA’s AI based self healing eliminates selector based flakiness automatically.
  6. Start a ContextQA pilot (15 min). Benchmark flakiness reduction alongside automation improvement over 12 weeks.
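The flake rate measurement from step 1 can be sketched as a crude script. flake_rate and the sample run data are illustrative assumptions; a real version would parse JUnit XML reports or your CI provider’s API:

```python
def flake_rate(runs, real_bug_tests=frozenset()):
    """Crude flake rate estimate over recent CI runs.

    `runs`: list of {test_name: "pass" | "fail"} dicts, one per CI run.
    `real_bug_tests`: tests whose failures were confirmed real regressions.
    Returns the share of test executions that failed for non-bug reasons.
    """
    total = flaky_failures = 0
    for results in runs:
        for test, outcome in results.items():
            total += 1
            if outcome == "fail" and test not in real_bug_tests:
                flaky_failures += 1
    return flaky_failures / total if total else 0.0

recent = [
    {"test_a": "pass", "test_b": "fail", "test_c": "pass"},
    {"test_a": "pass", "test_b": "pass", "test_c": "fail"},
]
# test_b's failure is unexplained (flaky); test_c's failure is a confirmed bug.
rate = flake_rate(recent, real_bug_tests={"test_c"})
print(f"{rate:.1%}")  # → 16.7%
```

One flaky failure out of six executions gives roughly 16.7%, well above the 2% budget suggested earlier, so this suite would trigger an alert.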

Conclusion

Flaky tests are the single biggest productivity drain in automated testing. At Google, 84% of pass to fail transitions in CI are caused by flaky tests, not real regressions. Atlassian loses 150,000+ developer hours per year. Slack’s main branch had a 20% pass rate before addressing flakiness.

The fix is not manual. AI self healing prevents selector flakiness. Root cause analysis classifies failures automatically. Analytics track flakiness as a measurable metric with budgets and alerts.

ContextQA’s AI based self healing, automated root cause analysis, and real time analytics eliminate flakiness at the prevention, detection, and measurement levels simultaneously.

Book a demo to see how ContextQA eliminates flaky tests in your CI pipeline.

Frequently Asked Questions

What is a flaky test?
A flaky test passes and fails intermittently on the same code without any changes. It is caused by non deterministic factors like timing issues, brittle selectors, shared state, or environment differences. Google found 84% of CI pass to fail transitions are flaky tests, not real regressions.

Why are flaky tests becoming more common?
Three factors: faster CI/CD pipelines expose timing sensitivity, complex microservices introduce more environmental variability, and AI generated code increases test volume without proportional maintenance investment. Bitrise data shows teams experiencing flakiness grew from 10% to 26% in three years.

How do you detect flaky tests?
Track test results across multiple runs on the same commit. Any test that passes on one run and fails on another without code changes is flaky. Automated detection compares results across 3 to 5 runs and flags inconsistencies. ContextQA’s root cause analysis classifies failures automatically.

What is the most common cause of flaky tests?
Async timing issues account for 45% of all flakiness. The test does not wait long enough for an operation to complete. The fix: replace hardcoded sleep() calls with explicit condition based waits that check for specific DOM states or network completion.

Can AI fix flaky tests?
Yes, for the most common UI test flakiness cause (brittle selectors). AI self healing maintains multiple element identification strategies and automatically updates when the primary locator breaks due to CSS refactoring, DOM restructuring, or dynamic ID changes. The IBM ContextQA case study documents flakiness elimination as a direct result of self healing implementation. Teams that previously spent 40% to 70% of their automation effort on selector maintenance report near zero maintenance after enabling self healing through ContextQA.

Smarter QA that keeps your releases on track

Build, test, and release with confidence. ContextQA handles the tedious work, so your team can focus on shipping great software.

Book A Demo