Estimated Bug Density Metric

Last updated: March 30, 2026

The Estimated Bug Density (90d) metric asks: "For a given PR, how many bugs did it introduce?" It is a code quality metric — lower values indicate higher code quality. This metric helps engineering teams understand how frequently merged code introduces bugs into production, and how that rate relates to the amount of AI used to produce the original code.

How does Span derive this metric?

Overview

Span automatically classifies every pull request your team authors into an investment category — new features, functionality improvements, documentation, or bug fixes. The Estimated Bug Density metric uses this classification to measure the density of bugs in AI-generated code versus human-authored code, helping teams understand where defects originate and how to prevent them.

How it works

Estimating bug defects involves three distinct stages: classifying whether a PR is a bug fix, tracing that bug back to its root cause PR, and attributing the code in that root cause PR to AI or human authorship. Each stage applies a confidence threshold to ensure only high-signal results are surfaced.

Stage 1: Bug classification

Span's LLM follows an explicit decision tree to classify PRs. Before a PR can be counted as a bug fix, it is filtered against a set of exclusion criteria — test-only changes, refactors, feature work, dependency updates, and long-lived bugs are all excluded. The remaining PRs are evaluated using several signals:

  • A linked Jira ticket marked as a bug

  • A bug label applied to the pull request

  • Analysis of the diff content itself, which Span uses to infer the nature of the change even without explicit metadata

When the model is uncertain between "root cause" and "can't tell," it defaults to the latter. A PR is only classified as a bug fix when confidence reaches 0.7 or higher.
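
To make the Stage 1 gate concrete, here is a minimal Python sketch under stated assumptions: the exclusion criteria and the 0.7 confidence threshold come from the description above, while the PR fields, the Classification shape, and the classify() call are hypothetical stand-ins for Span's LLM classifier.

    # Minimal sketch of the Stage 1 gate. The exclusion criteria and the 0.7
    # threshold are described above; the data shapes and classify() are
    # hypothetical placeholders for Span's LLM classifier.
    from dataclasses import dataclass

    EXCLUDED_KINDS = {"test-only", "refactor", "feature", "dependency-update", "long-lived-bug"}
    BUG_FIX_CONFIDENCE_THRESHOLD = 0.7

    @dataclass
    class Classification:
        label: str         # e.g. "bug_fix" or "cant_tell"
        confidence: float  # 0.0 to 1.0

    def is_counted_as_bug_fix(pr, classify) -> bool:
        """Apply the exclusion filter, then require a high-confidence bug-fix label."""
        if pr.kind in EXCLUDED_KINDS:  # hypothetical pre-classification of the PR
            return False
        # classify() stands in for the LLM that weighs the linked Jira ticket,
        # PR labels, and the diff content itself.
        result: Classification = classify(pr)
        return result.label == "bug_fix" and result.confidence >= BUG_FIX_CONFIDENCE_THRESHOLD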

Stage 2: Root cause attribution

When a bug fix PR is detected, Span spins up an AI agent in a sandboxed environment that works backwards through the git history — using blame, logs, and diffs — to pinpoint every PR that could have introduced the issue. Root cause attribution only sticks when the agent's confidence reaches 0.7 or higher, so low-signal guesses are never surfaced to your team.
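
A similar gate can be sketched for root cause attribution. Only the 0.7 threshold comes from this page; trace_candidates() is a hypothetical placeholder for the sandboxed agent that walks blame, logs, and diffs and yields (candidate PR, confidence) pairs.

    # Sketch of the Stage 2 confidence gate. trace_candidates() is a placeholder
    # for the sandboxed agent that traverses git history; only the 0.7 threshold
    # is from this page.
    ROOT_CAUSE_CONFIDENCE_THRESHOLD = 0.7

    def attribute_root_causes(bug_fix_pr, trace_candidates):
        """Keep only candidates the agent is confident actually introduced the bug."""
        return [
            candidate
            for candidate, confidence in trace_candidates(bug_fix_pr)
            if confidence >= ROOT_CAUSE_CONFIDENCE_THRESHOLD
        ]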

Stage 3: AI attribution

Once the root cause PR is identified, Span determines how much of that code was AI-generated using span-detect-1, an in-house ML model that achieves 95% accuracy in lab benchmarks. To avoid overstating precision, results are grouped into four tiers — none, low, medium, and high — rather than a raw percentage.
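
As a rough illustration of the bucketing, the sketch below maps a raw AI-authorship estimate to the four tiers named above; the numeric boundaries are assumptions for illustration, not Span's actual cut-offs.

    # Sketch of the Stage 3 tier bucketing. The four tier names are from this
    # page; the numeric boundaries below are illustrative assumptions.
    def ai_attribution_tier(ai_generated_fraction: float) -> str:
        """Map a raw AI-authorship estimate (0.0 to 1.0) to a coarse tier."""
        if ai_generated_fraction == 0.0:
            return "none"
        if ai_generated_fraction < 0.25:   # assumed boundary
            return "low"
        if ai_generated_fraction < 0.60:   # assumed boundary
            return "medium"
        return "high"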

What you can see

Clicking into the Estimated Bug Density score opens a detailed view of individual bugs, including:

  • A summary of what the bug was

  • The fix PR that resolved it

  • The root cause PR that introduced it

  • The AI attribution tier of that originating PR

How teams use this

Teams use the Estimated Bug Density report to identify common themes in bugs that are escaping to production — and then work backward to ask: how could we have caught this earlier? Common follow-up actions include adding targeted linting rules, strengthening code review guardrails, or updating testing practices for the categories of bugs that surface most frequently.

Metric definition

Est. Bug Density (90d) = Total Escaped Defects ÷ Total Weighted PRs

Numerator — Escaped Defects

Counts all defects linked back to a PR and discovered within 90 days of that PR being merged. This includes bugs attributed to AI-generated code, human-written code, or unclassified code.

Denominator — Weighted PRs

Rather than counting each PR equally, every PR is assigned a weight that combines complexity and size. See here for more information on Weighted PRs in Span.

The total weight for a PR is the sum of its complexity and size scores. This normalization prevents the metric from being skewed by teams that ship many small, trivial PRs versus teams that ship fewer, larger ones.
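
Putting the definition together, a minimal sketch of the calculation could look like the following; the 90-day discovery window and the complexity-plus-size weighting come from this page, while the field names are hypothetical.

    # Sketch of the metric definition: escaped defects divided by total weighted
    # PRs. The 90-day window and the complexity + size weighting are from this
    # page; the field names are hypothetical.
    from datetime import timedelta

    OBSERVATION_WINDOW = timedelta(days=90)

    def estimated_bug_density(prs, defects) -> float:
        """prs: merged PRs with .complexity, .size, .merged_at;
        defects: defects with .root_cause_pr and .discovered_at."""
        escaped = sum(
            1
            for d in defects
            if d.discovered_at - d.root_cause_pr.merged_at <= OBSERVATION_WINDOW
        )
        total_weight = sum(pr.complexity + pr.size for pr in prs)
        return escaped / total_weight if total_weight else 0.0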

Filters Applied

  • Only pull requests from development contributors are included (administrative or non-engineering accounts are excluded)

  • PRs merged on out-of-office days are excluded

The 90-Day Observation Window

Defects are only counted if they are discovered within 90 days after the PR that caused them was merged. This window reflects the typical timeframe in which escaped bugs surface in production.

 Note: Teams with slower defect discovery cycles (e.g., infrequent releases or delayed monitoring) may see lower-than-actual bug density since defects surfacing after 90 days are not captured.

Date Range Adjustment

When viewing this metric in a report, the selected date range is automatically shifted back by 90 days. This ensures the metric reflects PRs from the period when defects would realistically be discovered, rather than the discovery date itself.

The metric asks how many defects each PR introduced, but you can't answer that question the day a PR merges — you need to wait and see if defects surface. Span waits 90 days before considering a PR's defect count complete.

So the data you see at any point in time is always offset 90 days into the past:

  • Select "Last 2 weeks" → shows data from ~90–104 days ago

  • Select "Last 3 months" → shows data from ~90–180 days ago
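
A small sketch of that shift, assuming the selected range is expressed in days; the 90-day offset is from this page, and the function shape and example dates are purely illustrative.

    # Sketch of the 90-day shift applied to a selected date range. The offset is
    # from this page; the function shape and example dates are illustrative.
    from datetime import date, timedelta

    OBSERVATION_WINDOW_DAYS = 90

    def shifted_range(range_days: int, today: date) -> tuple[date, date]:
        """Return (start, end) of the period that is actually shown."""
        end = today - timedelta(days=OBSERVATION_WINDOW_DAYS)   # newest fully observed PR
        start = end - timedelta(days=range_days)                # then go back by the selected range
        return start, end

    # "Last 2 weeks" selected on 2026-06-30 shows PRs merged roughly 90-104 days ago:
    print(shifted_range(14, date(2026, 6, 30)))  # (datetime.date(2026, 3, 18), datetime.date(2026, 4, 1))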

Why the most recent 1–2 months are blank

A PR merged 60 days ago is still inside its observation window — it could still produce bugs over the next 30 days. Span won't include it in the metric until the full 90-day window has elapsed, because counting its defects now would be artificially low and misleading.

Additionally, Span requires that at least 70% of PRs in the selected period have completed their full 90-day window before surfacing the metric at all. If that threshold isn't met, it shows "Insufficient Data."

This is a deliberate accuracy-over-recency decision: publishing the metric prematurely (with incomplete defect counts) would make code quality appear better than it is, since most bugs haven't had time to surface yet. The 90-day lag ensures what you see reflects a true, complete picture.

Where to Find It

This metric appears in the AI Transformation Report, under the Quality section. It is displayed as a bar-line chart over time, allowing you to track trends in code quality at the team or organizational level.

Clicking into the metric opens a Detected Defects drilldown view showing:

  • Individual defects and the PRs they were attributed to

  • Bug-fix PRs

  • AI classification of the code (AI-generated vs. human-written)

Data Readiness Requirements

This metric requires Span's root-cause analysis pipeline to be active. This pipeline links discovered bugs back to the PRs that introduced them. The metric is only shown when sufficient data has been processed:

  • Minimum PRs analyzed: ≥ 10 PRs

  • Minimum coverage: ≥ 70% of PRs in the period have been fully evaluated

If these thresholds are not met, the metric displays an "Insufficient data" badge with the message:

"Insufficient number of PRs for analysis. Requires at least 10 PRs and ≥70% of PRs in the range analyzed."

This is intentional — publishing the metric before enough PRs have been processed would produce a misleading picture of code quality.
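
For illustration, the readiness gate can be sketched as a simple check; the two thresholds come from the requirements above, and the function is a hypothetical stand-in.

    # Sketch of the readiness gate. The two thresholds are from this page;
    # the function is an illustrative stand-in.
    MIN_PRS = 10
    MIN_COVERAGE = 0.70

    def metric_status(total_prs: int, fully_evaluated_prs: int) -> str:
        if total_prs < MIN_PRS:
            return "Insufficient data"
        if fully_evaluated_prs / total_prs < MIN_COVERAGE:
            return "Insufficient data"
        return "ready"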

Interpreting the Metric

  • Trending down: Code quality is improving — fewer bugs per unit of work

  • Trending up: Code quality may be declining — more escaped defects per PR

  • Stable and low: Healthy signal — the team is consistently shipping reliable code

  • "Insufficient data": The analysis pipeline hasn't processed enough PRs yet for the selected period

Since the metric is weighted by complexity and size, a spike may indicate either more defective code or a period of unusually complex, high-risk work — use the drilldown view to investigate.

AI Breakdown

In the AI Transformation Report, you can view bug density broken down by AI code ratio buckets — comparing defect rates between PRs with high AI-generated code versus PRs with predominantly human-written code. This helps teams understand whether AI-assisted development is affecting code quality positively or negatively.
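
One way to picture this breakdown is to compute the same escaped-defects-per-weighted-PR ratio separately per AI attribution tier, as in the sketch below; it assumes each PR already carries an ai_tier from Stage 3 and uses hypothetical field names.

    # Sketch of the AI breakdown: the same defects-per-weighted-PR ratio computed
    # separately per AI attribution tier. Field names are hypothetical.
    from collections import defaultdict

    def bug_density_by_ai_tier(prs, escaped_defects_by_pr_id):
        weights = defaultdict(float)
        defects = defaultdict(int)
        for pr in prs:
            weights[pr.ai_tier] += pr.complexity + pr.size
            defects[pr.ai_tier] += escaped_defects_by_pr_id.get(pr.id, 0)
        return {
            tier: (defects[tier] / weights[tier] if weights[tier] else 0.0)
            for tier in weights
        }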

Prerequisites

To use this report, your organization must have:

  • source control integration (e.g., GitHub, GitLab) with pull request data flowing into Span

Frequently Asked Questions

How do we identify PRs as Bug Fixes if they weren’t labeled?

We use a classification engine to categorize PRs and issues into areas like New Features, Maintenance, and Developer Experience. Within Maintenance, an LLM assigns a BugFix label when a PR matches “fixes a bug in existing functionality,” along with a confidence level (high, medium, low). For the Est. Bug Density metric, we only count PRs with high and medium confidence labels to reduce false positives, and this list is limited to bug-fix PRs that can be reliably linked to a root cause PR within a 90-day window.

How do we link BugFix PRs to the Root Cause PRs?

  • Once a PR is identified as a bug fix, we run a multi-step analysis to find which prior PR introduced the bug within a 90-day period.

  • A second verification analyzes the candidate root cause PR and the bug fix PR and validates whether the bug fix addresses what the root cause PR introduced.

Why does the chart say “Includes PRs in Observation Window”?

If you select a timeframe that includes PRs in the 90-day observation window, we won’t show results because the data would be incomplete. You can click “Jump before window” to see results from before the observation window.

When I change the timeframe, I don’t see any data. Why?

To show meaningful results, we need each dosage category (AI code ratio bucket) to meet these thresholds:

  • At least 10 PRs merged in the selected timeframe

  • ≥ 70% of those PRs in the selected timeframe have completed their post-merge evaluation

What do you mean you look back 90 days?

We want to know how many defects came from a set of PRs. But we can't measure that the day a PR is merged, so we need to wait and see whether defects show up. So we wait 90 days.

The 90-day gap is the "wait and see" period. We only look at PRs old enough that we've had 90 days to observe whether they caused defects.

Here's how the math works, step by step:

  1. Start with today's date

  2. Go back 90 days. That's the end of the window (the most recent PR we can fairly evaluate)

  3. Go back further by whatever range is selected. That's the start of the window

Example:

If you select Last 2 weeks, we go back 90 days, then look at the 2-week stretch before that. If you select Last 3 months, we go back 90 days, then look at the 3-month stretch before that.

Why "Last 3 months" doesn't mean the last 3 months: if you select "last 3 months," you might expect to see recent data. But because of the 90-day observation window, you're actually seeing PRs from ~3 to 6 months ago. That's by design. It's the only way to get accurate defect data.

Why does the chart say “Insufficient Data”?

The metric enforces two hard thresholds before displaying results:

  • Minimum PR count: At least 10 merged PRs in the selected period

  • Processing coverage: At least 70% of those PRs must have completed their 90-day observation window

The chart will show "Insufficient Data" if either condition isn't met. For example:

  • You selected a short timeframe with fewer than 10 PRs, or

  • Enough PRs exist but fewer than 70% have completed their post-merge evaluation period