Mean Time To Recover (MTTR) Report

Last updated: January 28, 2026

Overview

The Mean Time to Recover (MTTR) report measures how quickly your team responds to and resolves production incidents. This critical reliability metric tracks the average time elapsed from when an incident is created until it's marked as resolved, giving you insight into your team's incident response effectiveness and operational readiness.

Key Insight: Lower MTTR means faster recovery, reduced customer impact, and more time for your teams to focus on planned work.


What Does This Metric Tell You?

MTTR reveals several important aspects of your engineering organization:

  • 🚨 Incident Response Speed - How quickly teams detect, diagnose, and fix production problems

  • 💪 System Resilience - The stability and availability of your production systems

  • 👥 Team Readiness - Your on-call capability and incident response processes

  • 💼 Business Impact - How long your customers experience outages or degraded service


How It's Calculated

Span automatically calculates MTTR as the time elapsed between incident creation and resolution, using data from your integrated incident management system (PagerDuty, Opsgenie, etc.).

The report provides multiple views:

  • Average - The typical recovery time across all incidents

  • Median (P50) - The middle value (half resolve faster, half slower)

  • P75, P90 - Percentile views showing how your slowest incidents perform

  • Maximum - The longest recovery time in the period

💡 Pro Tip: Focus on the P90 value for realistic planning. While averages can be skewed by outliers, the P90 tells you the time within which 90% of your incidents resolve.
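As a rough illustration of these views, here is a minimal Python sketch that computes the same statistics from a handful of hypothetical incident timestamps. The records and their layout are invented for the example; this is not Span's data format or API.

```python
from datetime import datetime
from statistics import mean, quantiles

# Hypothetical incident records: (created_at, resolved_at) as ISO-8601 timestamps.
incidents = [
    ("2026-01-05T09:00:00", "2026-01-05T09:45:00"),
    ("2026-01-08T14:10:00", "2026-01-08T17:40:00"),
    ("2026-01-12T02:30:00", "2026-01-12T11:30:00"),
    ("2026-01-20T16:00:00", "2026-01-21T20:00:00"),
]

# Recovery time in hours for each incident.
hours = [
    (datetime.fromisoformat(done) - datetime.fromisoformat(start)).total_seconds() / 3600
    for start, done in incidents
]

# method="inclusive" interpolates within the observed range, which suits a
# complete population of incidents; cut point i-1 of quantiles(..., n=100) is Pi.
pct = quantiles(hours, n=100, method="inclusive")
stats = {
    "average": mean(hours),
    "p50": pct[49],
    "p75": pct[74],
    "p90": pct[89],
    "max": max(hours),
}
```

Even in this tiny sample, the average (about 10.3 hours) is dragged up by the single 28-hour incident, while the median is only 6.25 hours: a concrete case of the outlier skew the tip above warns about.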


Where to Find This Report

Navigation Path:

  • Productivity: DORA → MTTR Report

  • View By: Team/Group level to compare performance across your organization

  • Report Type: Trend charts showing MTTR over time

The MTTR report appears alongside complementary metrics like incident frequency and change failure rate, giving you a complete view of your reliability posture.


Interpreting Your Results

What's a Good MTTR?

| MTTR Value | Performance Level | What It Means |
|------------|-------------------|---------------|
| < 1 hour   | 🟢 Excellent      | Fast incident response with strong on-call capability |
| 1-4 hours  | 🟡 Good           | Quick resolution and effective diagnosis processes |
| 4-24 hours | 🟠 Fair           | Reasonable response with opportunities to improve |
| > 24 hours | 🔴 Needs Work     | Slow resolution causing extended customer impact |
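The banding above can be expressed as a tiny helper. This is purely illustrative; the thresholds come from the table, not from any Span API.

```python
def mttr_band(mttr_hours: float) -> str:
    """Map an MTTR value (in hours) to the illustrative performance bands above."""
    if mttr_hours < 1:
        return "Excellent"
    if mttr_hours <= 4:
        return "Good"
    if mttr_hours <= 24:
        return "Fair"
    return "Needs Work"
```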

Key Questions to Ask

📈 Trending: Is your MTTR improving or degrading over time?

📊 Distribution: What's your P90? It marks the time within which 90% of incidents resolve (the slowest 10% take longer), which matters for capacity planning and SLA commitments.

👥 By Team: Which teams resolve incidents faster? What can other teams learn from them?

🔗 Combined Context:

  • High MTTR + High Incident Frequency = System instability (focus on resilience)

  • High MTTR + Low Incident Frequency = Slow response process (focus on runbooks)

  • Low MTTR + High Incident Frequency = Need better prevention (focus on quality)

  • Low MTTR + Low Incident Frequency = Strong reliability (maintain and optimize)
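One way to make this quadrant reasoning concrete is a small classifier. The default thresholds below are placeholder values, not recommendations; each team should calibrate them against its own baseline.

```python
def reliability_focus(mttr_hours: float, incidents_per_week: float,
                      mttr_threshold: float = 4.0, freq_threshold: float = 2.0) -> str:
    """Suggest a focus area from the MTTR / incident-frequency quadrant.

    The default thresholds are illustrative placeholders.
    """
    high_mttr = mttr_hours > mttr_threshold
    high_freq = incidents_per_week > freq_threshold
    if high_mttr and high_freq:
        return "system instability: focus on resilience"
    if high_mttr:
        return "slow response process: focus on runbooks"
    if high_freq:
        return "needs better prevention: focus on quality"
    return "strong reliability: maintain and optimize"
```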


Related Metrics

View MTTR alongside these complementary metrics for complete reliability insights:

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| Total Incidents | Raw count of production incidents | Shows volume of reliability issues |
| Incidents per Week | How often incidents occur | Normalized frequency over time |
| Change Failure Rate | Ratio of incidents to deployments | Reveals deployment quality |
| Deployment Frequency | How often you ship to production | Balances velocity with stability |


Taking Action: How to Improve MTTR

1. Diagnose Root Causes

Analyze your slowest incidents:

  • Click into specific high-MTTR incidents in Span

  • Look for patterns: What services? What time of day? Which teams?

  • Break down the timeline: Detection time vs. diagnosis time vs. fix time

Compare across teams:

  • Teams with better MTTR likely have effective practices worth sharing

  • Identify teams that need additional support or training

  • Check if specialized services (databases, payments) have longer recovery times

2. Strengthen Incident Response

Build better runbooks:

  • Document common incident patterns and their solutions

  • Make runbooks searchable and keep them updated

  • Automate frequent remediation steps (service restarts, failovers)

Improve observability:

  • Invest in better metrics, logging, and distributed tracing

  • Create dashboards that show system health at a glance

  • Set up alerts that pinpoint problems, not just symptoms

Optimize on-call:

  • Ensure clear escalation paths and decision authority

  • Design schedules that prevent on-call fatigue

  • Provide teams with the tools and access they need to respond

3. Prevent Future Incidents

Learn systematically:

  • Conduct blameless post-mortems on significant incidents

  • Track and prioritize fixes based on frequency and impact

  • Share learnings across the organization

Build resilience:

  • Implement circuit breakers and graceful degradation

  • Reduce blast radius through better service isolation

  • Use canary deployments and gradual rollouts

4. Track Progress

Set realistic targets:

  • Establish your current baseline (P50 and P90)

  • Define achievable 3-6 month goals for each team

  • Review progress in team retrospectives and reviews

Make it visible:

  • Share Span dashboards with on-call teams and leadership

  • Include MTTR trends in team health discussions

  • Celebrate improvements and share successful practices


Best Practices

Do's

  • Focus on trends - Is MTTR improving quarter over quarter?

  • Use percentiles - P90 is more realistic than average for planning

  • Look at context - Always review MTTR alongside incident frequency

  • Share learnings - Use Span data to identify and spread best practices

  • Measure outcomes - Did that new runbook actually reduce MTTR?

Don'ts

  • Don't optimize the metric alone - The goal is minimizing customer impact, not hitting arbitrary targets

  • Don't ignore outliers - Very slow incidents might reveal systemic issues

  • Don't skip post-mortems - Fast recovery is good, but learning prevents recurrence

  • Don't blame teams - Use MTTR to identify process improvements, not to punish


The Bigger Picture: MTTR and Engineering Health

Improving MTTR has ripple effects across your engineering organization:

  • 🧠 Reduced Context Switching - Engineers spend less time firefighting, more time building

  • 😊 Better Team Morale - Fast incident resolution reduces stress and on-call burnout

  • 🚀 Deployment Confidence - When you can recover quickly, teams ship with less anxiety

  • 📚 Organizational Learning - Regular incident analysis builds institutional knowledge

Use Span's development analytics alongside MTTR to understand the full picture: Are teams with better MTTR also shipping features faster? Is deployment velocity improving alongside reliability?


Getting Started

  1. Review your current MTTR in the DORA Metrics section

  2. Identify your slowest incidents from the past quarter

  3. Pick one improvement area (runbooks, observability, or prevention)

  4. Set a realistic target and track progress over the next 3 months

  5. Share wins with your team and organization


Need Help?

  • Questions about your data? Contact your Span account team

  • Want to dive deeper? Check out the related metrics in your DORA dashboard

  • Looking for best practices? Review high-performing teams in your organization using Span's team comparison views


This metric is part of Span's DORA Metrics suite, measuring the four key indicators of software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to recover.