Mean Time To Recover (MTTR) Report

Last updated: January 28, 2026

Overview

The Mean Time to Recover (MTTR) report measures how quickly your team responds to and resolves production incidents. This critical reliability metric tracks the average time elapsed from when an incident is created until it's marked as resolved, giving you insight into your team's incident response effectiveness and operational readiness.

Key Insight: Lower MTTR means faster recovery, reduced customer impact, and more time for your teams to focus on planned work.


What Does This Metric Tell You?

MTTR reveals several important aspects of your engineering organization:

  • 🚨 Incident Response Speed - How quickly teams detect, diagnose, and fix production problems

  • 💪 System Resilience - The stability and availability of your production systems

  • 👥 Team Readiness - Your on-call capability and incident response processes

  • 💼 Business Impact - How long your customers experience outages or degraded service


How It's Calculated

Span automatically calculates MTTR as the time elapsed between incident creation and resolution, using data from your integrated incident management system (PagerDuty, Opsgenie, etc.).

The report provides multiple views:

  • Average - The typical recovery time across all incidents

  • Median (P50) - The middle value (half resolve faster, half slower)

  • P75, P90 - Percentile views showing how your slowest incidents perform

  • Maximum - The longest recovery time in the period

💡 Pro Tip: Focus on the P90 value for realistic planning. While averages can be skewed by outliers, the P90 tells you the time within which 90% of your incidents resolve.
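As a rough illustration of these views, here is a minimal Python sketch that computes the same statistics from a handful of hypothetical incident timestamps. The records and their layout are invented for the example; this is not Span's data format or API.

```python
from datetime import datetime
from statistics import mean, quantiles

# Hypothetical incident records: (created_at, resolved_at) as ISO-8601 timestamps.
incidents = [
    ("2026-01-05T09:00:00", "2026-01-05T09:45:00"),
    ("2026-01-08T14:10:00", "2026-01-08T17:40:00"),
    ("2026-01-12T02:30:00", "2026-01-12T11:30:00"),
    ("2026-01-20T16:00:00", "2026-01-21T20:00:00"),
]

# Recovery time in hours for each incident.
hours = [
    (datetime.fromisoformat(done) - datetime.fromisoformat(start)).total_seconds() / 3600
    for start, done in incidents
]

# method="inclusive" interpolates within the observed range, which suits a
# complete population of incidents; cut point i-1 of quantiles(..., n=100) is Pi.
pct = quantiles(hours, n=100, method="inclusive")
stats = {
    "average": mean(hours),
    "p50": pct[49],
    "p75": pct[74],
    "p90": pct[89],
    "max": max(hours),
}
```

Even in this tiny sample, the average (about 10.3 hours) is dragged up by the single 28-hour incident, while the median is only 6.25 hours: a concrete case of the outlier skew the tip above warns about.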


Where to Find This Report

Navigation Path:

  • Productivity: DORA → MTTR Report

  • View By: Team/Group level to compare performance across your organization

  • Report Type: Trend charts showing MTTR over time

The MTTR report appears alongside complementary metrics like incident frequency and change failure rate, giving you a complete view of your reliability posture.


Interpreting Your Results

What's a Good MTTR?

| MTTR Value | Performance Level | What It Means |
|------------|-------------------|---------------|
| < 1 hour   | 🟢 Excellent      | Fast incident response with strong on-call capability |
| 1-4 hours  | 🟡 Good           | Quick resolution and effective diagnosis processes |
| 4-24 hours | 🟠 Fair           | Reasonable response with opportunities to improve |
| > 24 hours | 🔴 Needs Work     | Slow resolution causing extended customer impact |
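The banding above can be expressed as a tiny helper. This is purely illustrative; the thresholds come from the table, not from any Span API.

```python
def mttr_band(mttr_hours: float) -> str:
    """Map an MTTR value (in hours) to the illustrative performance bands above."""
    if mttr_hours < 1:
        return "Excellent"
    if mttr_hours <= 4:
        return "Good"
    if mttr_hours <= 24:
        return "Fair"
    return "Needs Work"
```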

Key Questions to Ask

📈 Trending: Is your MTTR improving or degrading over time?

📊 Distribution: What's your P90? It marks the time within which 90% of incidents resolve (the slowest 10% take longer), which matters for capacity planning and SLA commitments.

👥 By Team: Which teams resolve incidents faster? What can other teams learn from them?

🔗 Combined Context:

  • High MTTR + High Incident Frequency = System instability (focus on resilience)

  • High MTTR + Low Incident Frequency = Slow response process (focus on runbooks)

  • Low MTTR + High Incident Frequency = Need better prevention (focus on quality)

  • Low MTTR + Low Incident Frequency = Strong reliability (maintain and optimize)
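One way to make this quadrant reasoning concrete is a small classifier. The default thresholds below are placeholder values, not recommendations; each team should calibrate them against its own baseline.

```python
def reliability_focus(mttr_hours: float, incidents_per_week: float,
                      mttr_threshold: float = 4.0, freq_threshold: float = 2.0) -> str:
    """Suggest a focus area from the MTTR / incident-frequency quadrant.

    The default thresholds are illustrative placeholders.
    """
    high_mttr = mttr_hours > mttr_threshold
    high_freq = incidents_per_week > freq_threshold
    if high_mttr and high_freq:
        return "system instability: focus on resilience"
    if high_mttr:
        return "slow response process: focus on runbooks"
    if high_freq:
        return "needs better prevention: focus on quality"
    return "strong reliability: maintain and optimize"
```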


Related Metrics

View MTTR alongside these complementary metrics for complete reliability insights:

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| Total Incidents | Raw count of production incidents | Shows volume of reliability issues |
| Incidents per Week | How often incidents occur | Normalized frequency over time |
| Change Failure Rate | Ratio of incidents to deployments | Reveals deployment quality |
| Deployment Frequency | How often you ship to production | Balances velocity with stability |


Taking Action: How to Improve MTTR

1. Diagnose Root Causes

Analyze your slowest incidents:

  • Click into specific high-MTTR incidents in Span

  • Look for patterns: What services? What time of day? Which teams?

  • Break down the timeline: Detection time vs. diagnosis time vs. fix time

Compare across teams:

  • Teams with better MTTR likely have effective practices worth sharing

  • Identify teams that need additional support or training

  • Check if specialized services (databases, payments) have longer recovery times

2. Strengthen Incident Response

Build better runbooks:

  • Document common incident patterns and their solutions

  • Make runbooks searchable and keep them updated

  • Automate frequent remediation steps (service restarts, failovers)

Improve observability:

  • Invest in better metrics, logging, and distributed tracing

  • Create dashboards that show system health at a glance

  • Set up alerts that pinpoint problems, not just symptoms

Optimize on-call:

  • Ensure clear escalation paths and decision authority

  • Design schedules that prevent on-call fatigue

  • Provide teams with the tools and access they need to respond

3. Prevent Future Incidents

Learn systematically:

  • Conduct blameless post-mortems on significant incidents

  • Track and prioritize fixes based on frequency and impact

  • Share learnings across the organization

Build resilience:

  • Implement circuit breakers and graceful degradation

  • Reduce blast radius through better service isolation

  • Use canary deployments and gradual rollouts

4. Track Progress

Set realistic targets:

  • Establish your current baseline (P50 and P90)

  • Define achievable 3-6 month goals for each team

  • Review progress in team retrospectives and reviews

Make it visible:

  • Share Span dashboards with on-call teams and leadership

  • Include MTTR trends in team health discussions

  • Celebrate improvements and share successful practices


Best Practices

Do's

  • Focus on trends - Is MTTR improving quarter over quarter?

  • Use percentiles - P90 is more realistic than average for planning

  • Look at context - Always review MTTR alongside incident frequency

  • Share learnings - Use Span data to identify and spread best practices

  • Measure outcomes - Did that new runbook actually reduce MTTR?

Don'ts

  • Don't optimize the metric alone - The goal is minimizing customer impact, not hitting arbitrary targets

  • Don't ignore outliers - Very slow incidents might reveal systemic issues

  • Don't skip post-mortems - Fast recovery is good, but learning prevents recurrence

  • Don't blame teams - Use MTTR to identify process improvements, not to punish


The Bigger Picture: MTTR and Engineering Health

Improving MTTR has ripple effects across your engineering organization:

  • 🧠 Reduced Context Switching - Engineers spend less time firefighting, more time building

  • 😊 Better Team Morale - Fast incident resolution reduces stress and on-call burnout

  • 🚀 Deployment Confidence - When you can recover quickly, teams ship with less anxiety

  • 📚 Organizational Learning - Regular incident analysis builds institutional knowledge

Use Span's development analytics alongside MTTR to understand the full picture: Are teams with better MTTR also shipping features faster? Is deployment velocity improving alongside reliability?


Getting Started

  1. Review your current MTTR in the DORA Metrics section

  2. Identify your slowest incidents from the past quarter

  3. Pick one improvement area (runbooks, observability, or prevention)

  4. Set a realistic target and track progress over the next 3 months

  5. Share wins with your team and organization


Need Help?

  • Questions about your data? Contact your Span account team

  • Want to dive deeper? Check out the related metrics in your DORA dashboard

  • Looking for best practices? Review high-performing teams in your organization using Span's team comparison views


This metric is part of Span's DORA Metrics suite, measuring the four key indicators of software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to recover.