Mean Time To Recover (MTTR) Report
Last updated: January 28, 2026
Overview
The Mean Time to Recover (MTTR) report measures how quickly your team responds to and resolves production incidents. This critical reliability metric tracks the average time elapsed from when an incident is created until it's marked as resolved, giving you insight into your team's incident response effectiveness and operational readiness.
Key Insight: Lower MTTR means faster recovery, reduced customer impact, and more time for your teams to focus on planned work.
What Does This Metric Tell You?
MTTR reveals several important aspects of your engineering organization:
🚨 Incident Response Speed - How quickly teams detect, diagnose, and fix production problems
💪 System Resilience - The stability and availability of your production systems
👥 Team Readiness - Your on-call capability and incident response processes
💼 Business Impact - How long your customers experience outages or degraded service
How It's Calculated
The report automatically calculates MTTR as the time elapsed between incident creation and resolution, using data from your integrated incident management system (PagerDuty, Opsgenie, etc.).
The report provides multiple views:
Average - The typical recovery time across all incidents
Median (P50) - The middle value (half resolve faster, half slower)
P75, P90 - Percentile views showing how your slowest incidents perform
Maximum - The longest recovery time in the period
💡 Pro Tip: Focus on the P90 value for realistic planning. While averages can be skewed by outliers, the P90 tells you the time within which 90% of your incidents resolve.
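As a sketch of how these views relate, the summary statistics can be computed directly from incident durations using Python's standard library. The timestamps below are illustrative, not data from your Span workspace:

```python
from datetime import datetime
import statistics

# Hypothetical incident records: (created_at, resolved_at) pairs.
incidents = [
    (datetime(2026, 1, 5, 9, 0),  datetime(2026, 1, 5, 9, 45)),
    (datetime(2026, 1, 8, 14, 0), datetime(2026, 1, 8, 17, 0)),
    (datetime(2026, 1, 12, 2, 30), datetime(2026, 1, 12, 4, 0)),
    (datetime(2026, 1, 20, 11, 0), datetime(2026, 1, 21, 13, 0)),
]

# Recovery time in hours for each incident.
durations = sorted((end - start).total_seconds() / 3600
                   for start, end in incidents)

mean_hours = statistics.mean(durations)    # skewed upward by the 26-hour outlier
p50_hours = statistics.median(durations)   # half resolve faster, half slower
# quantiles(..., n=10) returns the 10th..90th percentiles; index 8 is the P90.
p90_hours = statistics.quantiles(durations, n=10, method="inclusive")[8]
max_hours = max(durations)
```

Note how one 26-hour incident pulls the average well above the median, which is exactly why the percentile views exist.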
Where to Find This Report
Navigation Path:
Productivity: DORA → MTTR Report
View By: Team/Group level to compare performance across your organization
Report Type: Trend charts showing MTTR over time
The MTTR report appears alongside complementary metrics like incident frequency and change failure rate, giving you a complete view of your reliability posture.
Interpreting Your Results
What's a Good MTTR?
| MTTR Value | Performance Level | What It Means |
| --- | --- | --- |
| < 1 hour | 🟢 Excellent | Fast incident response with strong on-call capability |
| 1-4 hours | 🟡 Good | Quick resolution and effective diagnosis processes |
| 4-24 hours | 🟠 Fair | Reasonable response with opportunities to improve |
| > 24 hours | 🔴 Needs Work | Slow resolution causing extended customer impact |
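The bands above can be expressed as a small helper. The thresholds mirror the table and are guidelines, not hard rules; calibrate them against your own baseline:

```python
def mttr_performance(hours: float) -> str:
    """Map an MTTR value in hours to the performance bands in the table above."""
    if hours < 1:
        return "Excellent"
    if hours <= 4:
        return "Good"
    if hours <= 24:
        return "Fair"
    return "Needs Work"
```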
Key Questions to Ask
📈 Trending: Is your MTTR improving or degrading over time?
📊 Distribution: What's your P90? This is the time within which 90% of incidents resolve; the slowest 10% take longer—important for capacity planning and SLA commitments.
👥 By Team: Which teams resolve incidents faster? What can other teams learn from them?
🔗 Combined Context:
High MTTR + High Incident Frequency = System instability (focus on resilience)
High MTTR + Low Incident Frequency = Slow response process (focus on runbooks)
Low MTTR + High Incident Frequency = Need better prevention (focus on quality)
Low MTTR + Low Incident Frequency = Strong reliability (maintain and optimize)
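The four quadrants above can be sketched as a decision helper. The threshold values are illustrative assumptions, not part of Span; replace them with your own baselines:

```python
def reliability_guidance(mttr_hours: float, incidents_per_week: float,
                         mttr_threshold: float = 4.0,
                         frequency_threshold: float = 2.0) -> str:
    """Map the MTTR / incident-frequency quadrants to a suggested focus area.

    The default thresholds (4 hours, 2 incidents/week) are hypothetical;
    tune them to what "high" means for your organization.
    """
    slow_recovery = mttr_hours >= mttr_threshold
    frequent = incidents_per_week >= frequency_threshold
    if slow_recovery and frequent:
        return "System instability: focus on resilience"
    if slow_recovery:
        return "Slow response process: focus on runbooks"
    if frequent:
        return "Need better prevention: focus on quality"
    return "Strong reliability: maintain and optimize"
```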
Related Metrics
View MTTR alongside these complementary metrics for complete reliability insights:
| Metric | What It Measures | Why It Matters |
| --- | --- | --- |
| Total Incidents | Raw count of production incidents | Shows volume of reliability issues |
| Incidents per Week | How often incidents occur | Normalized frequency over time |
| Change Failure Rate | Ratio of incidents to deployments | Reveals deployment quality |
| Deployment Frequency | How often you ship to production | Balances velocity with stability |
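As a sketch, two of these metrics reduce to simple ratios. The function names are illustrative, not part of Span's API:

```python
def change_failure_rate(incident_count: int, deployment_count: int) -> float:
    """Change failure rate as the ratio of incidents to deployments."""
    if deployment_count == 0:
        raise ValueError("no deployments in the period")
    return incident_count / deployment_count

def incidents_per_week(incident_count: int, days_in_period: int) -> float:
    """Normalize a raw incident count to a weekly frequency."""
    return incident_count / (days_in_period / 7)
```

For example, 3 incidents across 60 deployments is a 5% change failure rate, and 8 incidents over a 4-week period is 2 incidents per week.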
Taking Action: How to Improve MTTR
1. Diagnose Root Causes
Analyze your slowest incidents:
Click into specific high-MTTR incidents in Span
Look for patterns: What services? What time of day? Which teams?
Break down the timeline: Detection time vs. diagnosis time vs. fix time
Compare across teams:
Teams with better MTTR likely have effective practices worth sharing
Identify teams that need additional support or training
Check if specialized services (databases, payments) have longer recovery times
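Breaking down the timeline can be as simple as subtracting the timestamps your incident tool records. The field names and times below are hypothetical, assuming your tool exposes created, acknowledged, and resolved timestamps plus a diagnosis marker:

```python
from datetime import datetime

# Hypothetical timeline for a single incident.
created = datetime(2026, 1, 10, 3, 12)       # alert fired / incident created
acknowledged = datetime(2026, 1, 10, 3, 20)  # responder engaged
diagnosed = datetime(2026, 1, 10, 4, 5)      # root cause identified
resolved = datetime(2026, 1, 10, 4, 40)      # incident marked resolved

# Each phase in minutes: the slowest phase is where to invest first.
detection_min = (acknowledged - created).total_seconds() / 60
diagnosis_min = (diagnosed - acknowledged).total_seconds() / 60
fix_min = (resolved - diagnosed).total_seconds() / 60
```

Here diagnosis dominates the timeline, which would point toward better observability rather than faster paging.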
2. Strengthen Incident Response
Build better runbooks:
Document common incident patterns and their solutions
Make runbooks searchable and keep them updated
Automate frequent remediation steps (service restarts, failovers)
Improve observability:
Invest in better metrics, logging, and distributed tracing
Create dashboards that show system health at a glance
Set up alerts that pinpoint problems, not just symptoms
Optimize on-call:
Ensure clear escalation paths and decision authority
Design schedules that prevent on-call fatigue
Provide teams with the tools and access they need to respond
3. Prevent Future Incidents
Learn systematically:
Conduct blameless post-mortems on significant incidents
Track and prioritize fixes based on frequency and impact
Share learnings across the organization
Build resilience:
Implement circuit breakers and graceful degradation
Reduce blast radius through better service isolation
Use canary deployments and gradual rollouts
4. Track Progress
Set realistic targets:
Establish your current baseline (P50 and P90)
Define achievable 3-6 month goals for each team
Review progress in team retrospectives and reviews
Make it visible:
Share Span dashboards with on-call teams and leadership
Include MTTR trends in team health discussions
Celebrate improvements and share successful practices
Best Practices
✅ Do's
Focus on trends - Is MTTR improving quarter over quarter?
Use percentiles - P90 is more realistic than average for planning
Look at context - Always review MTTR alongside incident frequency
Share learnings - Use Span data to identify and spread best practices
Measure outcomes - Did that new runbook actually reduce MTTR?
❌ Don'ts
Don't optimize the metric alone - The goal is minimizing customer impact, not hitting arbitrary targets
Don't ignore outliers - Very slow incidents might reveal systemic issues
Don't skip post-mortems - Fast recovery is good, but learning prevents recurrence
Don't blame teams - Use MTTR to identify process improvements, not to punish
The Bigger Picture: MTTR and Engineering Health
Improving MTTR has ripple effects across your engineering organization:
🧠 Reduced Context Switching - Engineers spend less time firefighting, more time building
😊 Better Team Morale - Fast incident resolution reduces stress and on-call burnout
🚀 Deployment Confidence - When you can recover quickly, teams ship with less anxiety
📚 Organizational Learning - Regular incident analysis builds institutional knowledge
Use Span's development analytics alongside MTTR to understand the full picture: Are teams with better MTTR also shipping features faster? Is deployment velocity improving alongside reliability?
Getting Started
Review your current MTTR in the DORA Metrics section
Identify your slowest incidents from the past quarter
Pick one improvement area (runbooks, observability, or prevention)
Set a realistic target and track progress over the next 3 months
Share wins with your team and organization
Need Help?
Questions about your data? Contact your Span account team
Want to dive deeper? Check out the related metrics in your DORA dashboard
Looking for best practices? Review high-performing teams in your organization using Span's team comparison views
This metric is part of Span's DORA Metrics suite, measuring the four key indicators of software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to recover.