How to Solve “Cannot Reproduce” Bugs That Cost Support Teams Hours
Mar 12, 2026
Support engineering teams frequently face a “visibility gap”: vague customer reports and incomplete data that lead to dreaded “cannot reproduce” (CNR) bugs. To scale, support must diagnose root causes in minutes without constantly escalating to developers.
In this article, we explore how to equip support engineers with the tools to achieve technical certainty at scale by leveraging runtime context and AI-driven reasoning.
————
The High Cost of “Cannot Reproduce”
In Support Engineering, “cannot reproduce” is the most expensive phrase we can hear.
Studies suggest that 17% of tickets are closed after being marked “cannot reproduce.” When this happens:
- The customer stays frustrated,
- The Support Engineer loses credibility,
- And the developer’s focus is shattered by hunting for an issue that cannot be found.
Beyond immediate friction, these bugs create a hidden liability. They remain in production, increasing the “blast radius” of potential failures as more users hit the same edge case. With software instability climbing by nearly 10% as a follow-on consequence of AI-accelerated development, our investigation methods must evolve.
What is the Reproduction Tax?
The Reproduction Tax is the wasted engineering capacity spent trying to mimic production behavior in local or staging environments. Organizations that eliminate this tax can reduce Mean Time to Resolution (MTTR) from hours to minutes.
Why Modern Incident Resolution Is Broken
- Environmental Mismatch: Production failures often rely on specific traffic patterns or data states that staging simply cannot mirror. This gap extends downtime, negatively impacting the customer experience and bleeding company revenue.
- Context gaps: Teams lack runtime context, which forces support into a cycle of reactive firefighting, wasting engineering time on guesswork instead of innovation.
- AI agent limitations: AI agents face the same limitations as human engineers in incident resolution. Without runtime data, the AI’s effectiveness depends entirely on how well an engineer can guess the correct context in a prompt.
- Manual Data Collection: Legacy debugging requires new deployments just to collect diagnostic data. Every log-and-redeploy cycle inflates observability costs with limited ROI for the business.
- Developer dependency: Support teams are frequently paralyzed for days waiting on developer bandwidth to investigate tickets. This delay strains client relationships and increases the likelihood of churn. It also forces developers to prioritize tedious reproduction cycles over high-value feature development.
The Reality: Staging is not production. Modern distributed systems are too complex to replicate. We are forced to investigate live failures using static, historical logs rather than the active state of the system.
The 8-hour investigation: A best-case scenario
Even a simple “file upload failure” can consume a full engineering day. Compare the traditional workflow to an autonomous one:
- Report: A support engineer receives a report from a customer describing a failure.
- Context collection: We follow up to collect essential diagnostic information (e.g., file type, whether the issue is widespread or localized).
- Hypothesis testing: We try to eliminate potential causes by guessing at environmental variables (e.g., “does it only crash when the list exceeds 100 rows?”).
- Escalation: We escalate to engineering to check logs. If data was not captured, the trail goes cold.
- Redeploy: We have to redeploy code changes just to see if the root cause can be captured.
- Repeat: If the cause of the incident was not found, the cycle repeats.
| Timeline Point | Traditional Investigation Milestone |
| --- | --- |
| T+0h | Customer reports a 15MB CSV upload failure; generic error provided. |
| T+1h | Support engineer searches through logs. |
| T+2h | Escalated to engineering after support cannot determine the cause. |
| T+4h | Engineers check logs only to find generic UNKNOWN_ERROR. |
| T+6h | Developers add debug logging and redeploy code just to see the system state. |
| T+8h | Root cause finally identified; an entire workday and the developer’s focus are gone. |
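The T+6h step in the timeline above usually looks like a throwaway change shipped solely to see the system state. A minimal sketch of that pattern (plain Java; the handler name, the 10MB limit, and the field names are assumptions for illustration, not a real codebase):

```java
public class UploadHandler {
    public static String handleUpload(String fileName, long sizeBytes) {
        // TEMP DEBUG -- added only to capture state; shipping it requires a
        // redeploy, and removing it later requires another one.
        System.err.println("DEBUG upload: file=" + fileName + " sizeBytes=" + sizeBytes);

        // A hidden limit (e.g., in a proxy or framework config) that the
        // generic error message never surfaced.
        if (sizeBytes > 10 * 1024 * 1024) {
            return "UNKNOWN_ERROR";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // The 15MB CSV from the ticket reproduces the generic failure.
        System.out.println(handleUpload("report.csv", 15L * 1024 * 1024));
    }
}
```

Every such log-and-redeploy round trip is a full deployment cycle spent on a single data point.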
The solution: AI-driven reasoning and runtime context
To eliminate this lengthy process, we need to adopt autonomous issue resolution. AI site reliability engineering tools (AI SREs) exist for this purpose, but they need two core capabilities:
- AI-driven reasoning: The ability to analyze and correlate data across multiple observability vendors, APIs, and databases.
- Runtime Context: The “Source of Truth.” Instead of recreating a failure, teams capture the complete failure state where it happened.
We built Lightrun AI SRE on these two principles. By combining them, support teams can understand issues, find and test mitigation suggestions, and resolve many incidents without escalating to development teams.
Introducing Lightrun AI SRE: Reliability across the SDLC
Reliability cannot start after an incident occurs. It must extend across the entire software development lifecycle. Engineers should be able to ask questions about their systems as they work, and support teams should be able to query behavior the moment it is flagged.
We built Lightrun AI SRE to transform investigations into a unified analysis layer. Because AI cannot resolve what it cannot see, Lightrun observes the system as it actually runs, safely instrumenting live environments without a redeploy and correlating context from multiple perspectives: unifying logs, metrics, traces, infrastructure signals, and change history.
By grounding this analysis in live execution state, Lightrun provides root cause analysis with a level of confidence that static code or disconnected logs simply cannot match.
This fundamentally changes how we can ensure reliability for our customers:
- Instant system understanding: Lightrun AI SRE explains how the system works, answers behavior questions, and clarifies configuration and architecture. This translates into fewer escalations, faster MTTR, and smoother customer onboarding.
- Failure classification: By distinguishing real bugs from setup or environment issues, it enables faster triage and reduces the false escalations that clutter developer backlogs.
- Intelligent incident routing: By identifying whether an issue is application-level, infrastructure-related, or dependency-driven, it routes the incident to the relevant team immediately. This eliminates the “ping-pong” between departments and ensures clear ownership.
- Strengthened resilience: It provides tested, actionable remediation suggestions and improves long-term resilience through automated postmortems, ensuring the resolution of one ticket can prevent the next ten.
Case study: The bug that was actually a hard limit
In our own work here at Lightrun, we test out our own product whenever we face engineering issues. I like to approach root cause analysis like a detective, collecting all the evidence I can from logs, telemetry, code, and recent changes.
We had an interesting use case with a client. The issue turned out to be simple, but finding it was a real challenge. One of our APIs was returning an unexpectedly small number of records to one of our customers, so they could not see all their entities in the plugin interface.
Initially, we were stuck. The API appeared functional and the database returned results, but something was clearly wrong. When we tried to reproduce it locally, everything worked perfectly. It turned out our local test data didn’t hit the specific volume thresholds present in production, allowing the bug to remain hidden.
We were building Lightrun AI SRE at the time, so we tested it out. It was only when we dug into the client’s production environment that the truth came out. Collecting live runtime context, the AI agent placed a snapshot right after the database query but before the return to the user. Without escalating to a developer, it set an expression to compare the rows returned from the database with those returned by the API, and the live context revealed the discrepancy instantly:
The client was requesting 100 records from the database, but the API could only return 20 due to a wrong REST controller configuration.
It wasn’t a ghost; it was a hard-limit annotation in the code.
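A minimal sketch of how this kind of hard limit can hide in a controller layer (plain Java; the class name, the `MAX_PAGE_SIZE` constant, and the limit of 20 are illustrative assumptions, not the customer’s actual code):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class EntityController {
    // A maximum page size baked into the controller configuration --
    // easy to miss, because the database query itself is correct.
    private static final int MAX_PAGE_SIZE = 20;

    // Simulates the API boundary: the database returns the full result
    // set, but the response is silently truncated to MAX_PAGE_SIZE.
    public static List<Integer> getEntities(List<Integer> dbRows) {
        return dbRows.stream().limit(MAX_PAGE_SIZE).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> dbRows = IntStream.range(0, 100).boxed().collect(Collectors.toList());
        List<Integer> apiRows = getEntities(dbRows);
        // The discrepancy the runtime snapshot revealed: 100 rows in, 20 out.
        System.out.println("db=" + dbRows.size() + " api=" + apiRows.size());
    }
}
```

Because our local test data never exceeded 20 rows, the truncation was invisible everywhere except production; a snapshot comparing the two collections at runtime exposes it immediately.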
While the root cause was visible in the code once we knew where to look, the “killer feature” was being able to get the data to prove it without a single reproduction attempt. Having the AI SRE agent dynamically collect this evidence from the actual environment gave us the technical certainty to implement a long-term solution, using search and pagination, in minutes. Impressively, we could do all of this without a single redeploy.
The new support investigation workflow
Using Lightrun AI SRE, we’ve adopted an automated six-step workflow that lets our team work without the Reproduction Tax:
- Triage automatically: Lightrun AI SRE connects reports to live system signals directly within tools like Slack or PagerDuty to determine if a behavior is a user error or a bug.
- Assess impact: It identifies failing services and the exact percentage of users impacted in real time.
- Eliminate unknowns: The AI agent dynamically tests hypotheses by correlating runtime snapshots and environment variables to rule out common culprits like configuration drifts or local data mismatches.
- Prove root cause: It investigates directly in production systems to capture live variable values without manually checking code.
- Convert to knowledge: It then takes all the investigation details and automatically updates the initial Jira ticket, explaining the event to engineers alongside technical proof like logs and snapshots.
- Propose and validate fixes: It can then suggest product changes to prevent recurrence, and using the Lightrun MCP, AI code agents can generate and validate those changes before the team updates the customer on a successful resolution.
Switch to 4-minute incident resolution
Our goal isn’t just to close tickets faster; it’s to end the “ping-pong” between departments that wastes engineering bandwidth. By removing the Reproduction Tax, enterprise teams at organizations like Taboola and AT&T have achieved a 90% reduction in MTTR.
When your team has access to live production evidence, you move beyond the limits of reactive triage. You stop simply routing tickets and start delivering the technical certainty required to resolve incidents in minutes, not days.
Explore Lightrun AI SRE
Frequently asked questions
Is it safe to run Lightrun in production?
Yes. Lightrun uses dynamic read-only instrumentation designed for production systems with a negligible performance footprint. Engineers can observe variable values and execution paths in real time without restarting services or impacting the user experience.
How much does Lightrun reduce investigation time?
By eliminating reproduction cycles and log-and-redeploy debugging loops, Lightrun helps teams identify root causes significantly faster. Enterprise teams, including organizations such as Taboola and AT&T, have reported up to a 90% reduction in MTTR, reducing complex investigations from several hours to minutes.
Can support teams resolve incidents without escalating to developers?
Yes. Lightrun AI SRE allows support teams to investigate issues directly in production systems by capturing runtime context and analyzing system behavior in real time. This enables support engineers to identify root causes and resolve many incidents without escalating to development teams.