Use case

Implementing Lightrun to get richer context when debugging live production incidents


StartApp improved the efficiency of their incident resolution process and significantly decreased MTTR
On the edge of tech and innovation, StartApp has compiled an unmatched volume of intelligent mobile data insights to help content providers understand user behavior and needs. With 160 employees and 6 offices globally, StartApp partners with over 350,000 leading applications and works with a database of over 1 billion mobile users worldwide.

The Challenge

Juggling loads of traffic and its unknowns

StartApp handles more than 30 billion requests every day. Handling this enormous amount of traffic is an uphill battle – one that is fraught with complicated, nuanced production issues.

Two key types of production issues that come with this level of scale are concurrency and parallelism problems. These issues appear only under a specific set of circumstances and are often very hard to replicate reliably in a local environment. When they happen and are left unmitigated, they tend to lead to severe service disruptions, often causing data corruption and unexpected program behavior.

Software-defined caches are another pain point for StartApp’s developers. Some types of caches must remain immutable – that is, once a value is inserted into the cache it must not change. The developers discovered, however, that under specific configurations the values in the cache can indeed change. The result is an unfortunate situation: on a cache “hit”, the information returned is not the information the developers expect to see.

Investigating these types of issues is also very difficult since the cache gets “dirty” non-deterministically. In other words – while it’s relatively easy to identify that the information returned from the cache is incorrect, it’s hard to know which activities make the cache “dirty”.
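To make the failure mode concrete, here is a minimal Python sketch of one way a supposedly immutable cache can become “dirty”. It is purely illustrative: the source does not describe StartApp’s cache or the configurations involved, so the class names (`NaiveCache`, `CopyingCache`) and the `"campaign"` entry are hypothetical. The sketch assumes the cache hands callers the stored object itself, so any caller that modifies the result silently changes the cached value.

```python
import copy


class NaiveCache:
    """Hypothetical cache that hands out the stored object itself."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # First write wins: an existing entry is never replaced through put().
        self._store.setdefault(key, value)

    def get(self, key):
        # Returns the same object on every call, so callers can mutate it.
        return self._store[key]


class CopyingCache(NaiveCache):
    """Variant that returns a deep copy, so the cached value stays unchanged."""

    def get(self, key):
        return copy.deepcopy(self._store[key])


naive = NaiveCache()
naive.put("campaign", {"bid": 10})
naive.get("campaign")["bid"] = 99          # a caller modifies the returned value
naive_bid = naive.get("campaign")["bid"]   # the cached entry has changed too

safe = CopyingCache()
safe.put("campaign", {"bid": 10})
safe.get("campaign")["bid"] = 99           # only the caller's copy changes
safe_bid = safe.get("campaign")["bid"]     # the cached entry is intact
```

In the naive version nothing fails at the moment of the mutation; the wrong value only surfaces on a later hit, which is why the offending code path is so hard to pin down.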

“Using Lightrun, we were able to dive deep into production issues instantly, with a single action. By reducing the number of steps in the debugging process, Lightrun helped us reduce our MTTR by 50%-60%. Lightrun is definitely an ideal addition to every company’s toolbox, and is especially helpful in investigating hard-to-replicate production issues.”

Boris Shmerlin, Director of Advertising R&D at StartApp

The Solution

Leveraging Lightrun for real-time debugging, monitoring, and alerting without ever leaving the IDE

When StartApp learned of Lightrun’s approach to production debugging, they were immediately intrigued. They had, after all, been hard at work trying to untangle the exact issues Lightrun claims to solve. After a short exploration period, StartApp deployed Lightrun to a significant portion of their production services.

Previously, when developers wanted to add more visibility when a specific event occurs in a running application, they had to:

  1. Add a new piece of code that exposes some piece of information
  2. Ship the information produced to an external system, for example, Kibana
  3. Review the information in the external system

Lightrun eliminates this entire process by taking a more proactive approach. Using Lightrun, StartApp’s developers now define the conditions that identify the event they care about. When such a condition is met (that is, when the specific case is “caught”), Lightrun automatically pipes all the required information to the developers as alerts inside their IDE.

This new process is especially handy when debugging the previously mentioned cache issues. By placing Lightrun Snapshots on the relevant parts of the cache and inspecting the stack trace, the developers can now identify the exact flow that caused the cache to misbehave. Without Lightrun, capturing the same stack trace would take significantly longer – resulting in a higher MTTR and a decreased quality of service for their customers.
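The idea of capturing a stack trace only when a condition is met can be sketched by hand in a few lines of Python. This is not Lightrun’s API and not StartApp’s code; Lightrun collects this kind of data without changing the application, whereas the sketch below bakes the check into a hypothetical `AuditedCache` class to show what information such a capture provides. The function `refresh_campaign` stands in for whichever flow overwrites the cached value.

```python
import traceback


class AuditedCache:
    """Hypothetical cache that records who overwrote an existing entry."""

    def __init__(self):
        self._store = {}
        self.violations = []

    def put(self, key, value):
        # Condition of interest: an existing entry is about to change.
        if key in self._store and self._store[key] != value:
            self.violations.append({
                "key": key,
                "old": self._store[key],
                "new": value,
                # Call stack at the moment of the write, outermost frame first.
                "stack": traceback.format_stack(),
            })
        self._store[key] = value

    def get(self, key):
        return self._store[key]


def refresh_campaign(cache):
    # Stand-in for the flow that unexpectedly rewrites a cached value.
    cache.put("campaign", {"bid": 99})


cache = AuditedCache()
cache.put("campaign", {"bid": 10})   # first insert, no violation recorded
refresh_campaign(cache)              # overwrite, captured with its stack

culprit_frames = [f for f in cache.violations[0]["stack"] if "refresh_campaign" in f]
```

The recorded stack names `refresh_campaign` as the caller responsible for the overwrite, which is the kind of answer the developers were after. The hand-rolled version requires a code change, a deploy, and a later cleanup, which are exactly the steps the case study describes avoiding.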

Getting a better grip on issues that appear only under specific circumstances is also much easier with Lightrun. By placing a Lightrun agent in each of their data centers, StartApp’s developers were able to identify, in real time, issues that are isolated to a single data center and resolve them significantly faster.

The Results

50%-60% faster incident resolution using Lightrun

StartApp saves a lot of time by debugging with Lightrun, relieving their teams of unnecessary repetitive processes and freeing them up to focus on writing new features. Real-time debugging without needing to add new code (and without having to remove that code later on), proactive alerting, and visibility into the code path that led up to the issue at hand all result in a significant increase in productivity.

Lightrun also supports StartApp in reducing much of the friction associated with incident resolution. Because it is completely integrated into the IDE, Lightrun enables developers to keep their fingers on the pulse of production systems without constant context switching. This streamlined approach has StartApp’s developers reporting less stress and a significantly improved developer experience during incident resolution.