  • Industry: Ad-Tech
  • Use case: Implementing Lightrun to get richer context when debugging live production incidents
  • Results: Start.io improved the efficiency of their incident resolution process and significantly decreased MTTR

How Start.io slashed MTTR by 50%-60% with Lightrun on AWS

On the edge of tech and innovation, Start.io has compiled an unmatched volume of intelligent mobile data insights to help content providers understand user behavior and needs. With 160 employees and 6 offices globally, Start.io partners with over 350,000 leading applications and works with a database of over 1 billion mobile users worldwide.

The Challenge

Juggling loads of traffic and its unknowns

Start.io handles more than 30 billion requests every day. Handling this much traffic is an uphill battle, one fraught with complicated, nuanced production issues. Two key types of production issues that come with this level of scale are concurrency and parallelism problems. These issues appear only under a specific set of circumstances and are often very hard to replicate reliably in a local environment. Left unmitigated, they tend to lead to severe service disruptions, often causing data corruption and unexpected program behavior.

Software-defined caches are another pain point for Start.io’s developers. Some types of caches must remain immutable: once a value is inserted into the cache, it must not change. The developers discovered, however, that under specific configurations the values in the cache can indeed change. The result is that when the cache is “hit”, the information it returns is not the information the developers expect to see. Investigating these issues is also very difficult, since the cache gets “dirty” non-deterministically. In other words, while it’s relatively easy to identify that the information returned from the cache is incorrect, it’s hard to know which activities make the cache “dirty”.
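To make the cache problem concrete, here is a minimal sketch of one common way a supposedly immutable cache can get “dirty”. It is an illustration only: the case study does not show Start.io’s code or name their language, so the class names (`LeakyCache`, `FrozenCache`), the key, and the values are all hypothetical. The first cache stores and returns a reference to a mutable list, so any caller that modifies a hit silently changes what later hits return. The second stores an immutable copy.

```python
class LeakyCache:
    """Hypothetical cache that is meant to be immutable but is not."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # Stores a reference to the caller's list, not a copy.
        self._store[key] = value

    def get(self, key):
        # Every hit returns the very same list object.
        return self._store[key]


class FrozenCache:
    """Hypothetical variant that stores an immutable copy of each value."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # Converting to a tuple copies the data and makes it read-only.
        self._store[key] = tuple(value)

    def get(self, key):
        return self._store[key]


def demo():
    leaky = LeakyCache()
    leaky.put("user-1", ["sports", "news"])
    hit = leaky.get("user-1")
    hit.append("games")  # some unrelated code path mutates the hit
    leaky_result = leaky.get("user-1")  # the cached value has changed

    frozen = FrozenCache()
    frozen.put("user-1", ["sports", "news"])
    frozen_result = frozen.get("user-1")  # tuples cannot be appended to
    return leaky_result, frozen_result
```

In the leaky version nothing fails at the point of mutation, which is why the incorrect value is easy to notice on a later hit but hard to trace back to the code path that caused it.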

“Using Lightrun, we were able to dive deep into production issues instantly, with a single action. By reducing the number of steps in the debugging process, Lightrun helped us reduce our MTTR by 50%-60%. Lightrun is definitely an ideal addition to every company’s toolbox, and is especially helpful in investigating hard-to-replicate production issues.”
Boris Shmerlin, Director of Advertising R&D at Start.io

The Solution

Leveraging Lightrun for real-time debugging, monitoring, and alerting without ever leaving the IDE

When Start.io learned of Lightrun’s approach to production debugging, they were immediately intrigued. They had, after all, been hard at work on the exact issues Lightrun aims to solve. After a short exploration period, Start.io deployed Lightrun to a significant portion of their production services. Previously, when developers wanted more visibility into a specific event in a running application, they had to:

  1. Add a new piece of code that exposes some piece of information
  2. Pour the information produced into an external system, for example, Kibana
  3. Review the information in the external system

Lightrun replaces this entire process with a more proactive approach. Using Lightrun, Start.io’s developers now define conditions that describe the event they want to catch. When that event is “caught” (i.e. when the condition is met), Lightrun automatically pipes all the required information to the developers as alerts inside their IDE.

This new process is especially handy when debugging the previously mentioned cache issues. By placing Lightrun Snapshots on the relevant parts of the cache and inspecting the stack trace, the developers can now identify the exact flow that caused the cache to misbehave. Without Lightrun, capturing the same stack trace would take significantly longer, resulting in a longer MTTR and a decreased quality of service for their customers.

Getting a better grip on issues that appear only under specific circumstances is also much easier with Lightrun. By placing a Lightrun agent in each of their data centers, Start.io’s developers were able to identify, in real time, issues that are isolated to a single data center and resolve them significantly faster.
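As a rough analogy for what a conditional snapshot on the cache captures, the sketch below records the call stack only when a condition is met, here when an existing cache entry is overwritten with a different value. It is plain Python using the standard `traceback` module and does not use or represent Lightrun’s API. Lightrun is described above as doing this on a running service without code changes, whereas this sketch hard-codes the check purely to show the kind of information involved. `WatchedCache`, `refresh_profile`, and the values are hypothetical names.

```python
import traceback


class WatchedCache:
    """Hypothetical cache that records who overwrote an existing entry."""

    def __init__(self):
        self._store = {}
        self.violations = []

    def put(self, key, value):
        # Condition: the key already holds a different value.
        if key in self._store and self._store[key] != value:
            self.violations.append({
                "key": key,
                "old": self._store[key],
                "new": value,
                # The call stack at the moment of the overwrite.
                "stack": traceback.format_stack(),
            })
        self._store[key] = value

    def get(self, key):
        return self._store[key]


def refresh_profile(cache):
    # Hypothetical code path that overwrites a cached value.
    cache.put("user-1", "segment-B")


def demo():
    cache = WatchedCache()
    cache.put("user-1", "segment-A")  # first insert, condition not met
    refresh_profile(cache)            # overwrite, condition met
    return cache.violations
```

The recorded stack names `refresh_profile` as the caller, which is the “exact flow” a developer would be looking for when the cache returns an unexpected value.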

The Results

50%-60% faster incident resolution using Lightrun

Start.io saves a lot of time by debugging with Lightrun, relieving their teams of unnecessary repetitive processes and freeing them up to focus on writing new features. Real-time debugging without needing to add new code (and without having to remove that code later on), proactive alerting, and visibility into the code path that led up to the issue at hand all result in a significant increase in productivity.

Lightrun also helps Start.io reduce much of the friction associated with incident resolution. Because it is completely integrated into the IDE, Lightrun enables developers to keep their fingers on the pulse of production systems without constant context switching. With this streamlined approach, Start.io’s developers report less stress and a significantly improved developer experience during incident resolution.

