Juggling loads of traffic and its unknowns
StartApp handles more than 30 billion requests every day. Handling this enormous amount of traffic is an uphill battle – one that is fraught with complicated, nuanced production issues.
Two key types of production issues that come with this level of scale are concurrency and parallelism problems. These issues appear only under a specific set of circumstances and are often very hard to replicate reliably in a local environment. Left unmitigated, they tend to lead to severe service disruptions, often causing data corruption and unexpected program behavior.
Software-defined caches are another pain point for StartApp’s developers. Some types of caches must remain immutable: once a value is inserted into the cache, it must not change. The developers discovered, however, that under specific configurations the values in the cache can indeed change. As a result, when the cache is “hit”, the information it returns is not the information the developers expect to see.
Investigating these types of issues is also very difficult since the cache gets “dirty” non-deterministically. In other words – while it’s relatively easy to identify that the information returned from the cache is incorrect, it’s hard to know which activities make the cache “dirty”.
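To make the "dirty" cache problem concrete, here is a minimal Python sketch of one common way a supposedly immutable cache can change after insertion: the cache stores a reference to a mutable object that the caller keeps modifying. This is an illustration only, not StartApp's code. The class names `NaiveCache` and `FrozenCache` and the sample keys and values are hypothetical.

```python
import copy
from types import MappingProxyType


class NaiveCache:
    """Hypothetical cache that stores the caller's object as-is.

    Anyone holding a reference to the value can change it after insertion.
    """

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store[key]


class FrozenCache:
    """Hypothetical cache that copies on insert and returns read-only views.

    Note: the read-only view protects only the top-level mapping; nested
    mutable values would need their own protection.
    """

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # Deep-copy so later changes to the caller's object cannot leak in.
        self._store[key] = MappingProxyType(copy.deepcopy(value))

    def get(self, key):
        return self._store[key]


def insert_then_mutate(cache):
    """Insert a value, mutate the caller's copy, and return what the cache holds."""
    profile = {"country": "US", "segment": "games"}
    cache.put("user:1", profile)
    profile["segment"] = "news"  # the caller keeps using its own object
    return cache.get("user:1")["segment"]
```

With `NaiveCache`, `insert_then_mutate` returns the mutated value, so a later cache hit no longer matches what was inserted. With `FrozenCache`, the inserted value survives, and a direct write to the cached entry raises `TypeError`.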
“Using Lightrun, we were able to dive deep into production issues instantly, with a single action. By reducing the number of steps in the debugging process, Lightrun helped us reduce our MTTR by 50%-60%. Lightrun is definitely an ideal addition to every company’s toolbox, and is especially helpful in investigating hard-to-replicate production issues.”
Leveraging Lightrun for real-time debugging, monitoring, and alerting without ever leaving the IDE
When StartApp learned of Lightrun’s approach to production debugging, they were immediately intrigued. They had, after all, been hard at work trying to untangle exactly the kind of issues Lightrun aims to solve. After a short exploration period, StartApp deployed Lightrun to a significant portion of their production services.
Previously, when developers wanted to add more visibility when a specific event occurs in a running application, they had to:
- Add a new piece of code that exposes some piece of information
- Ship the information produced to an external system, for example, Kibana
- Review the information in the external system
Lightrun replaces this entire process with a more proactive approach. Using Lightrun, StartApp’s developers now define conditions that describe the event they want to observe. When such a condition is met (i.e. the specific case is “caught”), Lightrun automatically pipes all the required information to the developers as an alert inside their IDE.
This new process is especially handy when debugging the previously mentioned cache issues. By placing Lightrun Snapshots on the relevant parts of the cache and inspecting the stack trace, the developers can now identify the exact flow that caused the cache to misbehave. Without Lightrun, capturing the same stack trace would take significantly longer, resulting in a longer MTTR and a decreased quality of service for their customers.
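The idea of capturing a stack trace only when a condition is met can be sketched in plain Python. This is not Lightrun's API and does not show how Lightrun Snapshots are implemented. The sketch hard-codes the condition into a hypothetical `SnapshotCache` class purely to illustrate why a conditional stack trace pinpoints the offending flow. All names and values below are assumptions made for the example.

```python
import traceback


class SnapshotCache:
    """Hypothetical dict-backed cache that records a 'snapshot' (old value,
    new value, and call stack) whenever an existing key is overwritten with
    a different value."""

    def __init__(self):
        self._store = {}
        self.snapshots = []

    def put(self, key, value):
        # The condition: an existing entry is about to change.
        if key in self._store and self._store[key] != value:
            self.snapshots.append(
                {
                    "key": key,
                    "old": self._store[key],
                    "new": value,
                    # One formatted string per frame, outermost call first.
                    "stack": traceback.format_stack(),
                }
            )
        self._store[key] = value

    def get(self, key):
        return self._store[key]


def load_config(cache):
    """Legitimate first insert; does not trigger a snapshot."""
    cache.put("timeout_ms", 200)


def buggy_refresh(cache):
    """Overwrites an entry that was supposed to stay immutable."""
    cache.put("timeout_ms", 500)
```

After calling `load_config(cache)` and then `buggy_refresh(cache)`, `cache.snapshots` holds a single entry whose `stack` names `buggy_refresh` as the caller, which is the kind of information that identifies the flow that made the cache "dirty".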
Lightrun also makes it much easier to get a grip on issues that appear only under specific circumstances. By placing a Lightrun agent in each of their data centers, StartApp’s developers were able to identify, in real time, issues that are isolated to a single data center and resolve them significantly faster.