- Industry: Tech
- Use case: Adding logs in real-time, on-demand
- Results: Identifying the root cause of an exception in minutes instead of cycles of redeployments that would have delayed a customer solution, adding a new development best practice
WhiteSource is a market leader in Software Composition Analysis (SCA), providing a security, compliance, and reporting solution for managing open source components. With more than 200 employees, and serving 1.3M developers and more than 20 Fortune 100 companies, WhiteSource has to manage its R&D resources effectively and ensure site reliability for its customers. WhiteSource is an agile company that deploys new software versions every two weeks.
The Challenge – Quickly Identifying an Error in Production
WhiteSource hit an exception in production, in a method that collected results from several threads, but could not immediately tell which of the tasks had failed. The stack trace pointed only to the line of code that collected the thread results.
[Stack trace excerpts not reproduced: the last stack frame and the third-party stack frame.]
WhiteSource could tell that the underlying exception was an SQL syntax error. However, the logs made it very difficult to tell which query was throwing the exception.
The existing logs were not informative enough to diagnose the problem in the version running in production. WhiteSource had to find another way to identify the root cause.
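To illustrate why a stack trace can end at the collection point, here is a minimal Java sketch. It assumes the tasks run on an `ExecutorService` and are collected with `Future.get()`, which the source does not confirm. The class name, method name, and task names are hypothetical. Keeping a label next to each future is one way to see which task failed when its exception surfaces during collection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not WhiteSource's code: runs named tasks and reports which one failed.
final class LabeledTaskCollector {

    /** Returns the name of the first task whose result could not be collected, or null if all succeeded. */
    static String findFailingTask(Map<String, Callable<Integer>> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Submit every task and remember its label alongside its Future.
            Map<String, Future<Integer>> futures = new LinkedHashMap<>();
            for (Map.Entry<String, Callable<Integer>> entry : tasks.entrySet()) {
                futures.put(entry.getKey(), pool.submit(entry.getValue()));
            }
            for (Map.Entry<String, Future<Integer>> entry : futures.entrySet()) {
                try {
                    // A worker's exception is rethrown here, at the collection point.
                    entry.getValue().get();
                } catch (ExecutionException ex) {
                    // getCause() is the exception that was thrown inside the worker thread.
                    System.err.println("Task '" + entry.getKey() + "' failed: " + ex.getCause());
                    return entry.getKey();
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while collecting results", ex);
                }
            }
            return null;
        } finally {
            pool.shutdown();
        }
    }
}
```

Without the label, the caller only sees that collecting some result failed, which matches the situation described above.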
In the past, WhiteSource had resorted to adding new logs to suspected lines of code in the next deployment. Occasionally, it took several iterations of adding logs to new versions before the issue was detected. They would then remove the logs in the following deployment to reduce overhead and logging costs.
This process would sometimes take weeks of iterations and many hours of developer time: waiting for the changes to be deployed to production, inspecting code behavior, and re-exploring the issue. The developer would also have to deal with a lot of context switching, since every iteration required re-reading the relevant code, recalling the earlier assumptions, and continuing from there. In the meantime, the version would keep running with the error.
Tom Shapira, Director of Software Engineering at WhiteSource: “Using Lightrun to debug an actual issue in production enabled us to react instantly. We were able to add the right logs and identify the root-cause in a real-time session, instead of waiting for redeployments.”
The Solution – Adding Logs with Lightrun On-demand
WhiteSource used Lightrun to dynamically add logs to each thread in production. They needed to identify the problematic query, among all the MySQL queries in their system.
With Lightrun, integrated into their IDE, WhiteSource was able to add these logs in real-time. The process was simple and only took them a few moments. They were then able to quickly identify where the problematic flow occurred and which lines were executed.
The Results – Exception Identification in Minutes
In just a few minutes, WhiteSource was able to identify the line of code that was generating the exception, which the logs they had originally added did not reach. As a result, WhiteSource was able to quickly catch and handle the bug. They discovered they had sent an empty collection of IDs to the query. In the future, they will add a check to see if the collection is empty before sending it to the query.
By using Lightrun, WhiteSource was able to quickly identify an error that would otherwise have taken cycles of redeployments to resolve. That process would have meant waiting for future deployments, which occur every two weeks, to add logs and recreate the issue. It might have taken multiple iterations to identify the root cause of the exception, followed by one last iteration to remove the logs. This is a very time-consuming and resource-heavy process. In some cases, it prevents full deployment of code versions.
In addition, the quick diagnosis led WhiteSource to a new best practice to add to their tool set, which will help ensure this error does not occur again.
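The empty-collection check can be sketched as follows. This is an illustrative Java example, not WhiteSource's code. It assumes the IDs were expanded into an `IN (...)` list, which the source does not state; MySQL rejects an empty `IN ()` as a syntax error, which would be consistent with the error described. The class name, method name, table, and columns are hypothetical.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical query builder: refuses to build a query for an empty ID collection.
final class IdQueryBuilder {

    /**
     * Builds a parameterized SELECT with one "?" placeholder per ID.
     * Returns Optional.empty() when there are no IDs, so the caller can skip
     * the database call instead of sending "IN ()" to MySQL.
     */
    static Optional<String> buildSelectByIds(Collection<Long> ids) {
        if (ids == null || ids.isEmpty()) {
            return Optional.empty();
        }
        String placeholders = ids.stream()
                .map(id -> "?")
                .collect(Collectors.joining(", "));
        // "components", "id", and "name" are placeholder names for illustration only.
        return Optional.of("SELECT id, name FROM components WHERE id IN (" + placeholders + ")");
    }
}
```

A caller would run the query only when the returned `Optional` is present, and treat the empty case as an empty result set.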