Understanding the real state of production services
Taboola’s production environment is particularly dynamic. With many different features in development simultaneously to accommodate the needs of the business, their developers push a large number of changes on a daily basis. To keep up with demand, Taboola developers push changes directly into production and use advanced feature-flagging techniques to verify their code works as expected.
Each of Taboola’s servers performs hundreds of thousands of queries per second (QPS) to support the roughly 0.5 million requests that hit Taboola’s servers every second. To handle that load, Taboola built a data center that hosts heavy-duty servers with up to 2 TB of RAM each. Since most developer laptops simply can’t handle the loads that these machines deal with daily, developers often have trouble reproducing production issues in their local environment.
In addition, fast, repetitive code changes imply fast, repetitive deployments. With a deployment cycle that takes anywhere from 30 minutes to an hour, a high pace of changes, and the aforementioned bug reproduction challenges, rebuilding and redeploying while tackling a complicated issue can take hours. Because of Taboola’s vast user base, each minute of downtime costs thousands of dollars in lost revenue, so Taboola needed fine-grained controls to continuously guarantee quality, reliability, and security.
Taboola was looking for a solution that would enable them to make sure each released version works without a hitch. Ultimately, they wanted a tool that would allow them to troubleshoot issues and validate feature behavior in production services in a quick, developer-friendly way. The tool needed to show whether a new feature holds up in production, and to help them perform fast root cause analysis with as few unnecessary context switches as possible.
“Lightrun has been a game-changer for us. With Lightrun we shortened our development process significantly by skipping iterative deployment cycles when adding logs and metrics. A day’s work turned into just one hour. Lightrun provided us with new observability into our production environment that was not accessible to us beforehand. Lightrun is a key component in our developer toolset here at Taboola and one of our development best practices.”
Rami Stern, R&D Infrastructure Team Leader
Real-time production debugging with Lightrun
The difference between lost revenue and happy customers for Taboola is a speedy incident resolution process. However, production issues come in many different shapes and forms, and not all of them are easy to resolve quickly.
For example, when working on logically complicated flows, it’s often difficult to understand the code paths that various requests take on each run. Using Lightrun, Taboola developers can now issue conditional snapshots that capture the state of a particular request. To isolate that request, a developer can supply any valid Java expression (as complex as it may be), and a snapshot is taken only when that expression evaluates to true. This gives developers the information relevant to that request, and only that information, without sifting through endless screens in their logging systems.
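To make the idea of a condition concrete, here is a minimal sketch in plain Java. It does not use Lightrun's API; the `Request` type, its fields, and the `snapshotIf` helper are all hypothetical names invented for illustration. With a tool like Lightrun the expression would be supplied at runtime rather than compiled into the service, but the filtering logic is the same: state is captured only when the boolean expression is true.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: plain Java that mimics the idea of a conditional
// snapshot. None of these names come from Lightrun's API.
class ConditionalSnapshotSketch {

    // Hypothetical request type, invented for this example.
    static final class Request {
        final String publisherId;
        final int itemCount;
        final boolean servedFromCache;

        Request(String publisherId, int itemCount, boolean servedFromCache) {
            this.publisherId = publisherId;
            this.itemCount = itemCount;
            this.servedFromCache = servedFromCache;
        }
    }

    // Captured snapshots; each entry is a copy of one request's fields.
    static final List<Map<String, Object>> snapshots = new ArrayList<>();

    // The condition: an ordinary boolean Java expression over the request.
    static boolean isInteresting(Request r) {
        return r.publisherId.equals("pub-42") && r.itemCount > 10 && !r.servedFromCache;
    }

    // Record the request's state only when the condition evaluates to true.
    static void snapshotIf(Predicate<Request> condition, Request r) {
        if (!condition.test(r)) {
            return; // condition is false: nothing is captured
        }
        Map<String, Object> state = new LinkedHashMap<>();
        state.put("publisherId", r.publisherId);
        state.put("itemCount", r.itemCount);
        state.put("servedFromCache", r.servedFromCache);
        snapshots.add(state);
    }

    public static void main(String[] args) {
        List<Request> traffic = List.of(
            new Request("pub-42", 25, false), // matches every clause
            new Request("pub-7", 25, false),  // different publisher
            new Request("pub-42", 3, false),  // too few items
            new Request("pub-42", 25, true)   // served from cache
        );
        for (Request r : traffic) {
            snapshotIf(ConditionalSnapshotSketch::isInteresting, r);
        }
        System.out.println("captured " + snapshots.size() + " of " + traffic.size() + " requests");
    }
}
```

Only the first request satisfies every clause of the condition, so it is the only one whose state is recorded; the other three pass through untouched.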
Performance bottlenecks are another common form of production issue. To understand which part of the system is causing the latency, developers commonly collect and visualize metrics while troubleshooting.
Using Lightrun, developers can insert TicTocs (real-time, on-demand metrics) to measure how long a certain piece of code takes to execute. Lightrun offers several such code-level metrics (method durations, counters, etc.) that are extremely valuable in identifying bottlenecks during real-time sessions.
Previously, developers often found themselves adding metrics directly into the code. When left unattended, these metrics bloat the codebase and add overhead to each transaction because of the cost of instrumentation, so developers had to remove them in the next deployed version. Lightrun allows real-time, on-demand addition and removal of metrics, where each metric has only a negligible performance footprint.
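For comparison, the hand-written instrumentation described above usually looks something like the following sketch. It is plain Java built on `System.nanoTime()`, not Lightrun's API, and the `timed` helper and `sumUpTo` workload are hypothetical names used only for illustration. The counters are not thread-safe; the sketch assumes a single thread.

```java
import java.util.function.Supplier;

// Illustrative only: the kind of in-code timing that has to be written,
// deployed, and later removed by hand. Not Lightrun's API.
class TicTocSketch {

    // Accumulated duration and number of timed calls (single-threaded use only).
    static long totalNanos = 0;
    static long calls = 0;

    // Run the block, measuring the wall-clock time between "tic" and "toc".
    static <T> T timed(Supplier<T> block) {
        long tic = System.nanoTime();
        try {
            return block.get();
        } finally {
            long toc = System.nanoTime();
            totalNanos += toc - tic;
            calls++;
        }
    }

    // Hypothetical workload standing in for the code under investigation:
    // the sum of the integers 1..n.
    static long sumUpTo(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            timed(() -> sumUpTo(1_000_000));
        }
        System.out.println("calls: " + calls
            + ", average micros per call: " + (totalNanos / calls) / 1_000);
    }
}
```

Every call site wrapped this way ships with the service and keeps paying the measurement cost until someone removes it in a later release, which is the overhead that on-demand metrics are meant to avoid.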
As mentioned above, Taboola also uses feature flagging and progressive delivery to safely test the behavior of new features in production without affecting the entire customer base. When Taboola pushes a new version to production, they first release it to a small subset of customers. They then use Lightrun logs and snapshots to verify that the behavior of the new feature, the state of the application, and the path the code takes are all as expected. They also use Lightrun metrics to verify that the performance of the newly introduced code meets the required benchmarks. If all expectations are met, they gradually roll the version out to the rest of the customer base.
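The article does not say how Taboola selects the initial customer subset. One common way to implement a gradual rollout, shown here purely as an assumption, is to hash each customer ID into one of 100 stable buckets and enable the feature for buckets below the current rollout percentage. The class and method names below are hypothetical.

```java
// Illustrative only: a common percentage-based rollout technique.
// This is an assumed approach, not Taboola's documented mechanism.
class ProgressiveRolloutSketch {

    // Map a customer ID to a stable bucket in the range 0..99.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int bucketOf(String customerId) {
        return Math.floorMod(customerId.hashCode(), 100);
    }

    // The feature is on for customers whose bucket falls below the percentage.
    static boolean isEnabled(String customerId, int rolloutPercent) {
        return bucketOf(customerId) < rolloutPercent;
    }

    public static void main(String[] args) {
        int[] stages = {5, 25, 100}; // widen the rollout step by step
        for (int percent : stages) {
            int enabled = 0;
            for (int i = 0; i < 10_000; i++) {
                if (isEnabled("customer-" + i, percent)) {
                    enabled++;
                }
            }
            System.out.println(percent + "% stage: " + enabled + " of 10000 customers enabled");
        }
    }
}
```

Because a customer's bucket never changes, anyone enabled at an early stage stays enabled as the percentage grows, so the behavior verified on the small subset carries over to later stages.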
With Lightrun, teams were able to get to the root cause of issues in record time. Hidden, implicit backend issues that took up to two weeks to mitigate in the past were now resolved in under an hour using Lightrun logs, snapshots, and metrics – without ever interrupting a running production service.