Remote Debugging in a WFH era

Jan 23, 2021

Lightrun Team

While remote work ain’t new for developers, an overwhelming number of tech companies still had onsite developers when the pandemic hit. COVID-19 threw the software development world into a spiral, and Microsoft and other tech giants led the move to remote work. Remote development now looks like the new norm.

Working remotely introduced multiple changes to the development process, particularly in the way developers communicate and collaborate with each other. Existing solutions that were not designed for a remote setup heightened organizations’ security concerns, and those organizations started to look for ways to mitigate the risk while addressing their developers’ needs.

An existing solution – the remote debugger – turned out to be suitable for WFH and quickly gained popularity.

What’s a remote debugger?

The debugger is a fundamental development tool that can connect to a running application and let developers explore its state and behavior, one breakpoint at a time. Remote debuggers are relatively new and can do the same for applications that are running on different servers (or in the cloud).

Using an old tool in a new world

With the increasing popularity of remote debuggers, new challenges arise in a distributed production setting:

Breakpoints pause the active application – step-by-step analysis and variable inspection are great for local debugging, but doing the same with a live application disrupts the customer experience. Imagine not being able to perform a Google search because someone placed a breakpoint in production code. In addition, pausing execution at a breakpoint changes the application’s timing, which reduces our chances of reproducing a bug, and useful data that could be helpful down the road (such as method execution time or the exact exit time) is not recorded.
Remote debuggers are insecure by design since they use a known port to communicate with client tools (such as an IDE). Combined with the fact that remote debuggers have neither an authentication and authorization scheme nor a built-in audit trail, we end up with a pretty major security backdoor that hackers could take advantage of. They could retrieve code-level information or, worse, control the service behavior. This is obviously unacceptable.
Remote debuggers violate the SOC2 compliance standard (and similar standards) since they rely on direct communication with production systems. Furthermore, remote debuggers disrupt the segregation of duties between production and development, which conflicts with common enterprise security policies.
During the development phase, only a single application instance and server (usually the developer machine) are required. In a cloud-native era where microservices are the new norm and container orchestrators such as Kubernetes (K8s) are all the rage, more often than not, multiple instances of the application are at play. This surfaces two concerns – identification and concurrency. High-scale, microservices-based production applications utilize numerous threads running inside multiple instances to digest application data concurrently while load balancers keep the entire system in equilibrium. With such a complex setup, it’s often hard to identify (and attach to) the offending instance. In addition, when troubleshooting code in a multi-tenant environment, investigating a specific request must not affect other users. Adding a breakpoint to a busy live instance is not only unreasonable but most likely forbidden.
Legacy debugging practices are non-collaborative, which contradicts agile development. Since multiple engineers are often required to resolve an incident, this could be a serious barrier and will likely extend the incident response time. This challenge is magnified for remotely distributed teams.
Remote debuggers are relatively slow due to their design and the underlying protocol. Since debugging protocols were designed to be highly interactive, they are chatty (i.e., they use a large number of requests and responses). Chatty protocols tend to be super slow over high-latency networks.
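
To put rough numbers on the latency point above, here is a minimal back-of-the-envelope sketch. It assumes every debugger request blocks until its response arrives, so the waiting time is simply the number of round trips multiplied by the round-trip time. The function name and all figures (200 round trips, 0.5 ms on a local network, 80 ms over a WAN) are illustrative assumptions, not measurements of any particular debugger or protocol.

```python
def debug_session_ms(round_trips: int, rtt_ms: float) -> float:
    """Network wait time if every request blocks on its response."""
    return round_trips * rtt_ms


# Assumed figures, for illustration only.
ROUND_TRIPS = 200   # hypothetical number of requests in a short debugging session
LAN_RTT_MS = 0.5    # hypothetical round-trip time on a local network
WAN_RTT_MS = 80.0   # hypothetical round-trip time from a home office to the cloud

lan_ms = debug_session_ms(ROUND_TRIPS, LAN_RTT_MS)
wan_ms = debug_session_ms(ROUND_TRIPS, WAN_RTT_MS)

print(f"LAN: {lan_ms:.0f} ms, WAN: {wan_ms:.0f} ms")
# prints "LAN: 100 ms, WAN: 16000 ms"
```

Under these assumptions the same session goes from a tenth of a second of waiting to sixteen seconds, purely because of the number of round trips.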

Introducing a new concept – Production Debugger

Given the rapid changes to the modern development environment, rather than solving old tool issues, it might be worth taking a fresh look and designing a new solution. The ideal solution would take into consideration current production environments, the speed at which developers are expected to resolve issues, enterprise security constraints, and regulations.

For lack of a better term, let’s name the solution: Production Debugger. The main goal is to improve the debugging process and reduce the average time to resolve production issues. Production debuggers should provide contextual, code-level information without pausing the service.
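
As a rough illustration of "code-level information without pausing the service", the sketch below records a function's arguments and execution time and lets the call carry on as usual. It is only a toy: the `snapshot` decorator, the `SNAPSHOTS` list, and the `checkout` function are hypothetical names invented for this example, and a real production debugger would be expected to add this kind of capture to a running process rather than require a code change.

```python
import functools
import time

# Hypothetical in-memory sink; a real tool would ship records elsewhere.
SNAPSHOTS = []


def snapshot(func):
    """Record arguments and duration of each call without pausing it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised.
            SNAPSHOTS.append({
                "function": func.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })
    return wrapper


@snapshot
def checkout(cart_total, discount=0.0):
    # Hypothetical business logic standing in for production code.
    return round(cart_total * (1 - discount), 2)


result = checkout(200, discount=0.5)
print(result, SNAPSHOTS[-1]["function"], SNAPSHOTS[-1]["args"])
```

The caller still gets its result immediately; the captured record is a side product that can be inspected later.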

The ideal Production Debugger should:

Not disrupt the user experience by pausing the application, but rather let developers (non-intrusively) define the information to be collected from the application during execution.
Provide more granular information regarding all variables and support different observability pillars such as logs, metrics, traces, expression evaluation, and variable exploration. Take, for example, performance analysis combined with other common production issues such as race conditions and infrastructure-based latency. Inspecting local variable values and the current stack trace is useful for understanding the application context, but insufficient for proper performance analysis. Knowing the time it takes to execute a function or the size of an in-memory data structure would help to resolve production performance issues significantly faster.
“Think distributed” – Modern software environments involve constant container restarts and active applications that are decoupled from the server’s bare metal. Detecting an offending application instance inside a Kubernetes cluster and sending a request for information to every container in every pod could get complicated. The ideal debugger should treat all instances of the underlying application code as a single entity and allow debugging all of them simultaneously, instead of demanding that one instance be selected. In other words, eliminating the prerequisite of selecting application instances to debug would greatly simplify code maintenance.
Support secured, managed, and “read-only” access – production-grade applications are precious, have access to highly sensitive data, and as such should be handled with great care. In order to comply with enterprise-level security standards and controls, developers accessing production systems must be authenticated and restricted to read-only access. In addition, in order to ensure application stability, the debugger must have a negligible footprint that does not affect the application performance over time.
Be very fast – the ideal protocol should minimize the number of requests and their size. In other words, fewer requests containing richer information would reduce the number of (client-server) round trips and significantly speed up the process.
Be designed for agile development and collaboration, enabling multiple developers to debug simultaneously while making all actions immediately visible to the development team involved, regardless of their location. A tool that empowers developers to better collaborate on incident response can reduce at least some of the stress inherent to this ongoing transition.
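
The “think distributed”, read-only, and fewer-round-trips points above can be sketched together in a few lines. The example below treats every replica as part of one logical application: a single request names several values of interest, is fanned out to all instances at once, and comes back as one report. `Instance`, `collect_from_all`, the pod names, and the state keys are hypothetical stand-ins invented for this sketch, not the API of any real debugger or of Kubernetes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one running replica of the application."""
    name: str
    state: dict

    def collect(self, keys):
        # Read-only: builds a new dict and never writes to self.state.
        return {key: self.state.get(key) for key in keys}


def collect_from_all(instances, keys):
    """Ask every instance for the same values in one fan-out request."""
    instances = list(instances)
    with ThreadPoolExecutor(max_workers=max(1, len(instances))) as pool:
        results = pool.map(lambda inst: (inst.name, inst.collect(keys)), instances)
        return dict(results)


# Hypothetical replicas with made-up state.
replicas = [
    Instance("pod-a", {"queue_depth": 3, "last_error": None}),
    Instance("pod-b", {"queue_depth": 97, "last_error": "timeout"}),
    Instance("pod-c", {"queue_depth": 5, "last_error": None}),
]

report = collect_from_all(replicas, ["queue_depth", "last_error"])
worst = max(report, key=lambda name: report[name]["queue_depth"])
print(worst, report[worst])
```

Nobody had to guess which instance to attach to: the offending replica falls out of the combined report, and each instance answered a single request that carried every value of interest.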

Disclaimer: Lightrun was built from the ground up to support the principles above. We offer an innovative approach to remote debugging with better and faster visibility into running applications to enable the quickest incident resolution. Request a demo to learn more.