13-Jan-2021 | Best Practices, Debugging | 11 min read


While remote work has been a constant theme in developer communities for quite a while, an overwhelming majority of software companies (and with them, an overwhelming majority of developers) were still working from central offices until the pandemic hit. COVID-19 threw the world of software development into a spiral.

And now, it looks like remote is here to stay. As reported not long ago by The Verge, Microsoft is letting more employees work from home permanently. And they’re not the only ones.

This transition to a more permanent work-from-home (WFH) setup brings a sizable set of changes for software development companies and the developers working for them. These changes affect the primary communication channels between team members, the ways in which developers collaborate to solve issues together, and – from the organization’s perspective – the security landscape. With developers spread around the world instead of working from the same office, security concerns have grown dramatically, practically overnight.

With such extreme changes in circumstances expected to persist for the long run, organizations must face the challenge head-on, admit when parts of the current stack simply aren’t working anymore, and find alternative solutions. One of the tools that had already started falling behind even before this big shift is the remote debugger; naturally, its faults have become even more obvious since the start of this new age.

What’s a remote debugger anyway?

A debugger is a piece of software that is attached to a running application ad-hoc, enabling the developer to explore the current state of the application, one breakpoint at a time.

Remote debugging is very similar to regular, local debugging, except that the debugger attaches to an application running on a remote machine rather than on the developer’s own. Remote applications, in this context, are typically those hosted on customer-facing production servers.
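
To make this concrete, here is roughly what enabling a remote debugging agent looks like for a JVM application, using the standard JDWP agent as an example (the port and application name below are placeholders):

    # Start the service with the JDWP agent listening on TCP port 5005.
    #   server=y       - the agent waits for a debugger client (e.g., an IDE) to connect
    #   suspend=n      - the application starts without waiting for that client
    #   address=*:5005 - accept connections on port 5005 from any interface (JDK 9+ syntax;
    #                    older JDKs use address=5005)
    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar my-service.jar

The developer then opens a remote debug session in the IDE, points it at that host and port, and places breakpoints as if the application were running locally. Note that JDWP itself performs no authentication, a detail that becomes important below.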

An old tool in a new world

There are some underlying issues with using a remote debugger in a production setting:

  1. A breakpoint, as the name implies, stops the running service – allowing developers to perform step-by-step analysis, including inspection of variable values and other key pieces of information relevant to the current application state. This is great for local debugging but is very disruptive when working on live applications in a production setting.

     The key issue is, of course, customer experience. Imagine not being able to search for something on Google because someone placed a breakpoint somewhere in the code. But the problems don’t stop there – by placing a breakpoint, you are essentially stopping the application from running in production in order to understand an issue that only happens when running in production.

     Not only is there no guarantee you’ll even be able to witness the problem under such circumstances, but you’ve also stopped yourself from gathering temporal data. For example, because you’ve stopped the process, there is no way to know the execution time of a certain method or the exact time it exited (both of which can be highly valuable when trying to resolve performance problems).
  2. Remote debuggers are insecure by design. In most cases, a client tool (such as an IDE) interacts with the remote debugger’s agent over a TCP port. This port is intentionally left open in order to accept incoming connections from the client.

     At the same time, most remote debuggers have no authorization or authentication scheme. With an open TCP port and no identity and access management, anyone can access the remote debugger’s port, retrieve code-level information about the running service and – more importantly – stop and start it at will. Running such an agent therefore exposes your organization to countless vulnerabilities and introduces new security and performance risks. Furthermore, since most remote debuggers don’t provide a built-in audit trail, it’s practically impossible to follow the path of a possible attacker. As the system’s owner, you have no visibility into who accessed the production service, when they accessed it, or what they did while connected to it.

     In addition, not every developer on the team should have the ability to debug a production service directly. Those who do should only be assigned privileges that align with their role in the organization. Product owners, for example, should have full access to the service; other stakeholders might need to be limited in some way, perhaps by a mandatory timeout for each debugging session or by being restricted to read-only actions.

     When working with sensitive resources like production services, it’s crucial to have this type of fine-grained control in order to maintain the security of the resource.
  3. The direct communication with a production system that a remote debugger requires breaks compliance with standards such as SOC 2, which enforce important access (and other security) controls.

     Furthermore, the remote-debugger workflow disrupts the segregation of duties between production and development. In most enterprise contexts this is a no-go, meaning the tool can’t be used at all due to policy constraints.
  4. When developing software, one normally works against a single instance of the developed application (usually a process or a set of processes running on the developer’s own computer).

     In practice, in a world where cloud-native microservices are quickly becoming the new norm and container orchestrators like Kubernetes (K8s) are all the rage, you’re usually not debugging a single process running on a single, readily accessible machine. More often than not, you’re trying to solve a distributed systems problem. To better understand this, we can break it down into two different (but related) concerns – identification and concurrency.

     First of all, when working with any high-scale, microservices-based production application, you’re working with multiple threads digesting data concurrently, running inside multiple instances of the same app, behind load balancers that direct traffic in order to keep the entire system in a state of equilibrium. With this kind of setup, it’s often hard to identify the exact instance the offending code is running in so that you can attach the remote debugger (see the sketch after this list).

     Moreover, in multi-tenant environments, if you’re troubleshooting a specific transaction or request, you want to make sure you’re not affecting all the other users of the system. Adding a breakpoint to a live instance that is potentially serving thousands or tens of thousands of requests is not only unreasonable but sometimes explicitly forbidden – either by the company security policy or by various SLAs.
  5. Using a remote debugger is an anti-collaborative, single-person process. Not very agile, to say the least. Since the remote debugger stops the running process, only one such debugger can be attached at a time, meaning only one developer can actually work on the problematic process at any point in time.

     Since multiple engineers are often involved in the resolution of an incident (not every engineer has deep knowledge of every part of every system in a given company), this can cause a lot of back-and-forth communication that wastes precious incident response minutes. This is especially true when employees are working in a distributed, remote setting.
  6. Remote debuggers are relatively slow by the nature of their protocol, which is even more noticeable in a world where most developers work from home.

     Most remote debugging protocols are built around a request-response model and are designed to be highly interactive – producing a large number of requests for even a single breakpoint in the IDE. Moreover, every time you expand a complex object in order to explore it, more and more requests are made – one for each additional piece of information. The overall result is an increase in the latency between the time the information is requested and the time it becomes available.
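
To make the identification problem from point 4 concrete, here is roughly what attaching a remote debugger to a single replica on Kubernetes tends to look like. The service name, label, pod name, and port below are hypothetical placeholders:

    # List the replicas of a (hypothetical) checkout service
    kubectl get pods -l app=checkout

    # Pick one pod - hoping it's the one actually serving the problematic
    # requests - and forward its debug port to your own machine
    kubectl port-forward pod/checkout-6f7b9d8c4-abcde 5005:5005

    # ...then attach the IDE to localhost:5005. If the bug surfaces on a
    # different replica, or the pod gets rescheduled, start over.

Even once the tunnel is in place, the breakpoint still freezes that replica for every request routed to it, which is exactly the disruption described in point 1.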

Defining a new solution – a Production Debugger

Rather than trying to patch the underlying problems of remote debuggers, then, let’s define what a good solution looks like from the ground up. Such a solution takes into consideration (in addition to the need for better collaboration tools in today’s remote-first world) the new environments in which we run our services, the speed at which we need to fix them when they break, and the security standards and regulations we must comply with in the modern software business landscape.

One might refer to such a solution as a production debugger. Using a production debugger is, in many ways, a re-imagination of the process of debugging – one that enables significantly quicker iterations. Rather than debugging breakpoint-by-breakpoint when the process is stopped, a production debugger allows you to get contextual, code-level information while the process is running.

Let’s take a look at a few things that characterize this new solution. This tool must:

  1. Be non-blocking – being active should never mean being disruptive, which is why it makes no sense to stop services just to get more information.

     As we mentioned before, a production debugger does not stop the running process but instead allows you to (non-intrusively) define the information you want to collect from the application at runtime.
  2. Emit more granular information – a production debugger allows for all three pillars of observability – logs, metrics, and traces – and not just expression evaluation and variable exploration.

     Consider the issue of performance bottlenecks mentioned above, for example. Now consider it in conjunction with other common production issues like race conditions and other forms of infrastructure-based latency. Inspecting the values of local variables and the contents of the current stack trace is great for understanding the context the application is running in. In order to debug performance or latency issues, however, that information is simply not enough.

     To better understand these issues, we clearly need better tooling, tooling that is dedicated to the collection of temporal data. For example, we’d like to be able to measure the time it takes a method to execute or the size of a specific data structure in memory. These can be acquired by instrumenting code-level metrics – such as method duration measurements, counters, and other custom metrics (see the sketch after this list).

     If developers could inject these types of code-level metrics into the various paths the code might take, performance issues could be hunted down and mitigated significantly faster.
  3. “Think distributed” – instead of debugging each replica of your service independently, debug all of them at once by issuing a request to all of them simultaneously. For example, instead of finding the exact instance your service is running on inside a Kubernetes cluster, ship a request for more information to every container in every pod that is running the application.

     In the environments our software runs in today, containers are spinning up and down all around you, and the running application is decoupled from the bare metal it’s running on. With that in mind, you shouldn’t communicate with your application by referring to its distinct pieces. Instead, consider all of its underlying components as a single entity. This removes a lot of the complexity of maintaining a distributed system.
  4. Have secured, managed, and read-only access – our production applications are precious resources, carefully built and maintained to produce as much value for our customers as possible.

     It makes little sense to introduce additional risk in the process of fixing them by allowing developers to change the application state without supervision. In order to adhere to enterprise-level compliance and security controls, all access must be read-only and only allowed after proper authentication. This allows the production debugger to maintain compliance with the leading standards and ensures the integrity of the service is not compromised.

     In addition, in order to ensure the stability of the running application, the tool must have a negligible footprint that does not affect the application’s performance over time.
  5. Be very fast – the protocol of the production debugger should reduce the number of round trips between the client and the application. In other words, the tool’s protocol should consist of a smaller number of requests, each containing richer information, thereby reducing the amount of interactivity in the process and speeding it up significantly.

     If every request-response cycle comes with a certain amount of latency attached to it, we should minimize that overhead by minimizing the number of such cycles. A protocol that acquires significantly more verbose data from the application in a single exchange helps in this effort.
  6. Be agile and collaborative – the production debugger should enable multiple developers to query the production service simultaneously. As such, production debugger actions should be immediately visible to all engineers currently working on the application.

     This is especially crucial in these problematic times, when many developers and organizations are adapting to working from home. A tool that empowers developers to better collaborate on incident response can reduce at least some of the stress inherent in this ongoing transition.
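
To illustrate the kind of temporal, code-level data described in point 2 (data collected while traffic keeps flowing, rather than at a breakpoint), here is a minimal Java sketch. The class and method names are hypothetical, and a real production debugger would add this kind of instrumentation dynamically at runtime instead of requiring a code change and a redeploy:

    import java.util.concurrent.atomic.LongAdder;

    public class CheckoutService {

        // Lightweight counters that aggregate in the background; reading or
        // updating them never blocks the request-handling threads.
        private static final LongAdder invocations = new LongAdder();
        private static final LongAdder totalNanos = new LongAdder();

        public void processOrder(String orderId) {
            long start = System.nanoTime();
            try {
                // ... the business logic being debugged ...
            } finally {
                // Record how long this invocation took instead of pausing the
                // process at a breakpoint to find out.
                invocations.increment();
                totalNanos.add(System.nanoTime() - start);
            }
        }

        // Average duration in milliseconds, available on demand at runtime.
        public static double averageMillis() {
            long count = invocations.sum();
            return count == 0 ? 0 : (totalNanos.sum() / (double) count) / 1_000_000;
        }
    }

The important property is that every measurement is emitted while the service keeps serving requests; no user transaction is ever held at a breakpoint, and the collected metrics can be inspected by everyone investigating the incident.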

Lightrun was built from the ground up on these principles, offering an innovative approach to debugging that allows for a quicker understanding of the context of your running applications and for faster incident resolution. Request a demo to learn more.

Tom is a Solution Engineer at Lightrun, where he works on re-shaping what production observability looks like. Tom was previously a site reliability engineer for a distributed systems startup, teaches technological prototyping for creatives at a local college's media lab, and is an avid explainer of all things tech.
