Debugging microservices in production
Author Lightrun Marketing

Debugging Microservices in Production: An Overview

In my previous blog post, “Debugging Microservices: The Ultimate Guide”, I discussed the challenges of debugging microservices in your local development environment, and presented the tools most developers use to understand issues. This toolkit is far from complete, and does not resemble the polished and powerful experience that modern IDE debuggers provide. The tools are, however, constantly improving – making developers’ lives easier and this world a better place.

Ok, so these tools are what you have at hand while comfortably sitting back in your office chair, sipping slowly cooling coffee and eating cookies full of carbs & saturated fat. But what will you do when that dreaded support engineer call comes, telling you that your lovely microservice started dropping connections in your most valuable customer’s production environment? What’s your strategy for production debugging?

Broadly speaking, most of the new debugging challenges you are expected to face with distributed microservices can be categorized either as networking or observability problems. Let’s have a look at each of these categories, starting with networking, and then examine the best ways to debug issues of both types.

Networking Challenges of Distributed Applications

Interservice communication in distributed systems is implemented either as synchronous request/response communication (REST, gRPC, GraphQL) or as asynchronous event-driven messaging (Kafka, AMQP and many others). Synchronous mechanisms are the clear winners in popularity, because synchronous code is much easier to develop, test and maintain. But what are the challenges and issues they bring to the table? Let’s take a deeper look:

Inconsistent Network Layers

Your microservices might be deployed in various public clouds or on-prem, which means the networking layer underneath each service can vary drastically from one service to the next. This is often the cause of sudden, non-reproducible timeouts, bursts of increased latency and low throughput. Such incidents are a sad daily routine, and most of them are out of your control.

Service Discovery

Microservices are dynamic, so the routing should be as well. It’s not clear to a service where exactly in the topology its companion service is located, so tooling is needed to allow each service to dynamically detect its peers.
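
To make the idea concrete, here is a minimal sketch of a heartbeat-based registry in Python. It is a toy under stated assumptions, not how any particular discovery tool works: the `ServiceRegistry` class, its method names and the TTL value are all hypothetical, and a real registry would be a replicated, networked service rather than an in-memory dictionary.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    """Toy in-memory registry (hypothetical): instances send heartbeats,
    and lookups return only instances seen within the TTL."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the sketch is easy to test
        # (service name, address) -> time of the last heartbeat
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, service: str, address: str) -> None:
        """Register an instance or refresh its lease."""
        self._last_seen[(service, address)] = self._clock()

    def lookup(self, service: str) -> List[str]:
        """Return addresses of instances whose lease has not expired."""
        now = self._clock()
        return sorted(
            addr for (name, addr), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

A caller would ask `lookup("orders")` before each request (or cache the answer briefly), so instances that stop sending heartbeats silently drop out of rotation.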

Cascading Failures and Propagated Bottlenecks

Any microservice may start responding slower to the network requests from other services because of high CPU, low memory, long-running DB queries and who knows what. This may end up causing a chain reaction that will slow down other services, thus causing even more bottlenecks, or making them drop connections.
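
One common way to stop such a chain reaction is a circuit breaker: after a few consecutive failures, callers stop hitting the struggling service for a while instead of piling on. The Python sketch below is illustrative only; `CircuitBreaker`, `CircuitOpenError` and the threshold values are hypothetical names and numbers, and the class is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means "closed"

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("downstream marked unhealthy")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast keeps the caller's own threads and connections free, which is what prevents the slowdown from propagating upstream.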

Error Recovery and Fault Tolerance

Microservices by definition have a lot more moving parts that can fail along the way in comparison to monolith applications. This makes graceful handling of inevitable communication failures both critical and complicated.
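
A typical building block for graceful failure handling is retrying with capped exponential backoff and jitter. The sketch below assumes the wrapped call signals transient trouble with Python's built-in `ConnectionError` or `TimeoutError`; the function name `call_with_retries` and its default values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], attempts: int = 4,
                      base_delay: float = 0.1, max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry func on transient network errors, waiting a random time up to
    a cap that doubles after each failure (full jitter)."""
    for attempt in range(attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("attempts must be at least 1")
```

The jitter matters: without it, many clients that failed together retry together, which can knock the recovering service over again.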

Different Languages

Language-specific networking SDKs may handle various edge cases differently, which adds instability and chaos to interservice communication.

Load Balancing Complexity

In the world of monolithic applications traffic is primarily north-south (from the Internet to application servers) and there are plenty of well-known solutions like API gateways and load balancers that take care of the load. Microservice applications communicate with each other constantly, adding far more east-west traffic, which introduces an additional level of complexity.

Scale Limitations

One of the biggest advantages of the microservice approach, as we already mentioned, is independent scalability – each part of the system can be scaled on its own. Synchronous communication literally kills this advantage: if your API gateway synchronously communicates with a database or any other downstream service, any peak load in the north-south traffic will overwhelm those downstream services immediately. As a result, all the services down the road will be in need of rapid and immediate scaling. Independently scalable, you say?
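
A back-of-the-envelope simulation shows the difference a buffer makes. With synchronous calls, the downstream service must be sized for the peak arrival rate; with a queue in between, it can be sized closer to the average and work off the backlog. The function below is a toy model with made-up numbers, not a benchmark, and `simulate_buffered` is a hypothetical name.

```python
from typing import List, Tuple


def simulate_buffered(arrivals: List[int], capacity_per_tick: int) -> Tuple[int, int]:
    """Return (peak backlog, ticks until drained) when a queue sits between
    the gateway and a downstream service with fixed capacity per tick."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    backlog = 0
    peak = 0
    ticks = 0
    pending = list(arrivals)
    while pending or backlog:
        if pending:
            backlog += pending.pop(0)            # requests arriving this tick
        backlog -= min(backlog, capacity_per_tick)  # work done this tick
        peak = max(peak, backlog)
        ticks += 1
    return peak, ticks


burst = [10, 100, 10]
print("capacity needed without a buffer:", max(burst))
print("peak backlog and drain time at capacity 40:", simulate_buffered(burst, 40))
```

The trade-off is latency: the burst is absorbed, but some requests wait in the queue for an extra tick or two.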

Difficult Security Configuration

East-west traffic requires a lot more SSL certificates, firewalls, ACL policy configuration and enforcement, which is non-trivial and error prone when done manually.

Networking – Summary

To sum it all up, one can say that implementing a synchronous style of interservice communication literally contradicts the whole point of breaking your monolith into microservices. Some even claim that it turns your microservices back into a monolith and I tend to agree. At the very least, synchronous RPC mechanisms introduce tight coupling, cascading failures and bottlenecks, and increase load balancing and service discovery overhead. All of this does not scale well.

Current Tools

Do you still have the courage for this mess? If you do, let’s see what the industry currently has to offer as solutions for this kerfuffle.

For a synchronous request/response style architecture, a service mesh is the current de-facto standard solution. In a nutshell, a service mesh manages all the service-to-service, east-west traffic. It consists of three main parts: a data plane (where the information that needs to be moved between services lives), a sidecar component (that serves as the transport layer), and a control plane (that configures and controls the data plane).

The idea of a service mesh is to offload all the interservice communication tasks and issues to a separate abstraction layer that takes care of all of this transportation hassle – allowing the microservice code to focus on business logic only. Typical service mesh solutions offer at least some of the following features:

  • Traffic control: routing rules, retries, failovers, dynamic request routing for A/B testing, gradual rollouts, canary releases, circuit breakers, and so on
  • Health monitoring: health checks, timeouts/deadlines, circuit breaking
  • Policy enforcement: throttling, rate limits and quotas
  • Security: TLS, application level segmentation, token management
  • Configuration and secret management
  • Traffic observability and monitoring: top-line metrics (request volume, success rates, and latencies), distributed tracing and more
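
To give a feel for what "policy enforcement" means in practice, here is a sketch of a token bucket, a classic algorithm for throttling and rate limiting. It is an illustration only: actual service meshes configure this declaratively rather than in application code, and the `TokenBucket` class and its parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `burst` requests, then a steady
    `rate_per_sec` requests per second."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = float(burst)
        self._tokens = float(burst)  # start with a full bucket
        self._clock = clock
        self._updated = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never above the burst capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._updated) * self._rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A proxy would call `allow()` per request and answer with an error such as HTTP 429 when it returns `False`.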

Looks like a silver bullet, doesn’t it? Indeed, service mesh solutions address most of the challenges, remove a need for costly API gateways and load balancers for east-west traffic, standardize handling network issues and configuration across polyglot services, take care of service discovery, and even make you coffee – so:

Shut Up And Take My Money!

Hold on. Let’s talk about pitfalls, because there are some:

  • The technology is still in its early adoption phase, and is subject to constant and breaking changes
  • Service meshes require an upfront investment in a platform, an expense that can be difficult to justify when applications are still evolving
  • Performance penalty (both in network latency and runtime resource consumption) is inevitable, and practically unpredictable
  • Increased operation complexity and R&D costs
  • Service mesh functionality may duplicate existing application logic which may lead, for example, to redundant retries and duplicated transactions
  • Multi-cluster topologies are generally not well supported

The most important drawback, though, is the lack of support for asynchronous event-driven architecture. While we are not going to discuss synchronous vs asynchronous microservices communication here, the tl;dr is that the asynchronous approach suits the paradigm of microservices much better. (If you’re interested, read these Microsoft & AWS blog posts to see why and how exactly it solves a lot of challenges that synchronous communication introduces.) For now, let’s see what issues and challenges asynchronous communication brings with it:

  • Distributed transactions. Because of the networking layer of interservice communication, the atomicity of DB operations cannot be enforced by a DB alone. You might need to implement an additional abstraction layer to enforce it, which is not a trivial task: a two-phase commit protocol can cause performance bottlenecks (or even deadlocks!), and the Saga pattern is pretty complicated – so data consistency issues are pretty common. Note that this is not a networking issue per se, and, strictly speaking, is also relevant for synchronous communication.
  • Message queue TCO. Message queues are not easy to integrate with, test, configure and maintain. Maintenance and configuration get much easier with managed solutions (e.g. AWS SQS and SNS), but then you may face budget and vendor lock-in issues.
  • Observability. Understanding what’s going on in distributed, asynchronously communicating applications is hard. The three pillars of observability – logs, metrics, and traces – are extremely difficult to implement and manage in a way that will make sense for further production debugging and monitoring (which is also true for synchronous communication, by the way).
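
Since the Saga pattern comes up above, here is its core idea in a few lines: each step has a compensating action, and when a step fails, the steps that already completed are undone in reverse order. This is a deliberately naive, in-process sketch; `run_saga` and the step names are hypothetical, and a production saga also has to persist its progress and cope with compensations that fail.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, compensate the completed
    steps in reverse order and report failure."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []

def decline_card() -> None:
    raise RuntimeError("card declined")

ok = run_saga([
    ("reserve stock", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge card", decline_card, lambda: log.append("refund")),
    ("ship order", lambda: log.append("ship"), lambda: log.append("cancel")),
])
print(ok, log)
```

Note that the failed step itself is not compensated and later steps never run, which is why only the stock reservation is rolled back here.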

To sum it up, asynchronous message queues are more of a solution than a problem, at least in comparison to the difficulties that come with synchronous communication. As for the issues they bring to the table – because of the inherent complexities, there are no drop-in silver bullet solutions that will solve them for you, so the only thing you can actually do is make sure that only highly skilled developers are involved, and that they use at least some of the available observability solutions for microservices. Let’s name just a few:

  • If you prefer in-house solutions, take a look at OpenTracing, a Cloud Native Computing Foundation incubating project, which is rapidly becoming the de-facto standard, vendor-neutral API for distributed tracing. It has libraries available in 9 languages, including Go, JavaScript and Java.
  • Some of the services mentioned in the previous post, like Datadog, are kind of must-haves, cost-of-playing solutions for centralized, correlated distributed tracing.
  • Request tracing is a special case of distributed tracing, which tracks request flow across different systems and lets you know how long a request took in a web server, database, application code and so on – all presented along a single timeline. Jaeger and Zipkin are worth a look.
  • Epsagon is a company that develops a lightweight agent SDK, which currently supports Node.js, Python, Go, Java and .NET. It provides automated instrumentation and tracing for containers, VMs, serverless and more with no training, manual coding, tagging or maintenance required. One of its greatest features is the Service Map: a way to create your application’s visual architecture where you can drill into metrics, logs, and so on. There’s also AI-based prediction and alerting of issues before they occur.
  • Lumigo is a solution that implements similar features, but with a focus on serverless.
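
Whatever tool you pick, the mechanism underneath distributed tracing is the same: every request carries a trace ID, and every hop records its own span ID plus the ID of the span that called it. The hand-rolled Python sketch below shows only that propagation step. It does not use the API of any of the tools above, and the `x-trace-id` and `x-span-id` header names are made up for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                    # shared by every span in one request flow
    span_id: str                     # unique per unit of work
    parent_id: Optional[str] = None  # span that caused this one, if any


def start_span(parent: Optional[SpanContext] = None) -> SpanContext:
    """Start a root span, or a child span that inherits the parent's trace ID."""
    span_id = uuid.uuid4().hex[:16]
    if parent is None:
        return SpanContext(trace_id=uuid.uuid4().hex, span_id=span_id)
    return SpanContext(parent.trace_id, span_id, parent.span_id)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Caller side: copy the context into outgoing request headers."""
    headers["x-trace-id"] = ctx.trace_id  # hypothetical header names
    headers["x-span-id"] = ctx.span_id


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Callee side: rebuild the caller's context from incoming headers."""
    if "x-trace-id" not in headers or "x-span-id" not in headers:
        return None
    return SpanContext(headers["x-trace-id"], headers["x-span-id"])
```

The receiving service calls `start_span(extract(headers))`, so its span lands in the same trace and points back at the caller, which is what lets a tracing backend draw the whole request on a single timeline.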

Lightrun also enables debugging microservices in production. You can add on-demand logs, performance metrics and snapshots (breakpoints that don’t stop your application) in real time without having to issue hotfixes or reproduce the bug locally – all of which makes life much easier when debugging microservices. You can start using Lightrun today, or request a demo to learn more.

In Summary

I hope you enjoyed reading this long article and learned something useful, which will help you in this hard, but exciting journey of developing microservices. Stay tuned for more and let us know if we missed something!

