Debugging Microservices in Production: An Overview
In my previous blog post “Debugging Microservices: The Ultimate Guide” I discussed the challenges of debugging microservices in your local development environment, and presented the tools most developers use to understand production issues. This toolkit is far from complete, and does not resemble the polished, powerful experience that modern IDE debuggers provide. The tools are, however, constantly improving – making developers’ lives easier and this world a better place.
Ok, so those tools are what you have in hand while comfortably sitting back in your office chair, sipping slowly cooling coffee and eating cookies full of carbs & saturated fat. But what will you do when that dreaded support engineer call comes, telling you that your lovely microservice started dropping connections in your most valuable customer’s production environment?
Broadly speaking, most of the new debugging challenges you are expected to face with distributed microservices can be categorized either as networking or observability problems. Let’s have a look at each of these categories, starting with networking, and then examine the best ways to debug issues of both types.
Networking Challenges of Distributed Applications
Interservice communication in distributed systems is implemented either as synchronous request/response communication (REST, gRPC, GraphQL) or as asynchronous event-driven messaging (Kafka, AMQP and many others). Synchronous mechanisms are the clear winners – at least as of mid 2020 – because it is much easier to develop, test and maintain synchronous code. But what challenges and issues do they bring to the table? Let’s take a deeper look:
Inconsistent Network Layers
Your microservices might be deployed across different public clouds or on-prem, which means the networking layer a given service sits on top of can vary drastically between services. This is a frequent cause of sudden, non-reproducible timeouts, bursts of increased latency and low throughput – a sad daily routine, most of which is out of your control.
Microservices are dynamic, so the routing should be as well. A service cannot know where exactly in the topology its companion service is located, so tooling is needed to allow each service to dynamically discover its peers.
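To make that concrete, here’s a minimal, hypothetical sketch of the idea behind service discovery – a registry that services register with and resolve peers from. Real deployments use DNS, Consul, etcd, or the discovery built into their orchestrator, not anything hand-rolled like this:

```python
import random

class ServiceRegistry:
    """A toy in-memory service registry (illustrative only)."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" addresses

    def register(self, name, address):
        self._instances.setdefault(name, []).append(address)

    def deregister(self, name, address):
        self._instances.get(name, []).remove(address)

    def resolve(self, name):
        # Pick one registered instance at random – naive client-side
        # load balancing across whatever replicas exist right now.
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no instances registered for {name!r}")
        return random.choice(instances)

registry = ServiceRegistry()
registry.register("payments", "10.0.0.5:8080")
registry.register("payments", "10.0.0.6:8080")
print(registry.resolve("payments"))  # one of the two addresses
```

The point is that callers ask for a *name*, not an address – the mapping from name to healthy instances is allowed to change under their feet.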
Cascading Failures and Propagated Bottlenecks
Any microservice may start responding slower to network requests from other services because of high CPU, low memory, long-running DB queries and who knows what else. This may cause a chain reaction that slows down other services – creating even more bottlenecks – or makes them drop connections.
Error Recovery and Fault Tolerance
Microservices – by definition – have many more moving parts that can fail along the way than monolithic applications do. This makes graceful handling of the inevitable communication failures both critical and complicated.
Language-specific networking SDKs may handle various edge cases in different ways, which adds instability and chaos to interservice communication.
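A shared, well-tested retry helper is one way to tame those differences. Here’s a hedged sketch of retries with exponential backoff and jitter – the exception type, attempt count and delays are illustrative, not a recommendation:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay_s=0.05):
    """Retry a flaky call with exponential backoff plus jitter.

    Centralizing this logic avoids each language SDK handling
    transient failures in its own subtly different way.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts – surface the failure
            # Full jitter: sleep a random slice of the exponential window.
            time.sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))

# A downstream call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two retries
```

The jitter matters: without it, every caller retries at the same instant and hammers the recovering service in lockstep.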
Load Balancing Complexity
In the world of monolithic applications, traffic is primarily north-south (from the Internet to the application servers), and there are plenty of well-known solutions like API Gateways and load balancers that take care of the load. Microservice applications communicate with each other constantly, adding far more east-west traffic, which introduces an additional level of complexity.
One of the biggest advantages of the microservice approach, as we already mentioned, is independent scalability – each part of the system can be scaled on its own. Synchronous communication practically kills this advantage: if your API Gateway synchronously communicates with a database or any other downstream service, any peak in north-south traffic will immediately overwhelm those downstream services. As a result, every service down the road will need rapid, immediate scaling as well. Independently scalable, you say?
Difficult Security Configuration
East-west traffic requires many more SSL certificates, firewall rules and ACL policies to configure and enforce – non-trivial and error-prone work when done manually.
Networking – Summary
To sum it all up, one can say that implementing a synchronous style of interservice communication practically contradicts the whole point of breaking your monolith into microservices. Some even claim it turns your microservices back into a monolith, and I tend to agree. At the very least, synchronous RPC mechanisms introduce tight coupling, cascading failures and bottlenecks, and increase the load balancing and service discovery overhead. None of this scales well.
Do you still have the courage for this mess? If you do, let’s see what the industry currently has to offer as solutions for this kerfuffle.
For a synchronous request/response style architecture, a service mesh is the current de-facto standard solution. In a nutshell, a service mesh manages all the service-to-service, east-west traffic. It consists of two main parts – a data plane (the sidecar proxies deployed alongside each service, which act as the transport layer and move information between services) and a control plane (that configures and controls the data plane).
The idea of a service mesh is to offload all the interservice communication tasks and issues to a separate abstraction layer that takes care of all of this transportation hassle – allowing the microservice code to focus on business logic only. Typical service mesh solutions offer at least some of the following features:
- Traffic control features – routing rules, retries, failovers, dynamic request routing for A/B testing, gradual rollouts, canary releases, circuit breakers and so on
- Health monitoring – such as health checks, timeouts/deadlines, circuit breaking
- Policy enforcement – throttling, rate limits and quotas
- Security – TLS, application level segmentation, token management
- Configuration and secret management
- Traffic observability and monitoring – top-line metrics (request volume, success rates, and latencies), distributed tracing and more
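Circuit breaking shows up in both of the first two bullets, so it’s worth seeing roughly what a mesh sidecar does for you under the hood. This is a deliberately minimal sketch – the thresholds and states are simplified, and real implementations (e.g. Envoy’s) are far more nuanced:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    allow one probe call after a cooldown (half-open), close on success."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Fail fast instead of piling load onto a sick service.
                raise RuntimeError("circuit open")
            # Cooldown elapsed: half-open, let one probe request through.
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return result

breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30)

def broken():
    raise ConnectionError("downstream unreachable")

for _ in range(3):
    try:
        breaker.call(broken)
    except ConnectionError:
        pass  # three real failures trip the breaker

try:
    breaker.call(broken)
except RuntimeError as e:
    print(e)  # circuit open – the downstream isn't even called
```

The value of a mesh is that this behavior lives in the sidecar, uniformly, instead of being reimplemented in every service’s language.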
Looks like a silver bullet, doesn’t it? Indeed, service mesh solutions address most of the challenges, remove the need for costly API Gateways and load balancers for east-west traffic, standardize network issue handling and configuration across polyglot services, take care of service discovery, and even make you coffee – so:
Hold on. Let’s talk about pitfalls, because there are some:
- The technology is still in its early adoption phase, and is subject to constant, breaking changes
- Service meshes require an upfront investment in a platform, an expense that can be difficult to justify when applications are still evolving
- A performance penalty (both in network latency and runtime resource consumption) is inevitable, and practically unpredictable
- Increased operation complexity and R&D costs
- Service mesh functionality may duplicate existing application logic which may lead, for example, to redundant retries and duplicated transactions
- Multi-cluster topologies are generally not well supported
The most important drawback, though, is the lack of support for asynchronous, event-driven architectures. While we are not going to discuss synchronous vs asynchronous microservices communication here, the tl;dr is that the asynchronous approach suits the microservices paradigm much better (if that interests you, read these Microsoft & AWS blog posts to see why and how exactly it solves many of the challenges synchronous communication introduces). For now, let’s see what issues and challenges asynchronous communication brings with it:
- Distributed transactions – Because interservice communication happens over the network, the atomicity of DB operations cannot be enforced by a single database alone. You might need to implement an additional abstraction layer to enforce it, which is not a trivial task: a two-phase commit protocol can cause performance bottlenecks (or even deadlocks!), and the Saga pattern is pretty complicated – so data consistency issues are pretty common. Note that this is not a networking issue per se, and, strictly speaking, is also relevant for synchronous communication.
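To illustrate the Saga idea, here’s a stripped-down, hypothetical orchestration sketch: each step pairs an action with a compensation, and a failure rolls back the completed steps in reverse order. The step names are invented for the example:

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed
    steps in reverse order. Purely illustrative – a real saga also
    has to persist its progress and survive orchestrator crashes."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # best-effort rollback of what already ran
        raise

log = []

def fail_shipping():
    raise RuntimeError("shipping failed")

saga = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"),   lambda: log.append("refund card")),
    (fail_shipping,                       lambda: log.append("cancel shipment")),
]

try:
    run_saga(saga)
except RuntimeError:
    pass

print(log)  # ['reserve stock', 'charge card', 'refund card', 'release stock']
```

Note that a compensation (a refund) is a new business operation, not a true rollback – which is exactly why data consistency issues are so common here.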
- Message queues TCO – Message queues are not easy to integrate with, test, configure and maintain. Maintenance and configuration get much easier with managed solutions (e.g. AWS SQS and SNS), but then you may face budget and vendor lock-in issues.
- Observability – Understanding what’s going on in distributed, asynchronously communicating applications is hard. The three pillars of observability – logs, metrics, and traces – are extremely difficult to implement and manage in a way that will make sense for further debugging and monitoring (which is also true for synchronous communication, by the way).
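The heart of distributed tracing is propagating a trace context across service boundaries, so a collector can stitch spans from many services into one timeline. A toy sketch of that propagation – the `x-trace-id`/`x-span-id` header names are made up here; real systems use standard headers such as the W3C `traceparent` or Zipkin’s B3:

```python
import uuid

def start_span(headers=None):
    # Reuse the incoming trace id if present, otherwise start a new trace;
    # every span gets its own id within that trace.
    headers = headers or {}
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    return {"x-trace-id": trace_id, "x-span-id": span_id}

def handle_checkout(incoming_headers):
    ctx = start_span(incoming_headers)
    # Every outgoing call carries the same trace id, so a tracing
    # backend can correlate spans across services.
    return call_payments(ctx)

def call_payments(headers):
    return start_span(headers)  # child span, same trace id

edge = start_span()               # request enters the system
downstream = handle_checkout(edge)
print(downstream["x-trace-id"] == edge["x-trace-id"])  # True
```

The hard part in practice is doing this consistently everywhere – through queues, thread pools and third-party SDKs – which is why the instrumentation libraries below exist.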
To sum it up, asynchronous message queues are more of a solution than a problem, at least in comparison to the difficulties that come with synchronous communication. As for the issues they bring to the table – because of the inherent complexities, there are no drop-in, silver-bullet solutions that will solve them for you. The only thing you can actually do is make sure highly skilled developers are involved, and that they use at least some of the available microservices observability solutions. Let’s name just a few:
- If you are into an in-house solution, take a look at OpenTracing, a Cloud Native Computing Foundation incubating project, which is rapidly becoming the de-facto standard, vendor-neutral API for distributed tracing. It has libraries available in 9 languages, including Go, JS and Java.
- Some of the services mentioned in the previous post, like Logz.io and Datadog, are must-have, cost-of-playing solutions for centralized, correlated distributed tracing.
- Request tracing is a special case of distributed tracing which tracks a request’s flow across different systems and lets you know how long the request took in the web server, database, application code and so on – all presented along a single timeline. Jaeger and Zipkin are worth a look.
- Epsagon – a company that develops a lightweight agent SDK, currently supporting Node.js, Python, Go, Java and .NET, which provides automated instrumentation and tracing for containers, VMs, serverless and more – with no training, manual coding, tagging or maintenance required. Other cool features include the so-called Service Map, a visual architecture of your application where you can drill into metrics, logs and so on, as well as AI-based prediction and alerting of issues before they happen.
- Lumigo – a solution that implements similar features, but with a focus on serverless.
Lightrun also enables debugging microservices. You can add logs, performance metrics and breakpoints (snapshots) in real-time without having to issue hotfixes or reproduce the bug locally – which makes life much easier when debugging microservices. Request a demo to learn more.
I hope you enjoyed this long read and learned something useful to help you on this hard but exciting journey of developing microservices. Stay tuned for more, and let us know if we missed something!
It’s Really Not That Complicated.
You can actually understand what’s going on inside your live applications. It’s a registration form away.