21-Nov-2024

Author Eran Kinsbruner

Best Practices

Early Observability in Platform Engineering: Challenges and Solutions

Since the emergence of the cloud, the DevOps movement, and the rise of microservices, developers have been increasingly responsible for the operation of their software. “You build it, you run it” (YBYR) and “You build it, you operate it” (YBYO) have become common mantras in the software engineering industry. However, there’s a misunderstanding in this statement. Developers should remain focused on building software. Even if they can be responsible for operating their applications atop a platform, they are not responsible for creating and maintaining the platform. Asking them to do so is like asking a chef to build their own kitchen or grow their own vegetables—an unrealistic expectation that detracts from their primary role.

The discipline of Platform Engineering has emerged to address this gap. Gartner predicts that by 2026, 80% of large software engineering organizations will have a platform engineering team. This team is responsible, among other things, for providing a platform that enables developers to build, deploy, and operate their software: a self-service platform that abstracts away the underlying infrastructure and provides a set of tools and services that developers can use. This platform is a product in itself and needs to be built, run, and maintained according to the same principles that apply to any other software product.

The role of the Platform Engineering team has evolved to address the scalability challenges of DevOps, DevSecOps, and SRE practices. Using Infrastructure as Code (IaC), automation, and Internal Developer Platforms (IDPs), this team provides the self-service capabilities required to streamline the Software Development Life Cycle (SDLC). However, like any other solution to an engineering problem, Internal Developer Platforms come with their own set of challenges. For example, if a critical part of the platform isn’t working as expected, the platform will not be able to provide the services that developers need to make progress. When this happens, the very tools meant to empower developers can instead become bottlenecks, delaying delivery and increasing frustration.

Consider a scenario where a developer attempts to deploy and run a new version of their application but encounters failure. This could stem from the platform’s inability to provision the correct configurations, resources, or dependencies, or from unexpected application behavior. In either case, the developer is left in a quandary, uncertain of the cause or solution. While some fortunate developers might have access to logs or metrics to aid in diagnosis, more often than not, they must resort to contacting the platform engineering team. This escalation inevitably leads to a time-consuming and frustrating investigation process for all parties involved. The irony is palpable: Internal Developer Platforms are designed to accelerate software development and foster developer autonomy. Yet, if developers find themselves blocked by the very platform intended to assist them, it fundamentally undermines its raison d’être.

This situation can be addressed by implementing observability not just in the platform runtime, but early in the development process. In our example, if the developer had access to appropriate observability tools during the early stages of development, they could have identified and resolved the issue before it became problematic at deployment and runtime.

Shifting Left Observability in Internal Developer Platforms

Shifting observability left in IDPs involves embedding observability practices early in the software development and testing phases. By integrating these foundational components directly into the platform, developers gain immediate access to real-time insights. This allows them to diagnose issues, optimize performance, and validate configurations without waiting for deployment or production stages. The approach offers several advantages: reduced feedback loops, accelerated debugging, and enhanced iteration speed.

Early observability elevates IDPs beyond mere self-service infrastructure and platform tools; they become catalysts for developer autonomy and productivity. By empowering developers to monitor and troubleshoot their applications within their existing workflows, IDPs enable them to take ownership of their code’s functionality and reliability while maintaining focus on software development. This proactive approach identifies and resolves issues before they can disrupt the deployment process, streamlining the entire development lifecycle.

Prioritizing observability touchpoints in IDPs offers several additional benefits. It frees up platform engineering teams to focus on strategic initiatives, minimizes the risk of production incidents, and—when issues do arise—reduces the Mean Time to Resolution (MTTR). From a governance standpoint, early observability ensures compliance and security standards are met throughout the development lifecycle. Moreover, it optimizes resource allocation and usage across complex environments such as Kubernetes and hybrid cloud setups.

A key principle in building an effective IDP is productizing the platform—treating it with the same rigor and focus as any other application or service. Golden Path Templates help achieve this goal by standardizing workflows and governance practices. These templates are crucial elements of Internal Developer Platforms, offering pre-architected approaches for development and deployment. They guarantee adherence to best practices and provide consistency across the application portfolio. While Golden Paths should be optional, transparent, and extensible, it’s very important that they include observability practices as a core component. By embedding observability directly into Golden Paths, teams ensure an early approach to troubleshooting and monitoring application behavior from the outset.

Another important aspect of early observability in IDPs is its integration into the testing phase. By embedding observability practices within continuous testing environments and CI/CD pipelines, developers can enhance the testing process, gain deeper insights, and identify issues earlier in the SDLC. For example, long-running tests, quality KPIs, and performance metrics can be centrally monitored and analyzed, providing developers with real-time feedback as soon as they commit their code.
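To make this concrete, here is a minimal sketch of how a test harness might record per-test durations and outcomes so that long-running tests and a simple quality KPI can be surfaced right after a commit. It uses only the Python standard library; the `PipelineMetrics` class, its method names, and the threshold value are hypothetical and not part of any specific CI product.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineMetrics:
    """Collects durations and outcomes for tests run in a pipeline (hypothetical helper)."""
    slow_threshold_s: float = 1.0
    results: Dict[str, dict] = field(default_factory=dict)

    def run(self, name: str, test: Callable[[], None]) -> None:
        """Run one test callable, recording whether it passed and how long it took."""
        start = time.perf_counter()
        try:
            test()
            passed = True
        except AssertionError:
            passed = False
        duration = time.perf_counter() - start
        self.results[name] = {"passed": passed, "duration_s": duration}

    def slow_tests(self) -> List[str]:
        """Names of tests that ran longer than the configured threshold."""
        return sorted(
            name for name, r in self.results.items()
            if r["duration_s"] > self.slow_threshold_s
        )

    def pass_rate(self) -> float:
        """Fraction of recorded tests that passed (0.0 when nothing was run)."""
        if not self.results:
            return 0.0
        return sum(r["passed"] for r in self.results.values()) / len(self.results)
```

In a real platform these numbers would typically be exported to whatever metrics backend the IDP already uses, rather than kept in memory.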

Development and testing phases are the main areas where early observability can have a significant impact on the overall process. To effectively implement this, it is essential to focus on core components such as logs, metrics, and traces, which serve as the foundation—the “three pillars”—of observability. These components should be integrated into the platform’s DNA. Failing to make observability a first-class feature of IDPs can lead to fragmented visibility, siloed data, and manual intervention, undermining the core principles of DevOps and SRE. In other words, if making observability accessible and actionable as soon as possible is considered an afterthought, the platform will fail to deliver on its promise. Instead, the platform should be designed from the ground up with observability in mind. This is the real challenge of providing early observability in IDPs—it’s not just about tooling or technology; it’s about culture, mindset, and design.
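As a rough illustration of the three pillars working together, the sketch below emits a structured log line, increments a metric, and records a trace span for a single request, all tied together by one trace ID. It relies only on the Python standard library and keeps everything in memory; the function and variable names (`handle_request`, `span`, `metrics`, and so on) are assumptions made for this example, and a production setup would more likely use a dedicated instrumentation library.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: named counters
spans = []           # traces: finished spans, kept in memory for this sketch
log_lines = []       # logs: structured JSON records, kept in memory for this sketch


def log(level, message, **fields):
    """Append one structured log record as a JSON string."""
    record = {"level": level, "message": message, **fields}
    log_lines.append(json.dumps(record, sort_keys=True))


@contextmanager
def span(name, trace_id):
    """Time the enclosed block and record it as a span on the given trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(user_id):
    """Handle one request, emitting a log line, a metric, and a span."""
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id):
        metrics["requests_total"] += 1
        log("INFO", "request handled", trace_id=trace_id, user_id=user_id)
    return trace_id
```

Sharing the trace ID across all three signals is what lets a developer jump from a metric spike to the matching log lines and spans.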

What Makes Early Observability in IDPs Challenging?

In today’s complex and dynamic software landscape, companies adopting IDPs encounter numerous hurdles when implementing early observability practices. These challenges range from technical constraints to cultural barriers, and they can become significant roadblocks if not addressed comprehensively. Let’s explore five key challenges:

First and foremost, technical complexity poses a significant challenge: Multi-cloud and hybrid environments, diverse runtime configurations, and distributed systems increase the complexity of observability data collection and normalization. Configuration complexity, lack of automation, and non-integrated legacy tools also introduce overhead and more complexity in the implementation of observability practices.

Tooling fragmentation presents another significant challenge: Disconnected systems, lack of unified dashboards, and reliance on separate tools for logging, metrics, and tracing lead to fragmented data and insights.

Organizational resistance is a third challenge, and changing mindsets often proves to be the most formidable obstacle. Developers have traditionally viewed observability as an operational responsibility, leading to reluctance in adopting it during the early stages of development and platform construction.

Data overload is another common issue. Large-scale systems generate massive volumes of logs, metrics, and traces, making it challenging to filter noise and focus on actionable insights.

Last but not least, there are some security and compliance concerns. Data sensitivity, secure instrumentation, and compliance mechanisms are critical considerations when implementing observability practices in IDPs. Observability at the early stages of development can expose sensitive data and introduce vulnerabilities if not managed properly.
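Two of the challenges above, data overload and sensitive data, can both be mitigated at the point where records are emitted. The following is a minimal sketch, assuming Python's standard `logging` module: a filter that always keeps warnings and errors, keeps only one in every N lower-severity records, and masks email addresses and `key=value` secrets before anything is stored. The class name `QuietAndSafe`, the helper `build_logger`, and the regular expressions are illustrative assumptions, not a complete redaction policy.

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_RE = re.compile(r"(password|token|api_key)=\S+", re.IGNORECASE)


def redact(text):
    """Mask key=value secrets first, then email addresses."""
    text = SECRET_RE.sub(lambda m: m.group(1) + "=***", text)
    return EMAIL_RE.sub("<email>", text)


class QuietAndSafe(logging.Filter):
    """Sample low-severity records and redact every record that is kept."""

    def __init__(self, keep_every=10):
        super().__init__()
        self.keep_every = keep_every
        self._low_seen = 0

    def filter(self, record):
        if record.levelno < logging.WARNING:
            self._low_seen += 1
            # Keep the 1st, (N+1)th, (2N+1)th ... low-severity record.
            if (self._low_seen - 1) % self.keep_every != 0:
                return False
        # Format the message once, redact it, and drop the raw arguments.
        record.msg = redact(record.getMessage())
        record.args = None
        return True


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the sketch is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_logger(name, keep_every=10):
    """Create a logger whose only handler applies the filter above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(QuietAndSafe(keep_every))
    logger.addHandler(handler)
    return logger, handler
```

Counter-based sampling is used here instead of random sampling so the behavior is deterministic and easy to reason about; which records to drop is ultimately a policy decision for each team.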

Surviving these challenges requires a combination of technical expertise, cultural transformation, and strategic planning. But it’s not impossible. The road to early observability in IDPs can be long and winding, but the benefits are worth the effort. Spotify’s Backstage, for example, demonstrates how a well-implemented service catalog can streamline the journey to early observability in IDPs. By centralizing services and their metadata, Backstage provides developers with real-time insights and a unified view of their applications across all stages—but most importantly, the early stages.

While solutions exist, no one-size-fits-all approach can address the unique challenges of every organization. The key is to start small, iterate, and continuously improve. However, one thing is clear in the journey of implementing early observability: there are always unforeseen challenges, unpredictable debugging scenarios, intermittent or hard-to-reproduce issues, and unexpected situations that escape the radar of even the most experienced teams.

Traditional observability tools, even when introduced early in an IDP, remain static: they are designed primarily for operational environments and often fall short during development. This creates observability gaps, leaving developers without actionable, real-time insights when they are most needed. Furthermore, traditional workflows introduce iterative debugging overhead, requiring repeated deployment cycles to modify logs or gather metrics, which slows down development and troubleshooting. The nature of software engineering amplifies these challenges, with unpredictable debugging scenarios like performance bottlenecks or rare edge cases arising unexpectedly. This is the inner nature of software engineering: there’s no silver bullet. It’s all about using complementary tools and practices to build a resilient and efficient platform that can adapt to the ever-changing landscape of technology. This is, essentially, where the real value of dynamic developer observability lies.

Addressing IDP Challenges with Dynamic Developer Observability

Imagine a world where developers could instrument their applications on the fly, without redeploying or modifying code. Picture them tracing requests, monitoring performance, and analyzing logs in real-time, all from their development environments. Envision seamless collaboration—sharing insights and troubleshooting issues together—without switching between tools, environments, or contexts. What if they could access dynamic logs and traces from production or any other environment without redeployment? And imagine achieving all this while maintaining security, compliance, and performance standards.

This is the promise of dynamic developer observability. Rare edge cases, intermittent issues, and unpredictable scenarios would no longer be a nightmare; they would be opportunities for making the whole platform more reliable.

If we look at the bigger picture, one of the goals of operations and platform engineers is building reproducible platforms through standardization and automation. Everything should be predictable, repeatable, and deterministic. The goal of developers, on the other hand, is creating unique and innovative applications that solve complex real-world problems and adapt to changing requirements based on user feedback, business needs, and market trends. Asking developers to build and run their applications on a platform that is not designed for their specific, dynamic needs adds friction and frustration. In this context, traditional observability tools, even when implemented early in IDPs, will not be enough to bridge the gap between the two worlds. This is why dynamic developer observability is not just a nice-to-have feature; it’s a must-have capability that can compensate for the inherent differences between the operational and development worlds.

Lightrun bridges the gap by bringing dynamic developer observability directly into the heart of the early development process. By embedding its capabilities within Internal Developer Platforms and integrating seamlessly with developer workflows and Golden Paths, Lightrun empowers developers to monitor, debug, and optimize their applications without leaving their preferred tools (e.g., VSCode, IntelliJ). With dynamic logging and conditional instrumentation, developers can add logs or capture traces in real-time, targeted to specific scenarios, such as high traffic or critical user actions, without requiring code changes or redeployments. Using Lightrun snapshots, developers have access to safe virtual breakpoints that use the same familiar interface but without stopping execution. This feature can be used across various platforms including AWS, Azure, GCP, Kubernetes, Serverless, on-premises, and more.
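The idea of conditional, runtime-controlled logging can be illustrated with a small conceptual sketch. This is not Lightrun's API or implementation: a real dynamic observability tool attaches to running code without any hook in the source, whereas this simplified example assumes an explicit `points.hit(...)` call at the location of interest. The `DynamicLogPoints` class, the location string, and the `checkout` function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class DynamicLogPoints:
    """Registry of conditional log points that can change while the process runs."""

    def __init__(self) -> None:
        self._points: Dict[str, List[Tuple[Callable[[dict], bool], str]]] = {}
        self.output: List[str] = []

    def add(self, location: str, condition: Callable[[dict], bool], template: str) -> None:
        """Attach a log point to a location; it fires only when condition(variables) is true."""
        self._points.setdefault(location, []).append((condition, template))

    def clear(self, location: str) -> None:
        """Remove every log point attached to a location."""
        self._points.pop(location, None)

    def hit(self, location: str, **variables) -> None:
        """Called from application code; evaluates whatever log points are active right now."""
        for condition, template in self._points.get(location, []):
            if condition(variables):
                self.output.append(template.format(**variables))


points = DynamicLogPoints()


def checkout(cart_total: float, user_id: str) -> float:
    """Example business function with a single (hypothetical) hook location."""
    points.hit("checkout:start", cart_total=cart_total, user_id=user_id)
    return round(cart_total * 1.2, 2)
```

A log point added with `points.add("checkout:start", lambda v: v["cart_total"] > 1000, "large cart {cart_total} for {user_id}")` starts producing output on the next matching call and stops as soon as it is cleared, with no change to `checkout` itself and no redeployment.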

This integration helps developers tackle intermittent issues and edge cases that traditional observability tools often fail to address. Lightrun’s instant feedback loops allow developers to gain actionable insights within seconds. As a result, Mean Time to Resolution (MTTR) is significantly reduced, thanks to the ability to diagnose and fix issues without waiting for a new deployment cycle.
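For readers who want to track this metric themselves, MTTR is simply the average time from detection to resolution across incidents, and an improvement is the relative drop between two periods. The sketch below shows that arithmetic; the function names are assumptions, and the incident durations in the usage note are made-up illustrative values, not data from any real team.

```python
from statistics import mean
from typing import Sequence


def mttr_hours(resolution_hours: Sequence[float]) -> float:
    """Mean Time to Resolution: average resolution time across incidents, in hours."""
    if not resolution_hours:
        raise ValueError("at least one incident is required")
    return mean(resolution_hours)


def improvement_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100
```

With illustrative inputs, three incidents that took 48, 24, and 36 hours give an MTTR of 36 hours; if a later period shows 6, 12, and 9 hours, the MTTR is 9 hours, which corresponds to a 75% improvement.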

By facilitating collaborative debugging, Lightrun empowers developers, platform engineers, and operations teams to work together on live issues using shared, real-time observability tools. When combined with VSCode Live Share, this collaboration is elevated, allowing peers to troubleshoot in real time without syncing code or replicating environments. Developers can use Lightrun to add dynamic logs, metrics, and snapshots to live applications while sharing the insights via Live Share or through integrations like Jira, Slack, or read-only links.

Using Lightrun, companies can implement early observability in IDPs, resulting in reduced MTTR. For example, the OurCrowd development team faced challenges with long resolution times for production incidents before implementing Lightrun. With Lightrun, developers were able to add dynamic logs into their production environments without redeploying apps, reducing time to resolution from days to hours. This resulted in a 70% improvement in MTTR and saved developers hours each month.

Lightrun’s industry-first Runtime Autonomous Debugger, powered by GenAI, is the icing on the cake. Despite significant investments in GenAI for coding and testing in IDEs, runtime debugging has lagged behind. Lightrun addressed this gap by integrating autonomous debugging into the developer workflow, from coding to production. This integration provides a comprehensive GenAI experience within the IDE, to improve code quality and reduce MTTR through safe AI-driven transformations. Indeed, Lightrun’s AI assistant IDE plugin analyzes the relevant ticket containing the runtime issue or bug, suggests a potential root cause, and recommends appropriate dynamic logs or snapshots.

What’s Next?

Early observability in Internal Developer Platforms presents both challenges and opportunities for modern software development teams. While implementing observability practices early in the development lifecycle can significantly improve application quality and reduce time to resolution, it also requires overcoming technical complexities, organizational resistance, and data management hurdles. Dynamic developer observability tools like Lightrun offer innovative solutions to these challenges, enabling dynamic, real-time, and platform-agnostic instrumentation, collaborative debugging, and AI-assisted troubleshooting. By integrating these capabilities into IDPs, Golden Paths, and developer workflows, organizations can bridge the gap between dev and ops and create a culture of proactive problem-solving and continuous improvement.

Ready to boost developer productivity and overcome the IDP observability challenges? Take the next step with Lightrun:

Get a Lightrun demo and learn how to easily debug microservices, Kubernetes, Docker Swarm, ECS, Big Data, serverless, and more: https://lightrun.com/get-a-lightrun-demo/
Explore the Lightrun Playground – experiment with a live app without any configuration: https://playground.lightrun.com/
