Inferring spans representing method executions based on a statistical profiler
Motivation
Our agents mostly capture I/O events like incoming and outgoing HTTP requests. So when most of the time is spent there, it's easy to troubleshoot latency issues. However, if the application is slow because of inefficient code, users have to manually instrument the application. In some cases that's not feasible, for example, if the code base is huge and it's unclear which methods cause the slowdown.
The trace_methods configuration option is often used to match a large portion, if not all, of the methods within a codebase. The documentation warns that this can significantly increase the overhead and degrade the application's performance.
The slowdown is inherent to the way trace_methods works. All methods that match the trace_methods expression are instrumented so that a timer is started at the start and stopped at the end of the method. If the execution time was significant enough (trace_methods_duration_threshold), a span is created. When frequently executed methods are instrumented, adding just a little overhead to each invocation can significantly slow down the application. Instrumenting a method can also hinder optimizations the JIT could normally do, like inlining.
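To make the per-invocation cost concrete, here is a minimal sketch of what timer-based instrumentation amounts to. It is not the agent's actual instrumentation code: the class name, the `traced` helper, and the 5 ms threshold are assumptions chosen for illustration, with the threshold standing in for trace_methods_duration_threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of timer-based method tracing; not agent code.
final class TimedMethodSketch {
    // Assumed stand-in for trace_methods_duration_threshold, in nanoseconds.
    static final long THRESHOLD_NANOS = 5_000_000L; // 5 ms

    // Names of the spans that were "created" because a call was slow enough.
    static final List<String> SPANS = new ArrayList<>();

    // A span is only created when the measured time reaches the threshold.
    static boolean shouldCreateSpan(long elapsedNanos) {
        return elapsedNanos >= THRESHOLD_NANOS;
    }

    // Behaves like an instrumented method: the timer runs on EVERY call,
    // even though most calls never produce a span.
    static <T> T traced(String name, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            if (shouldCreateSpan(System.nanoTime() - start)) {
                SPANS.add(name);
            }
        }
    }

    static int slow() {
        try {
            Thread.sleep(20); // simulate a slow method
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 7;
    }

    public static void main(String[] args) {
        traced("fast", () -> 42);
        traced("slow", TimedMethodSketch::slow);
        System.out.println("spans created for: " + SPANS);
    }
}
```

The two `System.nanoTime()` calls and the comparison are paid by every matched invocation, which is why matching hot methods is expensive even when no span is ever created for them.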
The new approach
This issue is about adding an alternative to trace_methods which does not require instrumenting any methods. Instead, a sampling (aka statistical) profiler would be used as a foundation. These profilers work by gathering the stack traces of the application at frequent intervals, for example every 20ms.
By correlating the stack traces with which span was active on which thread at what time, we can build a call tree from the stack traces, correlate it to a span, and create spans for it. As far as the UI is concerned, those spans look just like regular spans, so no changes are required in the UI and the APM Server. At a later stage, we could make the UI aware of the profiler-inferred spans and display them with a special icon or color.
The tradeoffs
The duration of the spans won't be as accurate, as we're not measuring the execution time exactly but rather estimating it based on the number of consecutive stack traces in which a method has been present.
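The following is a rough sketch of how spans and their estimated durations could be derived from consecutive stack traces. It is not the agent's implementation: it assumes the samples have already been narrowed down to one thread while a single transaction was active, that stacks are listed root frame first, and that the sampling interval is 20ms. All class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of inferring spans from periodic stack samples.
final class InferredSpansSketch {
    static final long INTERVAL_MS = 20; // assumed sampling interval

    static final class InferredSpan {
        final String method;
        final int depth;    // 0 = root frame
        final long startMs; // time of the first sample containing the frame
        int samples;        // consecutive samples in which the frame was seen

        InferredSpan(String method, int depth, long startMs) {
            this.method = method;
            this.depth = depth;
            this.startMs = startMs;
        }

        // The duration is an estimate: sample count times sampling interval.
        long estimatedDurationMs() {
            return samples * INTERVAL_MS;
        }
    }

    // Each sample is one stack trace, root frame first.
    static List<InferredSpan> infer(List<List<String>> samples) {
        List<InferredSpan> result = new ArrayList<>();
        List<InferredSpan> open = new ArrayList<>(); // index == stack depth
        for (int i = 0; i < samples.size(); i++) {
            List<String> stack = samples.get(i);
            // Frames shared with the previous sample are still executing.
            int common = 0;
            while (common < open.size() && common < stack.size()
                    && open.get(common).method.equals(stack.get(common))) {
                common++;
            }
            // Frames past the common prefix have returned: close them.
            open.subList(common, open.size()).clear();
            // Frames that are new in this sample start new spans.
            for (int depth = common; depth < stack.size(); depth++) {
                InferredSpan span = new InferredSpan(stack.get(depth), depth, i * INTERVAL_MS);
                open.add(span);
                result.add(span);
            }
            for (InferredSpan span : open) {
                span.samples++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> samples = Arrays.asList(
                Arrays.asList("handle", "parse"),
                Arrays.asList("handle", "parse"),
                Arrays.asList("handle", "query"),
                Arrays.asList("handle", "query"),
                Arrays.asList("handle", "query"),
                Arrays.asList("handle"));
        for (InferredSpan span : infer(samples)) {
            System.out.println(span.method + " depth=" + span.depth
                    + " start=" + span.startMs + "ms duration~" + span.estimatedDurationMs() + "ms");
        }
    }
}
```

The sketch also shows where the inaccuracy comes from: a method that runs for less than one interval may never appear in a sample, and two separate invocations of the same method seen in adjacent samples are indistinguishable from one long invocation.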
To reduce the overhead, the profiler won’t be active all the time. Instead, it’s active for the first 10 seconds of every minute, by default. Only transactions that happen within a profiling session will have profiler-inferred spans.
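As a small illustration of that schedule, the check below decides whether a given point in time falls inside a profiling session. It is only a sketch: the constant names are invented, and measuring time from an arbitrary start point (rather than from the wall-clock minute) is an assumption.

```java
// Hypothetical sketch of the "10 seconds out of every minute" schedule.
final class ProfilingScheduleSketch {
    static final long INTERVAL_MS = 60_000; // assumed: one session per minute
    static final long DURATION_MS = 10_000; // assumed: each session lasts 10 s

    // True while the elapsed time falls into the first 10 s of a 60 s window.
    static boolean isProfilingActive(long millisSinceStart) {
        return millisSinceStart % INTERVAL_MS < DURATION_MS;
    }

    public static void main(String[] args) {
        for (long t = 0; t <= 70_000; t += 5_000) {
            System.out.println(t + "ms -> profiling active: " + isProfilingActive(t));
        }
    }
}
```

A transaction that starts and ends while this check returns false would get no profiler-inferred spans.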
Q&A
- Will trace_methods be removed? There are no plans to remove that option as it can still be useful in combination with, or instead of, profiler-inferred spans.
- How can I try this out? Use a snapshot from https://github.com/elastic/apm-agent-java/pull/972 or build the https://github.com/felixbarny/apm-agent-java/tree/inferred-spans branch.
- How does this relate to https://github.com/elastic/apm/issues/121?
  - The CPU profiling proposal is about having a macro-level view of what the service is doing. This is great to find out about the hotspots of an application. Optimizing those can have a big overall effect on the application. However, it's less useful to troubleshoot latency if the application is mostly idle waiting for I/O.
  - This issue is about creating regular spans for long executing methods. It's mostly used to troubleshoot latency for a specific instance of a transaction.
  - There's no intention to deprecate one in favor of the other. Both are very useful tools to have in your tool belt to optimize an application.
  - Both can use the same underlying sampling profiler. For the CPU profiling, the stack trace is only considered for threads in a RUNNABLE thread state. Also, for CPU profiling, there will be only one flattened data structure, representing a flame graph for the whole profiling session.
- Which settings are there to configure the profiler? See the docs preview
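To illustrate the CPU profiling side of the comparison above (RUNNABLE threads only, one flattened structure per session), here is a sketch built on the standard ThreadMXBean API. It is not taken from either proposal: the class name and the choice of a semicolon-joined stack string as the map key (the common "folded stacks" input format for flame graphs) are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one flat map of stack -> sample count per session.
final class RunnableSamplerSketch {
    // Takes one sample of all threads and counts the stacks of RUNNABLE ones.
    static void sampleInto(Map<String, Integer> counts) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.RUNNABLE) {
                continue; // waiting or blocked threads are not using the CPU
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) {
                continue;
            }
            // Build the key root frame first, frames separated by ';'.
            StringBuilder key = new StringBuilder();
            for (int i = stack.length - 1; i >= 0; i--) {
                if (key.length() > 0) {
                    key.append(';');
                }
                key.append(stack[i].getClassName()).append('.').append(stack[i].getMethodName());
            }
            counts.merge(key.toString(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < 5; i++) {
            sampleInto(counts);
        }
        counts.forEach((stack, n) -> System.out.println(n + " " + stack));
    }
}
```

Unlike the inferred-spans approach, nothing here is tied to a transaction or a point in time: samples from the whole session are merged into a single aggregate.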
Issue Analytics
- Created 4 years ago
- Reactions:4
- Comments:23 (16 by maintainers)
Top GitHub Comments
Update: I was able to integrate async-profiler with the agent 🎉
Async-profiler has much lower overhead than ThreadMXBean#getThreadInfo and does not rely on safepoints.
The biggest downside is that async-profiler does not work on Windows. I think it still makes sense to only support async-profiler and not have a fallback to ThreadMXBean#getThreadInfo on Windows. The time to reach a safepoint (which means a stop-the-world pause for the application) is quite unpredictable and can regularly be as high as 5ms.
How it works in a nutshell:
TODOs
Gotchas
Sure!
Here’s a snapshot build with async-profiler: https://github.com/elastic/apm-agent-java/pull/983#issuecomment-573329916