Performance Issue / Ktor & Netty

Ktor Version

1.2.0

Ktor Engine Used (client or server and name)

Netty Server

JVM Version, Operating System and Relevant Context

11.0.2, Debian (docker)

4 Cores -Xmx2G (parallelism = 4)

Default settings

Feedback

In most cases performance is pretty good: we get < 50 ms including the DB save (all done asynchronously). Unfortunately, from time to time we get requests that are stuck for > 1 s, and in the worst cases even > 20 seconds. The server has received the request, but the processing route has not been invoked yet. Below is part of the NewRelic instrumentation for one such slow call. The problem is that Ktor is not yet well instrumented, and the only consecutive calls I see taking that time are these:

0 | 0.00% | Truncated: NettyUpstreamDispatcher |   | 0.000  s
-- | -- | -- | -- | --
0 | 0.00% | HttpServerExpectContinueHandler.channelRead() | Async | 0.000  s
0 | 0.00% | HttpServerExpectContinueHandler.channelRead() | Async | 0.000  s
0 | 0.00% | RequestBodyHandler.channelRead() | Async | 0.000  s
16.0 | 0.08% | NettyApplicationCallHandler.channelRead() | Async | 20.025  s
16.0 | 0.08% | NewRelicFeature.wrapIntoNewRelicTransaction() |   | 20.025  s
16.0 | 0.08% | com.revolut.eventstore.api.write.EventsControllerKt/saveEvent |   | 20.025  s
1.0 | 0.00% | Application code (in com.*.api.write.EventsControllerKt/saveEvent)

What I am trying to understand is - what could happen between:

0 | 0.00% | RequestBodyHandler.channelRead() | Async | 0.000  s
-- | -- | -- | -- | --
16.0 | 0.08% | NettyApplicationCallHandler.channelRead() | Async | 20.025  s

Why did it take so long? It is pretty hard to tell where any blocking or under-resourced part may be, so I would really appreciate help with it.

Issue Analytics

State:
Created 4 years ago
Comments:13 (2 by maintainers)

Top GitHub Comments

Hc747 commented, Jun 12, 2019

Finally, after setting shareWorkGroup = true, I got:
RequestBodyHandler.channelRead() | Async | 0.000 s (timestamp)
Parameters: Async context = nettyWorkerPool-3-16
// -- | -GAP- | --
NettyApplicationCallHandler.channelRead() | Async | 3.795 s (timestamp)
Parameters: Async context = nettyWorkerPool-3-17
This means the delay is somewhere in the selection process between the two handlers above. I am looking for ideas on how to deal with it 🤔

A potential improvement may be to switch from select to EPoll or KQueue, but support for that has to be added to Ktor.

See #1124 - should be implemented soon!
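For readers who want to try the same setting, here is a minimal sketch of a Netty server with shareWorkGroup = true. It assumes the Ktor 1.x package names; the port and the "/" route are placeholders for illustration and are not taken from the issue.

```kotlin
import io.ktor.application.call
import io.ktor.response.respondText
import io.ktor.routing.get
import io.ktor.routing.routing
import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty

fun main() {
    // Port 8080 is an arbitrary choice for this sketch.
    embeddedServer(Netty, port = 8080, configure = {
        // Use one shared event loop group for worker and call processing,
        // as tried in the comment above.
        shareWorkGroup = true
    }) {
        routing {
            // Placeholder route, only here so the server has something to serve.
            get("/") { call.respondText("ok") }
        }
    }.start(wait = true)
}
```

Whether this helps depends on the workload; in the comment above it made the gap visible in the trace rather than removing it.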

AsiaMa commented, Jul 1, 2021

It seems the issue is mostly related to the sizes of the groups. Do you have any recommendation for those three settings? Since most of my code is asynchronous, I started with the recommended settings:
embeddedServer(AnyEngine, configure = {
    connectionGroupSize = parallelism / 2 + 1
    workerGroupSize = parallelism / 2 + 1
    callGroupSize = parallelism
})
This doesn’t work for me: CPU utilization is pretty low with it and I hit the issue above. Only when I manually increase at least callGroupSize to 16 does it start to improve.

To summarise: I now have 4 cores (parallelism = 4) and 2 GB of RAM, I am using Netty, and I get much better results with:
 connectionGroupSize = parallelism / 2 + 1 = 3 // default
 workerGroupSize = parallelism / 2 + 1 = 3 // default
 callGroupSize = 16 // manually set
I would need to analyse your code to understand the best ratio between those three. Again, I would repeat that your default settings seem not ideal, at least for Netty.
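The arithmetic behind those numbers can be written out as a small helper. This is only a sketch of the formulas quoted in this comment; GroupSizes and recommendedGroupSizes are hypothetical names, not part of Ktor.

```kotlin
// Hypothetical holder for the three sizes discussed above (not a Ktor type).
data class GroupSizes(val connection: Int, val worker: Int, val call: Int)

// Applies the quoted formulas. Integer division rounds down,
// so parallelism = 4 gives 4 / 2 + 1 = 3 for the first two groups.
fun recommendedGroupSizes(parallelism: Int): GroupSizes =
    GroupSizes(
        connection = parallelism / 2 + 1,
        worker = parallelism / 2 + 1,
        call = parallelism
    )
```

For parallelism = 4 this gives a call group of 4 threads, so the manually chosen callGroupSize = 16 is four times the formula's value.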

Your code may contain blocking calls.
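One way blocking code could produce a gap like the one in the trace is by occupying every thread of a small pool, so that the next task waits in the queue. The sketch below illustrates that effect with a plain java.util.concurrent pool. It is an analogy under that assumption, not Ktor's or Netty's actual scheduling code, and queueDelayMillis is a hypothetical helper name.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Measures how long (in ms) a trivial task waits in the queue of a fixed-size pool
// while `blockers` other tasks each hold a worker thread for `blockMillis`.
fun queueDelayMillis(poolSize: Int, blockers: Int, blockMillis: Long): Long {
    val pool = Executors.newFixedThreadPool(poolSize)
    try {
        // Stand-ins for request handlers that make a blocking call.
        repeat(blockers) {
            pool.submit(Runnable { Thread.sleep(blockMillis) })
        }
        val submittedAt = System.nanoTime()
        // The trivial task only records the moment a worker actually starts it.
        val startedAt = pool.submit(Callable { System.nanoTime() }).get()
        return TimeUnit.NANOSECONDS.toMillis(startedAt - submittedAt)
    } finally {
        // Interrupts any blockers that are still sleeping and releases the threads.
        pool.shutdownNow()
    }
}
```

With as many blockers as threads, for example queueDelayMillis(2, 2, 300), the trivial task has to wait roughly as long as the blocking call lasts. With one spare thread, queueDelayMillis(3, 2, 300), it starts almost immediately. That would be consistent with a larger callGroupSize reducing the stalls without removing their cause.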