100% CPU usage when working with a high number of workspaces
We noticed a flaw in the Java SDK: when interacting with a large number of workspaces, the rate-limiting algorithm consumes 100% CPU. After profiling our server, we found that 96% of the CPU time is spent in BaseMemoryMetricsDataStore$MaintenanceJob.run()
, particularly in the methods updateCurrentQueueSize
and updateNumberOfLastMinuteRequests
. I believe this is because our app deals with a large number of workspaces (~8,000) and the maintenance job runs every 50 ms.
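To put those numbers in perspective: a maintenance pass every 50 ms means the metrics are rescanned 20 times per second. Assuming each pass touches every workspace's metrics (which the profile above suggests), a back-of-the-envelope sketch of the scan volume looks like this (pure arithmetic, no SDK code):

```java
public class MaintenanceLoadEstimate {
    /** Maintenance passes per second for a given job interval. */
    static long passesPerSecond(long intervalMillis) {
        return 1000L / intervalMillis;
    }

    /** Per-second workspace scans, assuming every pass walks every workspace. */
    static long scansPerSecond(long workspaces, long intervalMillis) {
        return workspaces * passesPerSecond(intervalMillis);
    }

    public static void main(String[] args) {
        // ~8,000 workspaces with a 50 ms interval, as reported above
        System.out.println(scansPerSecond(8_000, 50));    // 160000 scans/sec
        // A 1,000 ms interval cuts this by a factor of 20
        System.out.println(scansPerSecond(8_000, 1_000)); // 8000 scans/sec
    }
}
```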
As a temporary solution, we are considering turning off stats on MethodsConfig
.
Are there any other suggestions to mitigate or fix the issue?
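For reference, the temporary mitigation mentioned above might look like the following sketch. This is a configuration fragment under assumptions: the setter names (getMethodsConfig, setStatsEnabled) are taken from the SDK's usual Lombok-style accessors and should be verified against your SDK version.

```java
import com.slack.api.Slack;
import com.slack.api.SlackConfig;

public class DisableStatsExample {
    public static void main(String[] args) {
        SlackConfig config = new SlackConfig();
        // Turn off the in-memory metrics collection that the
        // maintenance job maintains for every workspace.
        config.getMethodsConfig().setStatsEnabled(false);
        Slack slack = Slack.getInstance(config);
        // ... use slack.methods(token) as usual
    }
}
```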
The Slack SDK version
1.18.0
Java Runtime version
(Paste the output of java -version)
OS info
Linux
Steps to reproduce:
Create an app that unfurls links from ~8,000 workspaces.
Expected result:
Reasonable CPU usage.
Actual result:
CPU usage climbs progressively, reaching 100% after running for about 16 hours, and stays there until the server restarts.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (6 by maintainers)
Hey Kazuhiro,
Thanks so much for the fix. We’ve been running it in production for a week and the CPU is sitting around 10% now. The CPU still increases linearly, but the slope is a lot gentler. We haven’t reenabled stats yet though. I’ll let you know if reenabling stats leads to any unexpected behaviour.
Hi @sidneyamani, I’ve merged the PR #934 and released a new version, v1.19.0, to the Maven Central repository.
I hope the version works well for your app. Also, if the default configuration isn’t a good fit for you, you can adjust the behavior by:
- setting rateLimiterBackgroundJobIntervalMillis
to a value longer than 1,000 milliseconds (the default)
- passing statsEnabled: false
to either SlackConfig
or MethodsConfig

Refer to the release notes for more details: https://github.com/slackapi/java-slack-sdk/releases/tag/v1.19.0
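Applying both of the options listed above might look like the following sketch. This is a configuration fragment under assumptions: the exact setter names (setRateLimiterBackgroundJobIntervalMillis, setStatsEnabled) are inferred from the SDK's configuration style and should be checked against the v1.19.0 release notes.

```java
import com.slack.api.Slack;
import com.slack.api.SlackConfig;

public class RateLimiterTuningExample {
    public static void main(String[] args) {
        SlackConfig config = new SlackConfig();
        // Option 1: run the rate limiter's background maintenance job
        // less frequently than the 1,000 ms default.
        config.setRateLimiterBackgroundJobIntervalMillis(5_000L);
        // Option 2: disable stats entirely; the app then handles
        // rate-limited (HTTP 429) responses itself.
        config.setStatsEnabled(false);
        Slack slack = Slack.getInstance(config);
    }
}
```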
The first option won’t cause any big problems; the only downside is that the SDK’s rate limiter may be more conservative about the intervals between identical API calls. As for the second option, your app becomes responsible for handling rate-limited error patterns itself.
Thanks again for reporting this issue. I hope the fix I applied this time helps.