Kafka Ingestion Peon Tasks Success But Overlord Shows Failure
Apologies if this breaks any rules, but I tried on the Druid forums without much success, so I'm trying here to see if I can reach a different audience. Relevant information is below, with more details in the Druid forum post.
- Druid Version: 0.22.1, 0.23.0
- Kafka Ingestion (idempotent producer) - HOURLY
- Overlord type: remote
https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374
In general, when we run all of our tasks, we start seeing issues between the Overlord and the MM/Peons. Often the Peon reports that the task succeeded, but the Overlord believes it failed and tries to shut it down. The Overlord also becomes sluggish and takes a long time to recognize completed tasks and tasks that are trying to start, which points to a communication/coordination failure between the Overlord and the MM/Peons. We even see task assignment timeouts between the Overlord and the MMs, despite having raised the timeout to PT10M (the default is PT5M).
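For reference, this is roughly how we cross-check the Overlord's view of a task against what the Peon logged. A minimal sketch using the Overlord task status APIs; the Overlord address is a placeholder for our deployment:

```python
# Minimal sketch: list recently completed tasks from the Overlord and flag the
# ones it recorded as FAILED, so they can be compared against Peon logs that
# reported SUCCESS. The Overlord address below is a placeholder.
import requests

OVERLORD = "http://overlord:8090"  # assumption: adjust to your Overlord host/port

def failed_tasks():
    # GET /druid/indexer/v1/completeTasks returns recently completed tasks
    # along with the status the Overlord recorded for each of them.
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/completeTasks", timeout=30)
    resp.raise_for_status()
    return [t for t in resp.json() if t.get("statusCode") == "FAILED"]

def task_status(task_id):
    # GET /druid/indexer/v1/task/{taskId}/status returns the Overlord's view of
    # a single task, which is what disagrees with the Peon in our case.
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=30)
    resp.raise_for_status()
    return resp.json()["status"]

if __name__ == "__main__":
    for task in failed_tasks():
        print(task["id"], task_status(task["id"]))
```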
The only thing that seems to help is reducing the number of concurrently running tasks by suspending certain supervisors, which also suggests the three Druid services are struggling with the load of our current ingestion. According to system metrics, however, resource usage is not hitting any limits and there is still compute headroom. It's odd, since there are surely plenty of users ingesting more data per hour than we are, and we don't see this type of issue in their discussions/white papers.
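By "suspending certain supervisors" we mean calling the Overlord's supervisor suspend endpoint for datasources we can afford to pause. A minimal sketch, with the Overlord address and supervisor IDs as placeholders:

```python
# Minimal sketch: suspend a set of Kafka supervisors via the Overlord API to
# reduce the number of concurrently running ingestion tasks. The Overlord
# address and supervisor IDs below are placeholders.
import requests

OVERLORD = "http://overlord:8090"            # assumption: adjust to your deployment
SUPERVISORS_TO_SUSPEND = ["metrics-hourly"]  # hypothetical supervisor IDs

def suspend(supervisor_id):
    # POST /druid/indexer/v1/supervisor/{id}/suspend stops the supervisor's
    # ingestion tasks without deleting the supervisor spec.
    resp = requests.post(
        f"{OVERLORD}/druid/indexer/v1/supervisor/{supervisor_id}/suspend",
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for sup in SUPERVISORS_TO_SUSPEND:
    print(suspend(sup))
```

The matching `/resume` endpoint brings a supervisor back once the load subsides.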
Any help will definitely be appreciated.
Issue Analytics
- Created a year ago
- Comments: 44 (13 by maintainers)
You should also run sjk for ~5 minutes when the peon is not responding; loading a flame graph with 1 hour of data is very slow.
Yes, let's use the standard 0.23.0 build. BTW, we would need the flame graphs from the Overlord and from the tasks that are not responding to the pause request.