Kafka Ingestion Peon Tasks Success But Overlord Shows Failure
Apologies if this breaks any rules, but I tried on the Druid forums without much success, so I'm trying here to see if I can reach a different audience. Relevant information is below, with more details in the Druid forum post.
- Druid Version: 0.22.1, 0.23.0
- Kafka Ingestion (idempotent producer) - HOURLY
- Overlord type: remote
https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374
In general, when we run all of our tasks, we start seeing issues between the Overlord and the MM/Peons. Often the Peon reports that the task succeeded, but the Overlord believes it failed and tries to shut it down. The Overlord also becomes sluggish and takes a long time to recognize completed tasks and tasks that are trying to start, which points to a communication/coordination failure between the Overlord and the MM/Peons. We even see task assignment timeouts between the Overlord and the MMs, despite having raised the timeout to PT10M (the default is PT5M).
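For reference, this is roughly how we cross-check the Overlord's view of a task against what the Peon logged. A minimal sketch using the Overlord task status APIs; the Overlord address is a placeholder for our deployment:

```python
# Minimal sketch: list recently completed tasks from the Overlord and flag the
# ones it recorded as FAILED, so they can be compared against Peon logs that
# reported SUCCESS. The Overlord address below is a placeholder.
import requests

OVERLORD = "http://overlord:8090"  # assumption: adjust to your Overlord host/port

def failed_tasks():
    # GET /druid/indexer/v1/completeTasks returns recently completed tasks
    # along with the status the Overlord recorded for each of them.
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/completeTasks", timeout=30)
    resp.raise_for_status()
    return [t for t in resp.json() if t.get("statusCode") == "FAILED"]

def task_status(task_id):
    # GET /druid/indexer/v1/task/{taskId}/status returns the Overlord's view of
    # a single task, which is what disagrees with the Peon in our case.
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=30)
    resp.raise_for_status()
    return resp.json()["status"]

if __name__ == "__main__":
    for task in failed_tasks():
        print(task["id"], task_status(task["id"]))
```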
The only thing that seems to help is reducing the number of concurrently running tasks by suspending certain supervisors, which also suggests the three Druid services are struggling with the load of our current ingestion. According to system metrics, however, resource usage is not hitting any limits and there is still compute headroom. It's odd, since there are surely plenty of users ingesting more data per hour than we are, and we don't see this type of issue in their discussions/white papers.
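By "suspending certain supervisors" we mean calling the Overlord's supervisor suspend endpoint for datasources we can afford to pause. A minimal sketch, with the Overlord address and supervisor IDs as placeholders:

```python
# Minimal sketch: suspend a set of Kafka supervisors via the Overlord API to
# reduce the number of concurrently running ingestion tasks. The Overlord
# address and supervisor IDs below are placeholders.
import requests

OVERLORD = "http://overlord:8090"            # assumption: adjust to your deployment
SUPERVISORS_TO_SUSPEND = ["metrics-hourly"]  # hypothetical supervisor IDs

def suspend(supervisor_id):
    # POST /druid/indexer/v1/supervisor/{id}/suspend stops the supervisor's
    # ingestion tasks without deleting the supervisor spec.
    resp = requests.post(
        f"{OVERLORD}/druid/indexer/v1/supervisor/{supervisor_id}/suspend",
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for sup in SUPERVISORS_TO_SUSPEND:
    print(suspend(sup))
```

The matching `/resume` endpoint brings a supervisor back once the load subsides.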
Any help will definitely be appreciated.
Issue Analytics
- Created a year ago
- Comments: 44 (13 by maintainers)
You should also run sjk for ~5 minutes when the peon is not responding; loading a flame graph with 1 hour of data is very slow.
Yes, let's use the standard 0.23.0 build. BTW, we would need the flame graphs from the Overlord and from the tasks that are not responding to the pause request.