
Kafka Ingestion Peon Tasks Success But Overlord Shows Failure

See original GitHub issue

Apologies if this breaks any rules, but I tried the Druid forums without much success, so I'm trying here to see if I can reach a different audience. The relevant information is below, with more details in the Druid forum post.

  • Druid Version: 0.22.1, 0.23.0
  • Kafka Ingestion (idempotent producer) - HOURLY
  • Overlord type: remote

https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374

In general, when we run all of our tasks we start seeing issues between the Overlord and the MiddleManagers/Peons. Often the Peon will report that the task succeeded, but the Overlord believes it failed and tries to shut it down. The Overlord also becomes sluggish and takes a long time to recognize completed tasks and tasks that are trying to start, which seems to point to a communication/coordination failure between the Overlord and the MMs/Peons. We even see TaskAssignment timeouts between the Overlord and the MMs, despite raising the timeout to PT10M (the default is PT5M).
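
For context, the assignment timeout mentioned above is an Overlord-side setting. A minimal runtime.properties sketch, assuming the remote task runner (the property name and its PT5M default come from the Druid configuration reference; PT10M is the value described in the post):

    # Overlord runtime.properties (sketch)
    druid.indexer.runner.type=remote
    # How long the Overlord waits after assigning a task to a MiddleManager
    # before treating the assignment as failed (default: PT5M)
    druid.indexer.runner.taskAssignmentTimeout=PT10M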

The only thing that seems to help is reducing the number of tasks running concurrently by suspending certain supervisors, which also suggests the three Druid services are struggling with the load of our current ingestion. Yet according to system metrics, resource usage is not hitting any limits and there is still compute headroom. It’s odd, since there are surely many users ingesting more data per hour than we do, and we don’t see this type of issue in their discussions or white papers.
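
For reference, suspending (and later resuming) a supervisor goes through the Overlord’s supervisor API. A minimal sketch, assuming an Overlord listening on its default port 8090 and a hypothetical supervisor ID my_datasource:

    # Suspend the supervisor so it stops spawning indexing tasks for this datasource
    curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/my_datasource/suspend

    # Resume it once the cluster has caught up
    curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/my_datasource/resume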

Any help will definitely be appreciated.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 44 (13 by maintainers)

Top GitHub Comments

1 reaction
abhishekagarwal87 commented, Jul 20, 2022

You should also run sjk for only ~5 minutes while the Peon is not responding; loading a flame graph built from an hour of data is very slow.
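
For anyone following along: sjk here is the Swiss Java Knife (gridkit/jvm-tools). A rough sketch of one way to capture a short stack-sample dump and render it as a flame graph; the stcap flags are standard, while the flame-graph step varies between sjk versions, so treat its exact flags as an assumption:

    # Attach to the unresponsive Peon JVM and sample its thread stacks;
    # let this run for roughly 5 minutes, then stop it with Ctrl+C.
    java -jar sjk.jar stcap -p <peon-pid> -o peon-dump.std

    # Render the captured samples as a flame graph (check your sjk
    # version's help output for the exact subcommand and flags)
    java -jar sjk.jar ssa -f peon-dump.std --flame > peon-flame.html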

1 reaction
abhishekagarwal87 commented, Jul 20, 2022

Yes, let’s use the standard 0.23.0 build. By the way, we would need the flame graphs from the Overlord and from the tasks that are not responding to the pause request.


Top Results From Across the Web

  • Kafka Ingestion Peon Tasks Success But Overlord Shows ...
    We recently started having intermittent problems with Kafka tasks failing but seems irregular because the Peon tasks logs shows “SUCCESS” ...
  • [GitHub] [druid] pchang388 commented on issue #12701: Kafka ...
    ... #12701: Kafka Ingestion Peon Tasks Success But Overlord Shows Failure ... Since the Peon seems to be unable to pause in a...
  • Task killed by Overlord because it is not responding to Pause
    Tasks are being killed by overlord because peon is not responding ... Kafka Ingestion Peon Tasks Success But Overlord Shows Failure #12701.
  • Kafka Indexing Service error occurs: “Cannot use existing ...
    We have run Kafka Indexing Service on Druid for nearly a month, and it ran basically well. But a few days ago all...
  • Solved: Failing to Submit Index Task to Druid's Overlord v...
    After I store the USGS data into a local file and submit an ingestion spec referring to the local file to Druid Overlord,...
