Overlords fail to elect any leaders preventing ingestion tasks from beginning
Affected Version
0.22.1
Description
We have 2 coordinators acting as overlords. With no apparent cause, neither overlord was being elected leader. We performed a rolling restart of every Druid service and of our ZooKeeper nodes, but nothing changed. We observed this log message:
listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] version[2022-03-19T01:36:35.743Z] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}
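To track down which lock the overlord is failing to reacquire, the interval, version, and task ID can be pulled out of the exception message. The following is a minimal sketch, assuming the message keeps the wording shown above; the function name parse_reacquire_failure and the regular expression are illustrative and not part of Druid.

```python
import re

# Pattern based on the wording of the exception message quoted above.
# Assumption: other Druid versions may phrase this message differently.
REACQUIRE_RE = re.compile(
    r"Could not reacquire lock on interval\[(?P<interval>[^\]]+)\] "
    r"version\[(?P<version>[^\]]+)\] for task: (?P<task>[^\s}]+)"
)

def parse_reacquire_failure(message):
    """Return a dict with interval, version and task, or None if no match."""
    match = REACQUIRE_RE.search(message)
    return match.groupdict() if match else None

message = (
    "exceptionMessage=org.apache.druid.java.util.common.ISE: "
    "Could not reacquire lock on "
    "interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] "
    "version[2022-03-19T01:36:35.743Z] for task: "
    "index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}"
)

info = parse_reacquire_failure(message)
print(info["task"])
print(info["interval"])
```

The extracted task ID can then be used to look up the matching lock entry in the metadata store.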
We were only able to clear the problem and allow jobs to proceed by logging into the Postgres metadb and truncating the druid_tasklocks table.
Looking at the offending lock, we see it contains this information:
{"granularity":"timeChunk","type":"EXCLUSIVE","groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z","dataSource":"events-hour","interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z","version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}
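Truncating the whole table removes every lock, so it may be worth first identifying only the entries marked as revoked. Below is a minimal sketch, assuming the lock payloads have already been exported from the metadata store as JSON strings shaped like the one above; the function name find_revoked_locks is hypothetical, and nothing here talks to Druid or Postgres.

```python
import json

def find_revoked_locks(payloads):
    """Parse JSON lock payloads and return those whose `revoked` flag is true."""
    revoked = []
    for raw in payloads:
        lock = json.loads(raw)
        # A missing `revoked` field is treated as not revoked (an assumption).
        if lock.get("revoked", False):
            revoked.append(lock)
    return revoked

# The offending payload quoted in this report.
payload = (
    '{"granularity":"timeChunk","type":"EXCLUSIVE",'
    '"groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z",'
    '"dataSource":"events-hour",'
    '"interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z",'
    '"version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}'
)

for lock in find_revoked_locks([payload]):
    print(lock["groupId"], lock["interval"])
```

Whether removing only the revoked entries is enough to let leader election succeed is not confirmed by this report; the sketch only narrows down the candidates.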
The logs for the associated task begin like this:
Mar 19, 2022 @ 01:36:35.773 Task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z] started.
Mar 19, 2022 @ 01:36:37.933 Revoking task lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] for task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z]
Mar 19, 2022 @ 01:36:37.934 Replacing an existing lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] with a new lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=true}] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z
Mar 19, 2022 @ 01:36:37.957 Revoked taskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}]
Around this time, one of the coordinator/overlords was restarted. The leader election seems to have occurred cleanly, but then it immediately began spamming the log message shown above.
Ran into this exact same issue on 0.23.0
Good to know, thanks @abhishekagarwal87 😃