question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Overlords fail to elect any leaders preventing ingestion tasks from beginning

See original GitHub issue

Affected Version

0.22.1

Description

We have 2 coordinators acting as overlord. With no apparent cause, neither overlords were being elected as leader. We performed a rolling restart of every Druid service and our Zookeeper nodes, but nothing changed. We observed this log message:

listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] version[2022-03-19T01:36:35.743Z] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}

We were only able to clear the problem and allow jobs to proceed by logging into the Postgres metadb and truncating the druid_tasklocks table.

Looking at the offending lock, we see it contains this information:

{"granularity":"timeChunk","type":"EXCLUSIVE","groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z","dataSource":"events-hour","interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z","version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}

The logs for the associated task begin like this:

Mar 19, 2022 @ 01:36:35.773 Task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z] started.
Mar 19, 2022 @ 01:36:37.933 Revoking task lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] for task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z]
Mar 19, 2022 @ 01:36:37.934 Replacing an existing lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] with a new lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=true}] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z
Mar 19, 2022 @ 01:36:37.957 Revoked taskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}]

Around this time, one of the coordinator/overlords was restarted. The leader election seems to have occurred cleanly, but then it immediately began spamming the log from above.

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:7 (1 by maintainers)

github_iconTop GitHub Comments

1reaction
forzamehlanocommented, Aug 15, 2022

Ran into this exact same issue on 0.23.0

0reactions
ThomasBarachcommented, Oct 21, 2022

Good to know, thanks @abhishekagarwal87 😃

Read more comments on GitHub >

github_iconTop Results From Across the Web

Overlord Scaling Issue - Ingestion - Druid Forum
I've below Druid Stack. 1 Overlord - m5.12xlarge 50 MiddleManagers - i3en.24xlarge. I'm submitting S3 ingestion tasks of parquet files using ...
Read more >
Druid Real time ingestion issue - Two overlords taking the ...
I am using Druid 0.8.1 and tranquility 2.10:0.4.2 + Storm to ingest real ... As a result all my real time index tasks...
Read more >
Kafka Indexing Service - Apache Druid
The Kafka indexing service enables the configuration of supervisors on the Overlord, which facilitate ingestion from Kafka by managing the creation and lifetime ......
Read more >
Running a Cost-Effective Druid Cluster on AWS Spot Instances
Let's start with ingestion service, i.e. Middle Managers. We need Middle Manager servers only for executing tasks (ingestion, kill, compaction).
Read more >
Solved: Failing to Submit Index Task to Druid's Overlord v...
After I store the USGS data into a local file and submit an ingestion spec referring to the local file to Druid Overlord,...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found