Overlords fail to elect any leaders preventing ingestion tasks from beginning

Affected Version

0.22.1

Description

We have 2 coordinators acting as overlords. With no apparent cause, neither overlord was being elected leader. We performed a rolling restart of every Druid service and of our Zookeeper nodes, but nothing changed. We observed this log message:

listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] version[2022-03-19T01:36:35.743Z] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}

We were only able to clear the problem and allow jobs to proceed by logging into the Postgres metadb and truncating the druid_tasklocks table.

Looking at the offending lock, we see it contains this information:

{"granularity":"timeChunk","type":"EXCLUSIVE","groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z","dataSource":"events-hour","interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z","version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}
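As a less drastic alternative to truncating the whole table, one could first identify which stored locks are marked as revoked. The following is only a minimal sketch: it assumes the rows have already been fetched from druid_tasklocks as (id, task_id, lock_payload) tuples, and that the payload is JSON shaped like the one above. The function name find_revoked_locks and the row id 1 are hypothetical, and this is not a confirmed fix for the issue.

```python
import json


def find_revoked_locks(rows):
    """Return (row_id, task_id, lock) for each row whose payload has revoked=true.

    `rows` is assumed to be an iterable of (id, task_id, lock_payload) tuples,
    where lock_payload is a JSON string or raw bytes.
    """
    revoked = []
    for row_id, task_id, payload in rows:
        # Database drivers may return the payload as bytes rather than str.
        if isinstance(payload, (bytes, bytearray, memoryview)):
            payload = bytes(payload).decode("utf-8")
        lock = json.loads(payload)
        if lock.get("revoked") is True:
            revoked.append((row_id, task_id, lock))
    return revoked


# The payload reported in this issue; the row id (1) is made up for illustration.
payload = (
    '{"granularity":"timeChunk","type":"EXCLUSIVE",'
    '"groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z",'
    '"dataSource":"events-hour",'
    '"interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z",'
    '"version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}'
)
rows = [(1, "index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z", payload)]

for row_id, task_id, lock in find_revoked_locks(rows):
    print(row_id, task_id, lock["interval"])
```

Any rows flagged this way would still need to be reviewed by hand before deleting them; whether removing only the revoked lock is enough to unblock leader election was not verified in this report.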

The logs for the associated task begin like this:

Mar 19, 2022 @ 01:36:35.773 Task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z] started.
Mar 19, 2022 @ 01:36:37.933 Revoking task lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] for task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z]
Mar 19, 2022 @ 01:36:37.934 Replacing an existing lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] with a new lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=true}] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z
Mar 19, 2022 @ 01:36:37.957 Revoked taskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}]

Around this time, one of the coordinator/overlords was restarted. The leader election seems to have occurred cleanly, but the newly elected overlord then immediately began repeating the log message shown above.

Top GitHub Comments

forzamehlano commented, Aug 15, 2022

Ran into this exact same issue on 0.23.0

ThomasBarach commented, Oct 21, 2022

Good to know, thanks @abhishekagarwal87 😃
