
constant amount of underreplicated records

See original GitHub issue

  • CrateDB version: 2.2.7 (Docker)
  • OS: CentOS 7.4
  • Cluster: 12 nodes, 10 GB heap size
  • crate-python: 0.21.1

Problem description:

A constant number of underreplicated records that never decreases (a screenshot is attached to the original issue).
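Not part of the original report, but a quick way to see which shards are behind on replication is to query the sys.shards system table with the crate-python client mentioned above. A minimal sketch, with the host URL as a placeholder:

from crate import client

# Placeholder host; point this at any node of the cluster.
conn = client.connect('http://localhost:4200')
cursor = conn.cursor()
# Shards that are not STARTED are still recovering, relocating or unassigned,
# i.e. likely candidates for the underreplicated count shown in the admin UI.
cursor.execute("""
    SELECT schema_name, table_name, id, "primary", state, routing_state
    FROM sys.shards
    WHERE state != 'STARTED'
    ORDER BY schema_name, table_name, id
""")
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()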

hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | org.elasticsearch.env.ShardLockObtainFailedException: [.partitioned.events_schelling.04732d1p6cqjidho60o30c1g][2]: obtaining shard lock timed out after 5000ms
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:726) ~[crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:645) ~[crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:414) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]


Today I also found this in the log:

hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | [2018-02-09T11:12:33,542][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [hephaistos1crate] [.partitioned.events_schelling.04732d1n6ssjae1k60o30c1g][2]: failed to obtain shard lock
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | org.elasticsearch.env.ShardLockObtainFailedException: [.partitioned.events_schelling.04732d1n6ssjae1k60o30c1g][2]: obtaining shard lock timed out after 5000ms
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:726) ~[crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:645) ~[crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:414) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [crate-app-2.2.7.jar:2.2.7]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
hephaistos-crate.0.05njw74j4tug@hephaistos1.localdomain    | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

Steps to reproduce: not known exactly; a possible trigger is the upgrade from 2.2.4 to 2.2.7.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 20 (10 by maintainers)

Top GitHub Comments

1 reaction
mfussenegger commented on Feb 12, 2018

I’m now able to reproduce that there are entries stuck in sys.jobs - I haven’t yet done enough testing to see if I can also get the shard lock errors, but it’s likely related.
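(A minimal sketch, not from the original comment, of listing what is currently sitting in sys.jobs with crate-python; the connection URL is a placeholder, and long-lived rows here would correspond to the stuck entries mentioned above.)

from crate import client

# Placeholder host; sys.jobs lists the statements the cluster is currently tracking.
conn = client.connect('http://localhost:4200')
cursor = conn.cursor()
# Oldest jobs first; entries that never disappear are the "stuck" ones.
cursor.execute("SELECT id, started, stmt FROM sys.jobs ORDER BY started ASC")
for job_id, started, stmt in cursor.fetchall():
    print(job_id, started, stmt)
cursor.close()
conn.close()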

We’ll also take a closer look at the jmap issue or come up with an alternative recommendation on how to get heap dumps if CrateDB is used inside docker. But for now I don’t need a heap dump anymore.

Thanks so far.

0 reactions
sonix07 commented on Feb 27, 2018

@mfussenegger just confirming that version 2.3.3 fixed the problem. This issue can be closed.
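(A minimal sketch, not from the original thread, to confirm that every node actually reports the upgraded version after a rolling upgrade; the connection URL is a placeholder.)

from crate import client

# Placeholder host; after the upgrade every node should report the new version, e.g. 2.3.3.
conn = client.connect('http://localhost:4200')
cursor = conn.cursor()
cursor.execute("SELECT name, version['number'] FROM sys.nodes ORDER BY name")
for name, number in cursor.fetchall():
    print(name, number)
cursor.close()
conn.close()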

Read more comments on GitHub.

Top Results From Across the Web

  • Replication Dashboard | CockroachDB Docs: Under-replicated Ranges: When a cluster is first initialized, the few default starting ranges ... Under-replicated, The number of under-replicated ranges.
  • 5 Common Pitfalls When Using Apache Kafka - Confluent: Any number of under-replicated partitions is a sign of an unhealthy cluster, as it implies that your data is not fully replicated as ...
  • A Mechanism for Controlled Breakage of Under-replicated ...: Timely and orderly rounds of DNA replication and segregation require oscillatory phosphorylation events carried out by cyclin-dependent kinases ...
  • Kafka - Unravel Data: A total number of under-replicated partitions. This metric indicates if the partitions ... A Topic is a category into which the Kafka records...
  • Kafka performance monitoring metrics | MetricFire Blog: A record has four attributes: key, value, timestamp, and titles. ... Under-replicated partitions metrics are a leading indicator of one or ...
