Transport response handler not found
CrateDB version: 4.1.2
Environment description:
- CentOS 7 (latest), OpenJDK 11.0.7
- Data nodes: 8 CPU, 64 GB RAM
- Node makeup: 48 data, 3 master, 2 ingest, 2 query
- The 48 data nodes span 2 availability zones (24 per zone)
Problem description: Our cluster health occasionally gets stuck in yellow, and we have to restart crate on the affected nodes for the health to return to green. We have a Nagios check that runs an ALTER CLUSTER statement, which usually resolves the problem; however, some cases require manual intervention.
We typically see shards stay unassigned until we run ALTER CLUSTER REROUTE RETRY FAILED. Some logs from a related issue #9748:
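The automated remediation described above can be sketched roughly as follows. This is not the reporters' actual Nagios plugin; it is a minimal illustration assuming CrateDB's HTTP `_sql` endpoint (default port 4200). The host, port, and threshold logic are hypothetical placeholders.

```python
# Sketch of a Nagios-style check that issues the retry statement when
# shards are stuck unassigned. Assumes CrateDB's HTTP `_sql` endpoint;
# the URL below is a hypothetical placeholder.
import json
import urllib.request

CRATE_URL = "http://localhost:4200/_sql"  # hypothetical endpoint location


def build_sql_request(stmt: str) -> urllib.request.Request:
    """Wrap a SQL statement in the JSON body the `_sql` endpoint expects."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        CRATE_URL, data=body, headers={"Content-Type": "application/json"}
    )


def needs_retry(unassigned_shards: int) -> bool:
    """Retry allocation only when shards are actually stuck unassigned."""
    return unassigned_shards > 0


def remediate(unassigned_shards: int) -> None:
    """Run the retry statement mentioned in the report, if warranted."""
    if needs_retry(unassigned_shards):
        req = build_sql_request("ALTER CLUSTER REROUTE RETRY FAILED")
        with urllib.request.urlopen(req) as resp:
            print("retry submitted, HTTP status:", resp.status)
```

In practice the unassigned-shard count would come from a query such as one against the sys.shards table; as the report notes, this retry resolves most but not all of the stuck-yellow cases.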
shard has exceeded the maximum number of retries [20] on failed allocation attempts - manually execute 'alter cluster....' [unassigned_info[[reason=ALLOCATION_FAILED], at ..... failed to create shard, failure IOException[failed to obtain in-memory shard lock]...
[WARN ][o.e.i.c.IndicesClusterStateService] [hostname][[namespace..partitioned.tablename.someuuid][1]] marking and sending shard failed due to [failed to create shard] java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:358)
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:440)
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:112)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:551)
...
[INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [hostname][namespace..partitioned.tablename.someuuid][1]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [namespace..partitioned.tablename.someuuid][1]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [read metadata snapshot]
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:748)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:663)
at org.elasticsearch.index.Store.readMetadataSnapshot(Store.java:443)
....
AFTER running the retry command, shards get stuck in the RELOCATING state, with the following log message emitted at a very fast rate:
[WARN ][o.e.t.TransportService] [node] Transport response handler not found of id [9285317]
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 17 (8 by maintainers)
Top GitHub Comments
We’ve finally found the issue behind the "Transport handler not found ..." log entries; see https://github.com/crate/crate/pull/10797. Thank you for reporting, it was indeed an issue.

@seut I will get it to you via my colleague @rene-stiams.