Master lost worker connection due to gRPC error
Alluxio Version: 2.2.0
Describe the bug
1. No workers are registered with Alluxio:
docker exec -it alluxio-master bash
bash-4.4# alluxio fsadmin report capacity
No workers found.
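One way to confirm where registration is failing is to check the worker container's own log first. A minimal diagnostic sketch; the container name alluxio-worker matches the docker commands further down in this report:
# Diagnostic sketch: filter the worker container log for registration errors.
# Assumes the worker container is named alluxio-worker, as in the run command below.
docker logs alluxio-worker 2>&1 | grep -iE "register|error|exception" | tail -n 20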
- On the worker side, I noticed the following in the log:
2019-12-14 04:02:00,857 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-8879748846611160559. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2019-12-14 04:04:01,044 ERROR ProcessUtils - Uncaught exception while running Alluxio worker @192.168.0.117:29999, stopping it and exiting. Exception "alluxio.exception.status.UnavailableException: Failed after 44 attempts: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream", Root Cause "io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream"
alluxio.exception.status.UnavailableException: Failed after 44 attempts: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:399)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:344)
at alluxio.worker.block.BlockMasterClient.register(BlockMasterClient.java:222)
at alluxio.worker.block.BlockMasterSync.registerWithMaster(BlockMasterSync.java:106)
at alluxio.worker.block.BlockMasterSync.<init>(BlockMasterSync.java:93)
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlockWorker.java:214)
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlockWorker.java:77)
at alluxio.Registry.start(Registry.java:131)
at alluxio.worker.AlluxioWorkerProcess.startWorkers(AlluxioWorkerProcess.java:283)
at alluxio.worker.AlluxioWorkerProcess.start(AlluxioWorkerProcess.java:235)
at alluxio.ProcessUtils.run(ProcessUtils.java:35)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:71)
Caused by: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream
at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:125)
at alluxio.exception.status.AlluxioStatusException.fromStatusRuntimeException(AlluxioStatusException.java:210)
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:384)
... 11 more
Caused by: io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
at alluxio.grpc.BlockMasterWorkerServiceGrpc$BlockMasterWorkerServiceBlockingStub.registerWorker(BlockMasterWorkerServiceGrpc.java:477)
at alluxio.worker.block.BlockMasterClient.lambda$register$6(BlockMasterClient.java:223)
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:382)
... 11 more
2019-12-14 04:04:01,062 INFO GrpcDataServer - Shutting down Alluxio worker gRPC server at 0.0.0.0/0.0.0.0:29999.
And in the Alluxio worker directory, the number of block files is 950971:
[root@iZhp3bku0ru8vuxq08loruZ alluxioworker]# ls -ltr | wc -l
950971
It may be related to this gRPC issue: https://github.com/grpc/grpc-java/issues/2901
Any configuration suggestions?
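If the failure is indeed the register RPC being cancelled because its payload (~950k block IDs) exceeds the gRPC message limit, one thing to check is the master's maximum inbound message size. A minimal sketch, assuming the property alluxio.master.network.max.inbound.message.size exists in this Alluxio version (please verify against the 2.2.0 property list):
# Check the currently effective limit on the master (the property name is an
# assumption; verify it exists for Alluxio 2.2.0 before relying on it).
docker exec -it alluxio-master alluxio getConf alluxio.master.network.max.inbound.message.size
# If it is smaller than the register payload, restart the master with a larger
# value appended to ALLUXIO_JAVA_OPTS, for example:
#   -Dalluxio.master.network.max.inbound.message.size=512MB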
To Reproduce
1. Set up the under file system (UFS):
mkdir /var/lib/docker/alluxio-ufs/
2. Generate the data (157G, 3804847 files):
cd /var/lib/docker/alluxio-ufs/
wget http://kubeflow-sh.oss-cn-shanghai.aliyuncs.com/insight-face-data.tar.gz
tar -xvf insight-face-data.tar.gz
ls -ltr insight-face-data/images_2/ | wc -l
3804847
du -sh insight-face-data/images_2/
92G insight-face-data/images_2/
- Start Alluxio
# Launch the Alluxio Master
docker run -d \
--net=host \
-u=0 \
--name=alluxio-master \
--pid=host \
--security-opt=seccomp:unconfined \
-v /var/lib/docker/alluxio-journal:/journal \
-v /var/lib/docker/alluxio-ufs:/opt/alluxio/underFSStorage \
-v /dev/shm:/dev/shm \
-e ALLUXIO_JAVA_OPTS="-Dalluxio.master.hostname=$(hostname -i) -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.master.mount.table.root.ufs=/opt/alluxio/underFSStorage -Dalluxio.user.block.master.client.threads=32 -Dalluxio.user.block.size.bytes.default=32MB -Dalluxio.worker.file.buffer.size=33MB -Dalluxio.master.journal.folder=/journal -Dalluxio.master.journal.type=UFS -Dalluxio.user.block.write.location.policy.class=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.user.file.master.client.threads=128 -Dalluxio.user.file.passive.cache.enabled=false -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.network.reader.chunk.size.bytes=32MB -Dalluxio.job.worker.threadpool.size=30 -Dalluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.worker.block.master.client.pool.size=128 -Dalluxio.user.metadata.cache.max.size=1000000 -Dalluxio.user.direct.memory.io.enabled=true -Dalluxio.user.metadata.cache.enabled=true -Xms8G -Xmx8G " \
alluxio/alluxio:2.2.0-SNAPSHOT master --no-format
# Launch the Alluxio Worker
docker run -d \
--net=host \
-u=0 \
--pid=host \
--name=alluxio-worker \
-v /dev/shm:/dev/shm \
-v /var/lib/docker/alluxio-ufs:/opt/alluxio/underFSStorage \
--security-opt=seccomp:unconfined \
-e ALLUXIO_JAVA_OPTS="-Dalluxio.master.hostname=$(hostname -i) -Dalluxio.worker.hostname=$(hostname -i) -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.worker.tieredstore.levels=1 -Dalluxio.worker.tieredstore.level0.alias=MEM -Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM -Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm -Dalluxio.worker.tieredstore.level0.dirs.quota=200GB -Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7 -Dalluxio.fuse.debug.enabled=false -Dalluxio.master.journal.folder=/journal -Dalluxio.master.journal.type=UFS -Dalluxio.job.worker.threadpool.size=30 -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.user.block.master.client.threads=32 -Dalluxio.user.block.size.bytes.default=32MB -Dalluxio.worker.file.buffer.size=33MB -Dalluxio.user.block.write.location.policy.class=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.user.file.master.client.threads=128 -Dalluxio.user.file.passive.cache.enabled=false -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.network.reader.chunk.size.bytes=32MB -Dalluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.worker.block.master.client.pool.size=128 -Dalluxio.user.metadata.cache.max.size=1000000 -Dalluxio.user.direct.memory.io.enabled=true -Dalluxio.user.metadata.cache.enabled=true -Xms6G -Xmx6G " \
alluxio/alluxio:2.2.0-SNAPSHOT worker --no-format
- Check the status
# docker exec -it alluxio-master bash
bash-4.4# alluxio version -r
2.2.0-SNAPSHOT-f6738a5e4c9adf7a2f02023787102eff53f1193d
bash-4.4# alluxio fsadmin report capacity
Capacity information for all workers:
Total Capacity: 200.00GB
Tier: MEM Size: 200.00GB
Used Capacity: 0B
Tier: MEM Size: 0B
Used Percentage: 0%
Free Percentage: 100%
Worker Name Last Heartbeat Storage MEM
192.168.0.119 0 capacity 200.00GB
used 0B (0%)
- Check the data info
bash-4.4# alluxio fs ls /insight-face-data/images_2 | wc -l
3804847
time alluxio fs du -sh /insight-face-data/images_2
File Size In Alluxio Path
84.16GB 0B (0%) /insight-face-data/images_2
real 3m55.119s
user 56m47.199s
sys 0m30.750s
- Load the data, and find that it hangs:
time /opt/alluxio/bin/alluxio fs distributedLoad --replication 1 /insight-face-data/images_2
- After more than 7 hours, I checked the worker directory:
# pwd
/dev/shm/alluxioworker
# ls -ltr | wc -l
1440963
# du -sh /dev/shm/alluxioworker
37G /dev/shm/alluxioworker
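For reference, a minimal sketch of how progress can be tracked while distributedLoad is running, simply repeating the commands used above on an interval:
# Poll worker usage and the number of block files every 10 minutes.
while true; do
  date
  docker exec alluxio-master alluxio fsadmin report capacity | grep -E "Total Capacity|Used Capacity"
  ls /dev/shm/alluxioworker | wc -l
  sleep 600
done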
- After stopping the load, I noticed it had run for about 489m59.270s and loaded 33.25GB:
/insight-face-data/images_2/363666.png loading
/insight-face-data/images_2/2442193.png loading
/insight-face-data/images_2/1709976.png loading
/insight-face-data/images_2/1329333.png loading
^C/insight-face-data/images_2/3702989.png loading
real 489m59.270s
user 14384m34.237s
sys 18m10.609s
bash-4.4# alluxio fsadmin report capacity
Capacity information for all workers:
Total Capacity: 200.00GB
Tier: MEM Size: 200.00GB
Used Capacity: 33.25GB
Tier: MEM Size: 33.25GB
Used Percentage: 16%
Free Percentage: 84%
Worker Name Last Heartbeat Storage MEM
192.168.0.117 0 capacity 200.00GB
used 33.25GB (16%)
- Restart the worker
docker stop alluxio-worker
sleep 120
docker start alluxio-worker
- And when I access the worker again, I can see the same error:
Received Rst Stream", Root Cause "io.grpc.StatusRuntimeException: CAN
Received Rst Stream"
alluxio.exception.status.UnavailableException: Failed after 44 attemp
Received Rst Stream
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
at alluxio.AbstractClient.retryRPC(AbstractClient.java:344)
at alluxio.worker.block.BlockMasterClient.register(BlockMaste
at alluxio.worker.block.BlockMasterSync.registerWithMaster(Bl
at alluxio.worker.block.BlockMasterSync.<init>(BlockMasterSyn
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlock
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlock
at alluxio.Registry.start(Registry.java:131)
at alluxio.worker.AlluxioWorkerProcess.startWorkers(AlluxioWo
at alluxio.worker.AlluxioWorkerProcess.start(AlluxioWorkerPro
at alluxio.ProcessUtils.run(ProcessUtils.java:35)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:78)
Caused by: alluxio.exception.status.CancelledException: HTTP/2 error
Received Rst Stream
at alluxio.exception.status.AlluxioStatusException.from(Allux
at alluxio.exception.status.AlluxioStatusException.fromStatus
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
... 11 more
Caused by: io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error co
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCa
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.jav
at alluxio.grpc.BlockMasterWorkerServiceGrpc$BlockMasterWorke
at alluxio.worker.block.BlockMasterClient.lambda$register$6(B
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
... 11 more
Expected behavior
The worker should register with the master successfully and show up in alluxio fsadmin report capacity.
Top GitHub Comments
@cheyang This should be addressed in https://github.com/Alluxio/alluxio/pull/11305. You will need to increase the configuration value to a larger number.
Closing this for now, please re-open if you find further issues.
I’ve already increased the heap size to 16GB, and it’s still the same issue. According to @apc999, it’s due to the number of Alluxio blocks.
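Based on that diagnosis, one workaround sketch (not suggested in the thread, and it discards the cached blocks, which then have to be loaded again) is to empty the worker storage directory so that the register RPC only carries a small number of block IDs:
# Hypothetical workaround: discard the cached blocks so the worker registers
# with an almost empty block list. The data must be re-loaded afterwards.
docker stop alluxio-worker
rm -rf /dev/shm/alluxioworker/*
docker start alluxio-worker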