Master lost worker connection due to gRPC error
Alluxio Version: 2.2.0
Describe the bug
1. No workers are registered with Alluxio:
docker exec -it alluxio-master bash
bash-4.4# alluxio fsadmin report capacity
No workers found.
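One way to confirm where registration is failing is to check the worker container's own log first. A minimal diagnostic sketch; the container name alluxio-worker matches the docker commands further down in this report:
# Diagnostic sketch: filter the worker container log for registration errors.
# Assumes the worker container is named alluxio-worker, as in the run command below.
docker logs alluxio-worker 2>&1 | grep -iE "register|error|exception" | tail -n 20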
- On the worker side, I noticed the following in the log:
2019-12-14 04:02:00,857 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-8879748846611160559. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2019-12-14 04:04:01,044 ERROR ProcessUtils - Uncaught exception while running Alluxio worker @192.168.0.117:29999, stopping it and exiting. Exception "alluxio.exception.status.UnavailableException: Failed after 44 attempts: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream", Root Cause "io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream"
alluxio.exception.status.UnavailableException: Failed after 44 attempts: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:399)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:344)
at alluxio.worker.block.BlockMasterClient.register(BlockMasterClient.java:222)
at alluxio.worker.block.BlockMasterSync.registerWithMaster(BlockMasterSync.java:106)
at alluxio.worker.block.BlockMasterSync.<init>(BlockMasterSync.java:93)
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlockWorker.java:214)
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlockWorker.java:77)
at alluxio.Registry.start(Registry.java:131)
at alluxio.worker.AlluxioWorkerProcess.startWorkers(AlluxioWorkerProcess.java:283)
at alluxio.worker.AlluxioWorkerProcess.start(AlluxioWorkerProcess.java:235)
at alluxio.ProcessUtils.run(ProcessUtils.java:35)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:71)
Caused by: alluxio.exception.status.CancelledException: HTTP/2 error code: CANCEL
Received Rst Stream
at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:125)
at alluxio.exception.status.AlluxioStatusException.fromStatusRuntimeException(AlluxioStatusException.java:210)
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:384)
... 11 more
Caused by: io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
at alluxio.grpc.BlockMasterWorkerServiceGrpc$BlockMasterWorkerServiceBlockingStub.registerWorker(BlockMasterWorkerServiceGrpc.java:477)
at alluxio.worker.block.BlockMasterClient.lambda$register$6(BlockMasterClient.java:223)
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:382)
... 11 more
2019-12-14 04:04:01,062 INFO GrpcDataServer - Shutting down Alluxio worker gRPC server at 0.0.0.0/0.0.0.0:29999.
And in the Alluxio worker directory, the number of block files is 950971:
[root@iZhp3bku0ru8vuxq08loruZ alluxioworker]# ls -ltr | wc -l
950971
It may be related to this gRPC issue: https://github.com/grpc/grpc-java/issues/2901
Any configuration suggestions?
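If the failure is indeed the register RPC being cancelled because its payload (~950k block IDs) exceeds the gRPC message limit, one thing to check is the master's maximum inbound message size. A minimal sketch, assuming the property alluxio.master.network.max.inbound.message.size exists in this Alluxio version (please verify against the 2.2.0 property list):
# Check the currently effective limit on the master (the property name is an
# assumption; verify it exists for Alluxio 2.2.0 before relying on it).
docker exec -it alluxio-master alluxio getConf alluxio.master.network.max.inbound.message.size
# If it is smaller than the register payload, restart the master with a larger
# value appended to ALLUXIO_JAVA_OPTS, for example:
#   -Dalluxio.master.network.max.inbound.message.size=512MB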
To Reproduce
1. Set up the under file system (UFS):
mkdir /var/lib/docker/alluxio-ufs/
2. Generate the data (157G, 3804847 files):
cd /var/lib/docker/alluxio-ufs/
wget http://kubeflow-sh.oss-cn-shanghai.aliyuncs.com/insight-face-data.tar.gz
tar -xvf insight-face-data.tar.gz
ls -ltr insight-face-data/images_2/ | wc -l
3804847
du -sh insight-face-data/images_2/
92G insight-face-data/images_2/
- Start Alluxio
# Launch the Alluxio Master
docker run -d \
--net=host \
-u=0 \
--name=alluxio-master \
--pid=host \
--security-opt=seccomp:unconfined \
-v /var/lib/docker/alluxio-journal:/journal \
-v /var/lib/docker/alluxio-ufs:/opt/alluxio/underFSStorage \
-v /dev/shm:/dev/shm \
-e ALLUXIO_JAVA_OPTS="-Dalluxio.master.hostname=$(hostname -i) -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.master.mount.table.root.ufs=/opt/alluxio/underFSStorage -Dalluxio.user.block.master.client.threads=32 -Dalluxio.user.block.size.bytes.default=32MB -Dalluxio.worker.file.buffer.size=33MB -Dalluxio.master.journal.folder=/journal -Dalluxio.master.journal.type=UFS -Dalluxio.user.block.write.location.policy.class=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.user.file.master.client.threads=128 -Dalluxio.user.file.passive.cache.enabled=false -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.network.reader.chunk.size.bytes=32MB -Dalluxio.job.worker.threadpool.size=30 -Dalluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.worker.block.master.client.pool.size=128 -Dalluxio.user.metadata.cache.max.size=1000000 -Dalluxio.user.direct.memory.io.enabled=true -Dalluxio.user.metadata.cache.enabled=true -Xms8G -Xmx8G " \
alluxio/alluxio:2.2.0-SNAPSHOT master --no-format
# Launch the Alluxio Worker
docker run -d \
--net=host \
-u=0 \
--pid=host \
--name=alluxio-worker \
-v /dev/shm:/dev/shm \
-v /var/lib/docker/alluxio-ufs:/opt/alluxio/underFSStorage \
--security-opt=seccomp:unconfined \
-e ALLUXIO_JAVA_OPTS="-Dalluxio.master.hostname=$(hostname -i) -Dalluxio.worker.hostname=$(hostname -i) -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.worker.tieredstore.levels=1 -Dalluxio.worker.tieredstore.level0.alias=MEM -Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM -Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm -Dalluxio.worker.tieredstore.level0.dirs.quota=200GB -Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7 -Dalluxio.fuse.debug.enabled=false -Dalluxio.master.journal.folder=/journal -Dalluxio.master.journal.type=UFS -Dalluxio.job.worker.threadpool.size=30 -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.user.block.master.client.threads=32 -Dalluxio.user.block.size.bytes.default=32MB -Dalluxio.worker.file.buffer.size=33MB -Dalluxio.user.block.write.location.policy.class=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.user.file.master.client.threads=128 -Dalluxio.user.file.passive.cache.enabled=false -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.network.reader.chunk.size.bytes=32MB -Dalluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy -Dalluxio.worker.block.master.client.pool.size=128 -Dalluxio.user.metadata.cache.max.size=1000000 -Dalluxio.user.direct.memory.io.enabled=true -Dalluxio.user.metadata.cache.enabled=true -Xms6G -Xmx6G " \
alluxio/alluxio:2.2.0-SNAPSHOT worker --no-format
- Check the status
# docker exec -it alluxio-master bash
bash-4.4# alluxio version -r
2.2.0-SNAPSHOT-f6738a5e4c9adf7a2f02023787102eff53f1193d
bash-4.4# alluxio fsadmin report capacity
Capacity information for all workers:
Total Capacity: 200.00GB
Tier: MEM Size: 200.00GB
Used Capacity: 0B
Tier: MEM Size: 0B
Used Percentage: 0%
Free Percentage: 100%
Worker Name Last Heartbeat Storage MEM
192.168.0.119 0 capacity 200.00GB
used 0B (0%)
- Check the data info
bash-4.4# alluxio fs ls /insight-face-data/images_2 | wc -l
3804847
time alluxio fs du -sh /insight-face-data/images_2
File Size In Alluxio Path
84.16GB 0B (0%) /insight-face-data/images_2
real 3m55.119s
user 56m47.199s
sys 0m30.750s
- Load the data, and find that it hangs:
time /opt/alluxio/bin/alluxio fs distributedLoad --replication 1 /insight-face-data/images_2
- After more than 7 hours, I checked the worker directory:
# pwd
/dev/shm/alluxioworker
# ls -ltr | wc -l
1440963
# du -sh /dev/shm/alluxioworker
37G /dev/shm/alluxioworker
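For reference, a minimal sketch of how progress can be tracked while distributedLoad is running, simply repeating the commands used above on an interval:
# Poll worker usage and the number of block files every 10 minutes.
while true; do
  date
  docker exec alluxio-master alluxio fsadmin report capacity | grep -E "Total Capacity|Used Capacity"
  ls /dev/shm/alluxioworker | wc -l
  sleep 600
done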
- After stopping the load, I noticed it had run for about 489m59.270s and loaded 33.25GB:
/insight-face-data/images_2/363666.png loading
/insight-face-data/images_2/2442193.png loading
/insight-face-data/images_2/1709976.png loading
/insight-face-data/images_2/1329333.png loading
^C/insight-face-data/images_2/3702989.png loading
real 489m59.270s
user 14384m34.237s
sys 18m10.609s
bash-4.4# alluxio fsadmin report capacity
Capacity information for all workers:
Total Capacity: 200.00GB
Tier: MEM Size: 200.00GB
Used Capacity: 33.25GB
Tier: MEM Size: 33.25GB
Used Percentage: 16%
Free Percentage: 84%
Worker Name Last Heartbeat Storage MEM
192.168.0.117 0 capacity 200.00GB
used 33.25GB (16%)
- Restart the worker
docker stop alluxio-worker
sleep 120
docker start alluxio-worker
- And when I access the worker again, I can see the same error:
Received Rst Stream", Root Cause "io.grpc.StatusRuntimeException: CAN
Received Rst Stream"
alluxio.exception.status.UnavailableException: Failed after 44 attemp
Received Rst Stream
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
at alluxio.AbstractClient.retryRPC(AbstractClient.java:344)
at alluxio.worker.block.BlockMasterClient.register(BlockMaste
at alluxio.worker.block.BlockMasterSync.registerWithMaster(Bl
at alluxio.worker.block.BlockMasterSync.<init>(BlockMasterSyn
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlock
at alluxio.worker.block.DefaultBlockWorker.start(DefaultBlock
at alluxio.Registry.start(Registry.java:131)
at alluxio.worker.AlluxioWorkerProcess.startWorkers(AlluxioWo
at alluxio.worker.AlluxioWorkerProcess.start(AlluxioWorkerPro
at alluxio.ProcessUtils.run(ProcessUtils.java:35)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:78)
Caused by: alluxio.exception.status.CancelledException: HTTP/2 error
Received Rst Stream
at alluxio.exception.status.AlluxioStatusException.from(Allux
at alluxio.exception.status.AlluxioStatusException.fromStatus
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
... 11 more
Caused by: io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error co
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCa
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.jav
at alluxio.grpc.BlockMasterWorkerServiceGrpc$BlockMasterWorke
at alluxio.worker.block.BlockMasterClient.lambda$register$6(B
at alluxio.AbstractClient.retryRPCInternal(AbstractClient.jav
... 11 more
Expected behavior
The worker should register with the master successfully and show up in alluxio fsadmin report capacity.
Top GitHub Comments
@cheyang This should be addressed in https://github.com/Alluxio/alluxio/pull/11305. You will need to increase the configuration value to a larger number.
Closing this for now, please re-open if you find further issues.
I’ve already increased the heap size to 16GB, and it’s still the same issue. According to @apc999, it’s due to the number of Alluxio blocks.
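Based on that diagnosis, one workaround sketch (not suggested in the thread, and it discards the cached blocks, which then have to be loaded again) is to empty the worker storage directory so that the register RPC only carries a small number of block IDs:
# Hypothetical workaround: discard the cached blocks so the worker registers
# with an almost empty block list. The data must be re-loaded afterwards.
docker stop alluxio-worker
rm -rf /dev/shm/alluxioworker/*
docker start alluxio-worker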