AsyncBlockRemover stuck not freeing any blocks
Alluxio Version: 2.1.1
Describe the bug This issue was reported by @JySongWithZhangCe. Running the command
alluxio fs free /
freed the root mount, but one worker failed to free its space, as illustrated.
The worker.log contains many messages like the following:
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504054677506 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 502712500226 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 503652024322 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504591548418 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 503584915458 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 505078087685 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 505010978821 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504004345861 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 502662168581 is being removed. Current queue size is 320.
AsyncBlockRemover receives many remove calls, but removal is not making any progress: there are around 85K lines of the message above, yet the queue size stays at 320.
Whenever AsyncBlockRemover successfully removes a block, the debug log should contain one line of the form below. Against roughly 85K of the messages above, only about 2 blocks were successfully removed.
Block XXX is removed in thread XXX.
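The comparison above can be reproduced mechanically. The following is a minimal sketch, not part of Alluxio: the class name RemoverLogStats and the default log path are hypothetical, and the two patterns only mirror the message shapes quoted in this report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

/** Counts enqueue vs. completion messages written by AsyncBlockRemover at DEBUG level. */
final class RemoverLogStats {
  // Both patterns are derived from the messages quoted in this issue (an assumption
  // about their exact shape, not taken from the Alluxio source).
  private static final Pattern QUEUED =
      Pattern.compile("AsyncBlockRemover - \\d+ is being removed\\. Current queue size is \\d+\\.");
  private static final Pattern REMOVED =
      Pattern.compile("Block \\d+ is removed in thread \\S+");

  /** Returns {queuedCount, removedCount} for the given log lines. */
  static long[] count(List<String> lines) {
    long queued = 0;
    long removed = 0;
    for (String line : lines) {
      if (QUEUED.matcher(line).find()) {
        queued++;
      } else if (REMOVED.matcher(line).find()) {
        removed++;
      }
    }
    return new long[] {queued, removed};
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical default path; pass the real worker.log location as the first argument.
    String path = args.length > 0 ? args[0] : "logs/worker.log";
    long[] counts = count(Files.readAllLines(Paths.get(path)));
    System.out.printf("queued=%d removed=%d%n", counts[0], counts[1]);
  }
}
```

A large gap between the two counts, as described here, suggests that blocks are being enqueued but the remover threads are not completing them.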
I ran jstack on the Alluxio worker, and ALL remover threads are waiting on the block lock.
"block-removal-service-8" #1097 daemon prio=5 os_prio=0 tid=0x00007f8e0d5be800 nid=0x155c4d waiting on condition [0x00007f8a44ed1000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000006b416a280> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.Semaphore.acquireUninterruptibly(Semaphore.java:494)
    at alluxio.worker.block.ClientRWLock$SessionLock.lock(ClientRWLock.java:92)
    at alluxio.worker.block.BlockLockManager.lockBlock(BlockLockManager.java:114)
    at alluxio.worker.block.TieredBlockStore.removeBlockInternal(TieredBlockStore.java:835)
    at alluxio.worker.block.TieredBlockStore.removeBlock(TieredBlockStore.java:351)
    at alluxio.worker.block.TieredBlockStore.removeBlock(TieredBlockStore.java:344)
    at alluxio.worker.block.DefaultBlockWorker.removeBlock(DefaultBlockWorker.java:517)
    at alluxio.worker.block.AsyncBlockRemover$BlockRemover.run(AsyncBlockRemover.java:103)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
See attachment for jstack and worker.log.
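To check whether every remover thread is stuck at the same point, the thread dump can be scanned programmatically. The following is a rough sketch under stated assumptions: ParkedRemoverThreads is a hypothetical helper, the default dump path is made up, and the thread-name prefix and frame name are taken from the trace quoted in this report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Lists remover threads in a jstack dump whose stack passes through BlockLockManager.lockBlock. */
final class ParkedRemoverThreads {
  static List<String> find(List<String> dumpLines) {
    List<String> blocked = new ArrayList<>();
    String current = null; // remover thread whose frames are being read, or null
    for (String raw : dumpLines) {
      String line = raw.trim();
      if (line.startsWith("\"")) {
        // A thread header looks like: "thread-name" #id daemon prio=...
        int end = line.indexOf('"', 1);
        String name = end > 0 ? line.substring(1, end) : "";
        current = name.startsWith("block-removal-service-") ? name : null;
      } else if (current != null && line.contains("BlockLockManager.lockBlock")) {
        blocked.add(current);
        current = null; // record each thread at most once
      }
    }
    return blocked;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical default path; pass the real jstack output file as the first argument.
    String path = args.length > 0 ? args[0] : "jstack.log";
    List<String> blocked = find(Files.readAllLines(Paths.get(path)));
    System.out.println(blocked.size() + " remover thread(s) waiting in lockBlock: " + blocked);
  }
}
```

If the number of threads reported equals the size of the block-removal-service pool, no remover thread is left to make progress, which matches the behavior described here.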
To Reproduce This only happened once and is not very easily reproduced.
- Mount HDFS to alluxio.
- Run Presto queries and use a lot of worker space.
- Run alluxio fs free / (the free command did not return any error; after the free, alluxio fs ls -R / showed ALL files under Alluxio at 0%).
Expected behavior AsyncBlockRemover makes progress.
Urgency MEDIUM
Additional context worker.log jstack.log
Issue Analytics
- Created 3 years ago
- Comments:13 (12 by maintainers)
Top GitHub Comments
Hello @ZacBlanco, I encountered the same problem on alluxio-2.3.0. All block-removal-service-%d threads are blocked trying to acquire the lock. I dumped the heap (below): SessionLock has 10 more instances than LockRecord, while the size of the block-removal-service pool is 10. Is there some case in which SessionLock instances are not released?
This should be solved by https://github.com/Alluxio/alluxio/pull/12643
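To illustrate why an unreleased lock would produce this symptom, here is a toy sketch of a read/write lock built on a counting semaphore, in which a reader takes one permit and a writer takes all of them. This is an assumption about the general shape suggested by the Semaphore frames in the stack trace. It is not Alluxio's ClientRWLock and not the change made in the linked pull request, and LeakyRwLockDemo and MAX_READERS are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy semaphore-based read/write lock showing how one leaked read lock starves a writer. */
final class LeakyRwLockDemo {
  static final int MAX_READERS = 4; // hypothetical permit count

  private final Semaphore permits = new Semaphore(MAX_READERS);

  void lockRead() {
    permits.acquireUninterruptibly(); // a reader holds one permit
  }

  void unlockRead() {
    permits.release();
  }

  /** A writer needs every permit; a timeout is used so the demo cannot hang. */
  boolean tryLockWrite(long timeoutMs) {
    try {
      return permits.tryAcquire(MAX_READERS, timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  void unlockWrite() {
    permits.release(MAX_READERS);
  }

  /** Returns whether a writer obtains the lock after a reader that does or does not unlock. */
  static boolean writerSucceedsAfterReader(boolean readerReleases) {
    LeakyRwLockDemo lock = new LeakyRwLockDemo();
    lock.lockRead();
    if (readerReleases) {
      lock.unlockRead();
    }
    boolean acquired = lock.tryLockWrite(100);
    if (acquired) {
      lock.unlockWrite();
    }
    return acquired;
  }

  public static void main(String[] args) {
    System.out.println("reader unlocks: writer acquired = " + writerSucceedsAfterReader(true));
    System.out.println("reader leaks:   writer acquired = " + writerSucceedsAfterReader(false));
  }
}
```

In this toy model the writer succeeds only when the reader has released its permit. With an untimed acquireUninterruptibly, as in the stack trace, a writer in the leaked case would park indefinitely, which is consistent with remover threads that never finish.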