AsyncBlockRemover stuck not freeing any blocks

Alluxio Version: 2.1.1

Describe the bug: This issue was reported by @JySongWithZhangCe.

Running alluxio fs free / freed the root mount, but one worker failed to free its space, as shown in a screenshot attached to the original issue.

The worker.log contains many messages like the following:

2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504054677506 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 502712500226 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 503652024322 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504591548418 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 503584915458 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 505078087685 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 505010978821 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 504004345861 is being removed. Current queue size is 320.
2020-04-17 14:50:41,485 DEBUG AsyncBlockRemover - 502662168581 is being removed. Current queue size is 320.

AsyncBlockRemover receives many remove calls, but removal is not making any progress. There are around 85K lines of the above message, while the queue size stays at 320.

Whenever AsyncBlockRemover successfully removes a block, the debug log should contain one line like the one below. However, compared to the 85K messages above, only about 2 blocks were successfully removed.

Block XXX is removed in thread XXX.
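
The comparison above can be reproduced with a small log counter. This is a minimal sketch, not part of Alluxio: the class name RemoverLogStats is hypothetical, and the two regular expressions assume that the block ID is numeric and that the messages are worded exactly as quoted in this issue.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class RemoverLogStats {
  // Assumed wording, taken from the two DEBUG messages quoted in this issue.
  private static final Pattern QUEUED =
      Pattern.compile("AsyncBlockRemover - \\d+ is being removed");
  private static final Pattern REMOVED =
      Pattern.compile("Block \\d+ is removed in thread");

  /** Returns {queued, removed}: how many lines match each message. */
  public static long[] count(List<String> lines) {
    long queued = 0;
    long removed = 0;
    for (String line : lines) {
      if (QUEUED.matcher(line).find()) {
        queued++;
      } else if (REMOVED.matcher(line).find()) {
        removed++;
      }
    }
    return new long[] {queued, removed};
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: java RemoverLogStats <path-to-worker.log>");
      return;
    }
    long[] counts = count(Files.readAllLines(Paths.get(args[0])));
    System.out.println("being removed: " + counts[0] + ", removed: " + counts[1]);
  }
}
```

A large gap between the two counts, as in this report, suggests that blocks are being queued far faster than the remover threads complete them.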

I ran jstack on the Alluxio worker, and ALL remover threads are waiting on the block lock.

"block-removal-service-8" #1097 daemon prio=5 os_prio=0 tid=0x00007f8e0d5be800 nid=0x155c4d waiting on condition [0x00007f8a44ed1000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006b416a280> (a java.util.concurrent.Semaphore$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
	at java.util.concurrent.Semaphore.acquireUninterruptibly(Semaphore.java:494)
	at alluxio.worker.block.ClientRWLock$SessionLock.lock(ClientRWLock.java:92)
	at alluxio.worker.block.BlockLockManager.lockBlock(BlockLockManager.java:114)
	at alluxio.worker.block.TieredBlockStore.removeBlockInternal(TieredBlockStore.java:835)
	at alluxio.worker.block.TieredBlockStore.removeBlock(TieredBlockStore.java:351)
	at alluxio.worker.block.TieredBlockStore.removeBlock(TieredBlockStore.java:344)
	at alluxio.worker.block.DefaultBlockWorker.removeBlock(DefaultBlockWorker.java:517)
	at alluxio.worker.block.AsyncBlockRemover$BlockRemover.run(AsyncBlockRemover.java:103)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
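
The stack trace shows the remover thread parked inside Semaphore.acquireUninterruptibly. The sketch below illustrates, under assumptions, why such a call may never return: if a lock is built on semaphore permits and a holder never returns its permit, a request for all permits cannot be satisfied. LeakedPermitDemo and MAX_PERMITS are hypothetical names and values, not Alluxio code, and the demo uses a timed tryAcquire so that it terminates instead of hanging.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class LeakedPermitDemo {
  // Hypothetical permit count, chosen only for illustration.
  public static final int MAX_PERMITS = 4;

  /** Tries to take every permit (a "write lock") within the timeout. */
  public static boolean tryWriteLock(Semaphore permits, long timeoutMs)
      throws InterruptedException {
    return permits.tryAcquire(MAX_PERMITS, timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    Semaphore permits = new Semaphore(MAX_PERMITS);

    // A reader takes one permit and never releases it (the suspected leak).
    permits.acquire(1);
    System.out.println("write lock with leaked reader: " + tryWriteLock(permits, 100));

    // Once the permit is returned, the same request succeeds.
    permits.release(1);
    System.out.println("write lock after release: " + tryWriteLock(permits, 100));
  }
}
```

With acquireUninterruptibly in place of the timed call, the first request would park indefinitely, which matches the WAITING (parking) state in the thread dump.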

See the attachments for the full jstack output and worker.log.

To Reproduce: This happened only once and is not easy to reproduce.

  1. Mount HDFS to alluxio.
  2. Run Presto queries and use a lot of worker space.
  3. Run alluxio fs free /. (The free command didn't return any error. After the free, ALL files under Alluxio showed 0% in the output of alluxio fs ls -R /.)

Expected behavior: AsyncBlockRemover makes progress.

Urgency: MEDIUM

Additional context: worker.log, jstack.log

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (12 by maintainers)

Top GitHub Comments

iAlaska commented, Sep 7, 2020

Hello @ZacBlanco, I encountered the same problem on alluxio-2.3.0. All block-removal-service-%d threads are blocked while acquiring the lock. I dumped the heap (see below): SessionLock has 10 more instances than LockRecord, and the size of the block-removal-service pool is 10. Is there some case in which SessionLock instances are not released?

 num     #instances         #bytes  class name
----------------------------------------------
...
  91:          1234          39488  alluxio.worker.block.BlockLockManager$LockRecord
 109:          1244          29856  alluxio.worker.block.ClientRWLock$SessionLock
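
One common way to rule out the kind of unreleased lock asked about above is to pair every acquire with a release in a finally block. The following is a generic sketch of that pattern, not the fix applied in Alluxio; GuardedRemoval and withWriteLock are hypothetical names, and the semaphore stands in for whatever lock guards a block.

```java
import java.util.concurrent.Semaphore;

public class GuardedRemoval {
  /** Runs the action while holding all permits; they are returned even if it throws. */
  public static void withWriteLock(Semaphore permits, int maxPermits, Runnable action) {
    permits.acquireUninterruptibly(maxPermits);
    try {
      action.run();
    } finally {
      // Always give the permits back so that later removals are not blocked.
      permits.release(maxPermits);
    }
  }

  public static void main(String[] args) {
    Semaphore permits = new Semaphore(4);
    try {
      withWriteLock(permits, 4, () -> {
        throw new IllegalStateException("simulated removal failure");
      });
    } catch (IllegalStateException e) {
      // Expected: the simulated failure propagates to the caller.
    }
    System.out.println("available permits: " + permits.availablePermits());
  }
}
```

Even though the action fails, all 4 permits are available afterwards, so a later remover thread would not park on the same lock.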

apc999 commented, Jan 27, 2021
