WorkerOutOfSpaceException when setting max_readahead=0 with Alluxio JNI Fuse
Alluxio Version: https://github.com/cheyang/alluxio/tree/branch-2.3-fuse-pod-for-non-root
Describe the bug
I started Alluxio with Alluxio JNI Fuse on a 4-node k8s cluster, using the following fuse options:
2020-09-29 05:46:31,172 INFO AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/alluxio-mnt/default/imagenet/alluxio-fuse", OPTIONS="[-obig_writes, -okernel_cache, -oro, -omax_read=131072, -omax_readahead=0, -oattr_timeout=7200, -oentry_timeout=7200, -ononempty, -oallow_other, -omax_write=131072]"
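For readers who want to inspect these options programmatically, here is a minimal sketch that parses the bracketed option list from the log line above into a dictionary. The helper name `parse_fuse_options` is hypothetical and not part of Alluxio; the option string is copied verbatim from the log.

```python
# Option string copied verbatim from the AlluxioFuse mount log line.
OPTIONS = (
    "[-obig_writes, -okernel_cache, -oro, -omax_read=131072, "
    "-omax_readahead=0, -oattr_timeout=7200, -oentry_timeout=7200, "
    "-ononempty, -oallow_other, -omax_write=131072]"
)


def parse_fuse_options(raw: str) -> dict:
    """Hypothetical helper: map '-okey=value' tokens to a dict.

    Bare flags such as '-oro' map to True; numeric values become ints.
    """
    opts = {}
    for token in raw.strip("[]").split(","):
        token = token.strip()
        if not token.startswith("-o"):
            continue
        key, sep, value = token[2:].partition("=")
        if not sep:
            opts[key] = True
        elif value.isdigit():
            opts[key] = int(value)
        else:
            opts[key] = value
    return opts


opts = parse_fuse_options(OPTIONS)
print(opts["max_readahead"])      # the option this issue is about
print(opts["max_read"] // 1024)   # max_read expressed in KiB
```

With the values from the log, `max_readahead` is 0 and `max_read` works out to 128 KiB.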
When I ran my ResNet50 job, which reads the ImageNet dataset through Alluxio JNI Fuse, I got the following exceptions:
In the alluxio-fuse pod:
....
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
2020-09-29 05:55:32,450 INFO AlluxioJniFuseFileSystem - read: cached 0 bytes, missed 0 bytes, ratio 0.0
2020-09-29 05:55:34,264 WARN AlluxioFileInStream - Failed to close input stream for block 3875536896: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
2020-09-29 05:55:35,241 WARN AlluxioJniFuseFileSystem - read(path=/imagenet/imagenet/train/train-00835-of-01024,buf=java.nio.DirectByteBuffer[pos=4096 lim=4096 cap=4096],size=4096,offset=0) returned 4096 in 112442 ms (>=1000 ms)
2020-09-29 05:55:36,297 WARN AlluxioJniFuseFileSystem - read(path=/imagenet/imagenet/train/train-00368-of-01024,buf=java.nio.DirectByteBuffer[pos=4096 lim=4096 cap=4096],size=4096,offset=16777216) returned 4096 in 3940 ms (>=1000 ms)
2020-09-29 05:55:40,438 ERROR ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:349)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:529)
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:387)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.lang.Thread.run(Thread.java:748)
2020-09-29 05:55:40,439 ERROR ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:349)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:529)
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.lang.Thread.run(Thread.java:748)
2020-09-29 05:55:41,777 WARN AlluxioFileInStream - Failed to close input stream for block 6979321856: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
2020-09-29 05:55:41,779 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/imagenet/train/train-00981-of-01024,4096,0:
alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
at alluxio.client.block.stream.GrpcBlockingStream.receive(GrpcBlockingStream.java:159)
at alluxio.client.block.stream.GrpcDataMessageBlockingStream.receiveDataMessage(GrpcDataMessageBlockingStream.java:84)
at alluxio.client.block.stream.GrpcDataReader.readChunk(GrpcDataReader.java:129)
at alluxio.client.block.stream.BlockInStream.readChunk(BlockInStream.java:392)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:268)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:350)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:322)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:276)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:322)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-09-29 05:55:41,784 WARN AlluxioJniFuseFileSystem - read(path=/imagenet/imagenet/train/train-00981-of-01024,buf=java.nio.DirectByteBuffer[pos=0 lim=4096 cap=4096],size=4096,offset=0) returned -5 in 120068 ms (>=1000 ms)
2020-09-29 05:55:41,786 INFO AlluxioJniFuseFileSystem - release(fd=43,entries=240)
2020-09-29 05:55:41,799 WARN AlluxioFileInStream - Failed to close input stream for block 9193914368: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
2020-09-29 05:55:41,800 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/imagenet/train/train-00572-of-01024,4096,0:
alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
at alluxio.client.block.stream.GrpcBlockingStream.receive(GrpcBlockingStream.java:159)
at alluxio.client.block.stream.GrpcDataMessageBlockingStream.receiveDataMessage(GrpcDataMessageBlockingStream.java:84)
at alluxio.client.block.stream.GrpcDataReader.readChunk(GrpcDataReader.java:129)
at alluxio.client.block.stream.BlockInStream.readChunk(BlockInStream.java:392)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:268)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:350)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:322)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:276)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:322)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-09-29 05:55:41,800 WARN AlluxioFileInStream - Failed to close input stream for block 13404995584: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
2020-09-29 05:55:41,800 WARN AlluxioJniFuseFileSystem - read(path=/imagenet/imagenet/train/train-00572-of-01024,buf=java.nio.DirectByteBuffer[pos=0 lim=4096 cap=4096],size=4096,offset=0) returned -5 in 120067 ms (>=1000 ms)
2020-09-29 05:55:41,800 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/imagenet/train/train-00840-of-01024,4096,0:
alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 30000ms. (Zero Copy GrpcDataReader)
at alluxio.client.block.stream.GrpcBlockingStream.receive(GrpcBlockingStream.java:159)
at alluxio.client.block.stream.GrpcDataMessageBlockingStream.receiveDataMessage(GrpcDataMessageBlockingStream.java:84)
at alluxio.client.block.stream.GrpcDataReader.readChunk(GrpcDataReader.java:129)
at alluxio.client.block.stream.BlockInStream.readChunk(BlockInStream.java:392)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:268)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:350)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:322)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:276)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:322)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-09-29 05:55:41,800 WARN AlluxioJniFuseFileSystem - read(path=/imagenet/imagenet/train/train-00840-of-01024,buf=java.nio.DirectByteBuffer[pos=0 lim=4096 cap=4096],size=4096,offset=0) returned -5 in 120073 ms (>=1000 ms)
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
2020-09-29 05:55:41,838 INFO AlluxioJniFuseFileSystem - read: cached 0 bytes, missed 0 bytes, ratio 0.0
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
fuse: max_idle_threads: 10
2020-09-29 05:55:41,909 INFO AlluxioJniFuseFileSystem - read: cached 0 bytes, missed 0 bytes, ratio 0.0
....
In the alluxio-worker pod, I got many WorkerOutOfSpaceException errors:
2020-09-29 05:53:57,114 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,123 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,124 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 16055795712, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00035-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 16055795712
2020-09-29 05:53:57,124 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,127 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,130 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 14831058944, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00843-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 14831058944
2020-09-29 05:53:57,131 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,131 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,131 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 8975810560, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00074-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 8975810560
2020-09-29 05:53:57,131 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,131 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,132 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 4378853376, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00323-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 4378853376
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,132 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 8959033344, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00674-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 8959033344
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,132 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 10301210624, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00828-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 10301210624
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,132 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,133 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 6694109184, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00469-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 6694109184
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,133 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 6509559808, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00324-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 6509559808
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,133 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 4546625536, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00530-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 4546625536
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,133 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,133 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 1610612736, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00958-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 1610612736
2020-09-29 05:53:57,134 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,134 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,134 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 2113929216, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00321-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 2113929216
2020-09-29 05:53:57,134 ERROR TieredBlockStore - Failed to free space. Min contiguous requested: 335544320, Min available requested: 335544320, Blocks iterated: 0, Blocks removed: 0, Space freed: 0
2020-09-29 05:53:57,134 ERROR TieredBlockStore - Allocation failure. Options: AllocateOptions{Location=BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>}, Size=335544320, ForceLocation=false, EvictionAllowed=true, UseReservedSpace=false}. Error: alluxio.exception.WorkerOutOfSpaceException: Failed to free 335544320 bytes space at location
2020-09-29 05:53:57,134 WARN UnderFileSystemBlockReader - Failed to update block writer for UFS block [blockId: 3472883712, ufsPath: /underFSStorage/imagenet/imagenet/train/train-00659-of-01024, offset: 0]: Failed to allocate 335544320 bytes on BlockStoreLocation{TierAlias=MEM, DirIndex=<Any>, MediumType=<Any>} to create blockId 3472883712
However, when I check the tiered store capacity, it is far from full:
$ alluxio fsadmin report capacity
Worker Name      Last Heartbeat   Storage       MEM
192.168.1.15     0                capacity      50.00GB
                                  used          10.07GB (20%)
192.168.1.17     0                capacity      50.00GB
                                  used          6.25GB (12%)
192.168.1.11     0                capacity      50.00GB
                                  used          9.98GB (19%)
192.168.1.16     0                capacity      50.00GB
                                  used          4896.00MB (9%)
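To make the mismatch concrete, the following sketch compares the allocation size from the worker log (335544320 bytes) with the free space implied by the capacity report. It assumes the report's GB and MB are binary units (2^30 and 2^20 bytes); the numbers are copied from the output above and nothing here calls into Alluxio.

```python
GB = 1024 ** 3  # assumption: the report uses binary units
MB = 1024 ** 2

# Size of each failed allocation, taken from the TieredBlockStore log lines.
REQUESTED = 335544320

# (capacity, used) per worker, taken from `alluxio fsadmin report capacity`.
workers = {
    "192.168.1.15": (50.00 * GB, 10.07 * GB),
    "192.168.1.17": (50.00 * GB, 6.25 * GB),
    "192.168.1.11": (50.00 * GB, 9.98 * GB),
    "192.168.1.16": (50.00 * GB, 4896.00 * MB),
}

print(f"requested per block: {REQUESTED // MB} MB")
for host, (capacity, used) in workers.items():
    free = capacity - used
    # How many blocks of the requested size the reported free space could hold.
    fits = int(free // REQUESTED)
    print(f"{host}: free {free / GB:.2f} GB, room for {fits} such blocks")
```

Each failed request is 320 MB, while every worker reports roughly 40 GB or more of free MEM, so the reported usage alone does not explain the WorkerOutOfSpaceException.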
All the logs are here: diagnose_fluid_1601359323.tar.gz
To Reproduce
Set up a UFS with a large dataset such as ImageNet, mount it into Alluxio, and set up Alluxio JNI Fuse with the options mentioned above. This may reproduce the problem.
Expected behavior
Everything works fine and no WorkerOutOfSpaceException is thrown.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:7 (4 by maintainers)
Top GitHub Comments
@apc999 Yes, in our case it is.
The commit 2c41226 fixed this problem. I’ll close this issue. @apc999 @LuQQiu Thanks for your help!