Java index out of bounds exception when running many requests through server
Context
Trying to load test a TorchServe model to gauge the performance of a custom handler.
- torchserve version: 0.2.0
- torch version: 1.6.0
- java version: openjdk 11.0.8
- Operating System and version: Debian via the python 3.7-buster image.
Your Environment
- Are you planning to deploy it using docker container? [yes/no]: yes
- Is it a CPU or GPU environment?: CPU
- Using a default/custom handler? custom
- What kind of model is it e.g. vision, text, audio?: feed forward for custom input.
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? from model store
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: number_of_netty_threads=32
Expected Behavior
Expected TorchServe not to throw this error, or to learn which properties of the environment I could change to address it. It only seems to happen under medium load.
Current Behavior
With a load of ~5 rps, and with varying batch sizes, CPU counts, and memory allocations, the server throws an error on ~4%+ of requests.
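A rough Java sketch of the kind of traffic involved (assuming the default inference port and a hypothetical model name my_model; this is not the actual test harness):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Rough sketch of ~5 requests per second against the TorchServe inference API.
// Endpoint, model name, and payload are placeholders, not the real load test.
public class LoadTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/predictions/my_model"))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"input\": [1.0, 2.0, 3.0]}"))
                .build();

        for (int i = 0; i < 300; i++) {                     // ~1 minute of traffic
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(r -> {
                      if (r.statusCode() != 200) {
                          System.err.println("Failed request: " + r.statusCode());
                      }
                  });
            Thread.sleep(200);                              // ~5 requests per second
        }
        Thread.sleep(5_000);                                // let in-flight requests finish
    }
}
```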
Failure Logs [if any]
2020-10-17 00:16:41,887 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9002 Worker disconnected. WORKER_MODEL_LOADED
2020-10-17 00:16:41,887 [ERROR] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - Unknown exception
io.netty.handler.codec.DecoderException: java.lang.IndexOutOfBoundsException: readerIndex(1021) + length(4) exceeds writerIndex(1024): PooledUnsafeDirectByteBuf(ridx: 1021, widx: 1024, cap: 1024)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:471)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:404)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:371)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IndexOutOfBoundsException: readerIndex(1021) + length(4) exceeds writerIndex(1024): PooledUnsafeDirectByteBuf(ridx: 1021, widx: 1024, cap: 1024)
    at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1477)
    at io.netty.buffer.AbstractByteBuf.readInt(AbstractByteBuf.java:810)
    at org.pytorch.serve.util.codec.ModelResponseDecoder.decode(ModelResponseDecoder.java:56)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
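The readInt() at ModelResponseDecoder.decode tries to read a 4-byte length field, but only 3 bytes (1024 - 1021) are readable, i.e. the buffer ends mid-frame. As a point of comparison only, not the actual TorchServe code, a minimal sketch of the standard ByteToMessageDecoder pattern that tolerates partial buffers looks like this:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;

import java.util.List;

// Minimal sketch of a length-prefixed decoder that tolerates fragmentation.
// Names are illustrative; this is not the actual ModelResponseDecoder.
public class LengthPrefixedDecoder extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        // Not enough bytes for the 4-byte length prefix yet: wait for the next read.
        if (in.readableBytes() < 4) {
            return;
        }
        in.markReaderIndex();
        int length = in.readInt();
        // Body not fully buffered yet: rewind and wait instead of reading past writerIndex.
        if (in.readableBytes() < length) {
            in.resetReaderIndex();
            return;
        }
        out.add(in.readRetainedSlice(length));
    }
}
```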
Thank you in advance for any help you can provide!

Comments
@harshbafna I was debugging this further and observed the following: the Python backend sends the complete response for all the batched requests, but when the frontend server receives it, it is fragmented. For example, in the scenario below, for a total response size of 500777, the message decoder gets the fragments
I suspect the issue is caused by incorrect decoding of these fragments. What are your thoughts on this? Shouldn't the reassembly of these fragments be done at a lower level, and only then be decoded at the application level?
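If the goal is to reassemble at a lower level, Netty ships a frame-level handler for exactly this situation: LengthFieldBasedFrameDecoder buffers fragments into whole frames before the application decoder runs. A minimal sketch, assuming a 4-byte big-endian length prefix (which may not match TorchServe's actual backend wire protocol):

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.LengthFieldBasedFrameDecoder;

// Sketch only: reassemble fragmented TCP reads into whole frames before the
// application-level decoder runs. Assumes a 4-byte big-endian length prefix,
// which may not match TorchServe's real backend protocol.
public class FrameAwareInitializer extends ChannelInitializer<SocketChannel> {
    private static final int MAX_FRAME_BYTES = 64 * 1024 * 1024;

    @Override
    protected void initChannel(SocketChannel ch) {
        ChannelPipeline p = ch.pipeline();
        // lengthFieldOffset = 0, lengthFieldLength = 4:
        // buffers bytes until a complete length-prefixed frame has arrived.
        p.addLast(new LengthFieldBasedFrameDecoder(MAX_FRAME_BYTES, 0, 4));
        // Application decoder (here the hypothetical one sketched earlier)
        // now always sees whole frames.
        p.addLast(new LengthPrefixedDecoder());
    }
}
```

With initialBytesToStrip left at its default of 0, the downstream decoder still sees the length prefix, so its own readInt() remains valid.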
@harshbafna
Input to test:
This should return the following response:
config.properties:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store