Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
Description
I am running a TorchScript model on tritis. The input is a tensor of shape [batch_size, 3, 640, 640].
Everything works while batch_size <= 13 (input size <= 63,897,600 bytes), but it crashes when batch_size >= 14 (input size >= 68,812,800 bytes; the log below shows a failed 78,643,200-byte allocation for a batch of 16).
It looks like tritis cannot allocate memory for the input on the GPU:
Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
so it stages the data in pinned system RAM instead. But the model weights are on the GPU, so inference fails with
RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same
The GPU has 24 GB of memory, and ~70 MB of input data is tiny by comparison. It looks as if there is a 64 MB GPU memory limit for my model. Is there?
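For reference, here is the arithmetic behind those thresholds, as a minimal sketch (the 64 MiB pool size is an assumption based on Triton's documented default for --cuda-memory-pool-byte-size):

# Where the 64 MiB boundary falls for a [batch, 3, 640, 640] FP32 input.
# Assumption: Triton's default CUDA memory pool is 64 MiB (67,108,864 bytes).
BYTES_PER_IMAGE = 3 * 640 * 640 * 4  # one FP32 NCHW image: 4,915,200 bytes
POOL_DEFAULT = 64 * 1024 * 1024      # 67,108,864 bytes

for batch in (13, 14, 16):
    size = batch * BYTES_PER_IMAGE
    verdict = "fits" if size <= POOL_DEFAULT else "exceeds pool"
    print(f"batch {batch:2d}: {size:>10,} bytes -> {verdict}")

# batch 13: 63,897,600 bytes -> fits
# batch 14: 68,812,800 bytes -> exceeds pool
# batch 16: 78,643,200 bytes -> exceeds pool (the 78643200 from the log)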
tritis log:
I0708 09:15:09.914852 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step START
I0708 09:15:09.914905 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.914928 1 model_repository_manager.cc:564] VersionStates() 'retinaface'
I0708 09:15:09.915108 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.915137 1 grpc_server.cc:549] Ready for RPC 'ModelMetadata', 2
I0708 09:15:09.915150 1 grpc_server.cc:672] Done for ModelMetadata, 1
I0708 09:15:09.915854 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step START
I0708 09:15:09.915886 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.916671 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.916726 1 grpc_server.cc:549] Ready for RPC 'ModelConfig', 2
I0708 09:15:09.916740 1 grpc_server.cc:672] Done for ModelConfig, 1
I0708 09:15:10.723230 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0708 09:15:10.726620 1 grpc_server.cc:2623] New request handler for ModelInferHandler, 6
I0708 09:15:10.726638 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:10.726659 1 infer_request.cc:347] add original input: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: []
override inputs:
inputs:
original requested outputs:
requested outputs:
I0708 09:15:10.726691 1 infer_request.cc:480] prepared: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 16, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
override inputs:
inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
original requested outputs:
output__0
output__1
output__2
requested outputs:
output__0
output__1
output__2
I0708 09:15:10.726790 1 libtorch_backend.cc:550] Running retinaface_0_0_gpu0 with 1 requests
W0708 09:15:10.726861 1 memory.cc:135] Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
I0708 09:15:10.726878 1 pinned_memory_manager.cc:130] pinned memory allocation: size 78643200, addr 0x7fcd00000090
I0708 09:15:10.743076 1 libtorch_backend.cc:772] The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch/nn/modules/module/___torch_mangle_283.py", line 17, in forward
_7 = getattr(self.model.ctx_modules, "1")
_8 = getattr(self.model.ctx_modules, "0")
_9, _10, _11, _12, _13, = (self.model.fpn).forward(input, )
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_14 = (_8).forward(_9, )
_15 = (_7).forward(_10, )
File "code/__torch__/torch/nn/modules/module/___torch_mangle_180.py", line 34, in forward
_11 = self.lateral6
_12 = self.pyramid6
_13, _14, _15, _16, = (self._backbone).forward(input, )
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_17 = (_11).forward((_12).forward(_13, ), )
_18 = (_10).forward(_13, )
File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 14, in forward
_4 = (self._backbone.conv1).forward(input, )
_5 = self._backbone.maxpool
_6 = (self._backbone.relu).forward((_3).forward(_4, ), )
~~~~~~~~~~~ <--- HERE
_7 = (self._backbone.layer1).forward((_5).forward(_6, ), )
_8 = (_2).forward(_7, )
File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 12, in forward
_2 = self._backbone.layer2
_3 = self._backbone.bn1
_4 = (self._backbone.conv1).forward(input, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_5 = self._backbone.maxpool
_6 = (self._backbone.relu).forward((_3).forward(_4, ), )
File "code/__torch__/torch/nn/modules/module.py", line 8, in forward
def forward(self: __torch__.torch.nn.modules.module.Module,
input: Tensor) -> Tensor:
input0 = torch._convolution(input, self.weight, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1, False, False, True)
~~~~~~~~~~~~~~~~~~ <--- HERE
return input0
Traceback of TorchScript, original code (most recent call last):
<...>
RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same
I0708 09:15:10.743114 1 grpc_server.cc:2762] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0708 09:15:10.743247 1 grpc_server.cc:2408] ModelInferHandler::InferRequestComplete
I0708 09:15:10.743257 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0708 09:15:10.743261 1 pinned_memory_manager.cc:157] pinned memory deallocation: addr 0x7fcd00000090
I0708 09:15:10.743278 1 grpc_server.cc:510] Done for ModelInferHandler, 5
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:03:00.0 Off | 0 |
| N/A 33C P0 51W / 250W | 2139MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18382 C tritonserver 2129MiB |
+-----------------------------------------------------------------------------+
Triton Information: I built triton-inference-server myself from the GitHub tag v2.0.0.
tritis run:
nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to/model_repository:/models tritonserver tritonserver --model-repository=/models --log-verbose=true
To Reproduce
Framework: PyTorch (TorchScript model.pt)
Model configuration:
name: "retinaface"
platform: "pytorch_libtorch"
max_batch_size: 64
input [
{
name: "input__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 640, 640 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, 4 ]
},
{
name: "output__2"
data_type: TYPE_FP32
dims: [ -1, 5, 2 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [ 2, 4, 8, 16, 32, 64 ]
max_queue_delay_microseconds: 100
}
version_policy: { all { } }
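A minimal Python client to send the failing request, as a sketch (assumes the tritonclient gRPC package, pip install tritonclient[grpc]; on older releases the module was named tritongrpcclient):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

batch_size = 16  # any batch_size >= 14 triggers the pinned-memory fallback
data = np.random.rand(batch_size, 3, 640, 640).astype(np.float32)

# Input/output names match the model configuration above.
inputs = [grpcclient.InferInput("input__0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [grpcclient.InferRequestedOutput(name)
           for name in ("output__0", "output__1", "output__2")]

result = client.infer(model_name="retinaface", model_version="2",
                      inputs=inputs, outputs=outputs)
print(result.as_numpy("output__0").shape)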
Expected behavior
It should work for any batch_size from 1 up to whatever fits in GPU memory.
Top GitHub Comments
There is a CUDA memory pool from which Triton allocates CUDA memory when needed. You can increase the size of that pool using the --cuda-memory-pool-byte-size flag.
It is a command-line flag for the tritonserver executable; see "tritonserver --help".
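For example, starting the server with a 512 MiB pool on GPU 0 (the flag takes a <GPU id>:<bytes> pair; confirm the exact syntax with tritonserver --help on your build):

tritonserver --model-repository=/models --log-verbose=true --cuda-memory-pool-byte-size=0:536870912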