
Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory

See original GitHub issue

Description

I am running a TorchScript model on Triton (tritis). The input is a tensor of shape [batch_size, 3, 640, 640]. Everything works while batch_size <= 13 (input size <= 63,897,600 bytes), but it crashes when batch_size >= 14 (the failing allocation in the log below, 78,643,200 bytes, corresponds to a batch of 16). It looks like Triton cannot allocate that memory on the GPU (Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory), so the input data ends up in pinned system memory while the model weights stay on the GPU, and inference then fails with RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same. But the GPU has 24 GB of memory, and ~70 MB of input data is tiny by comparison. It seems as if there is a 64 MB GPU memory limit for my model. Is that right? Triton log:

I0708 09:15:09.914852 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step START
I0708 09:15:09.914905 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.914928 1 model_repository_manager.cc:564] VersionStates() 'retinaface'
I0708 09:15:09.915108 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.915137 1 grpc_server.cc:549] Ready for RPC 'ModelMetadata', 2
I0708 09:15:09.915150 1 grpc_server.cc:672] Done for ModelMetadata, 1
I0708 09:15:09.915854 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step START
I0708 09:15:09.915886 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.916671 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.916726 1 grpc_server.cc:549] Ready for RPC 'ModelConfig', 2
I0708 09:15:09.916740 1 grpc_server.cc:672] Done for ModelConfig, 1
I0708 09:15:10.723230 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0708 09:15:10.726620 1 grpc_server.cc:2623] New request handler for ModelInferHandler, 6
I0708 09:15:10.726638 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:10.726659 1 infer_request.cc:347] add original input: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: []
override inputs:
inputs:
original requested outputs:
requested outputs:

I0708 09:15:10.726691 1 infer_request.cc:480] prepared: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 16, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
override inputs:
inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
original requested outputs:
output__0
output__1
output__2
requested outputs:
output__0
output__1
output__2

I0708 09:15:10.726790 1 libtorch_backend.cc:550] Running retinaface_0_0_gpu0 with 1 requests
W0708 09:15:10.726861 1 memory.cc:135] Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
I0708 09:15:10.726878 1 pinned_memory_manager.cc:130] pinned memory allocation: size 78643200, addr 0x7fcd00000090
I0708 09:15:10.743076 1 libtorch_backend.cc:772] The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_283.py", line 17, in forward
    _7 = getattr(self.model.ctx_modules, "1")
    _8 = getattr(self.model.ctx_modules, "0")
    _9, _10, _11, _12, _13, = (self.model.fpn).forward(input, )
                               ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _14 = (_8).forward(_9, )
    _15 = (_7).forward(_10, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_180.py", line 34, in forward
    _11 = self.lateral6
    _12 = self.pyramid6
    _13, _14, _15, _16, = (self._backbone).forward(input, )
                           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _17 = (_11).forward((_12).forward(_13, ), )
    _18 = (_10).forward(_13, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 14, in forward
    _4 = (self._backbone.conv1).forward(input, )
    _5 = self._backbone.maxpool
    _6 = (self._backbone.relu).forward((_3).forward(_4, ), )
                                        ~~~~~~~~~~~ <--- HERE
    _7 = (self._backbone.layer1).forward((_5).forward(_6, ), )
    _8 = (_2).forward(_7, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 12, in forward
    _2 = self._backbone.layer2
    _3 = self._backbone.bn1
    _4 = (self._backbone.conv1).forward(input, )
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _5 = self._backbone.maxpool
    _6 = (self._backbone.relu).forward((_3).forward(_4, ), )
  File "code/__torch__/torch/nn/modules/module.py", line 8, in forward
  def forward(self: __torch__.torch.nn.modules.module.Module,
    input: Tensor) -> Tensor:
    input0 = torch._convolution(input, self.weight, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1, False, False, True)
             ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input0

Traceback of TorchScript, original code (most recent call last):
<...>
RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same

I0708 09:15:10.743114 1 grpc_server.cc:2762] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0708 09:15:10.743247 1 grpc_server.cc:2408] ModelInferHandler::InferRequestComplete
I0708 09:15:10.743257 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0708 09:15:10.743261 1 pinned_memory_manager.cc:157] pinned memory deallocation: addr 0x7fcd00000090

I0708 09:15:10.743278 1 grpc_server.cc:510] Done for ModelInferHandler, 5

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:03:00.0 Off |                    0 |
| N/A   33C    P0    51W / 250W |   2139MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18382      C   tritonserver                                2129MiB |
+-----------------------------------------------------------------------------+

Triton Information

I built triton-inference-server myself from the GitHub tag v2.0.0.

Triton launch command:

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to//model_repository:/models tritonserver tritonserver --model-repository=/models --log-verbose=true

To Reproduce

Framework: PyTorch (TorchScript model.pt)

Model configuration:

name: "retinaface"
platform: "pytorch_libtorch"
max_batch_size: 64
input [
    {
        name: "input__0"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: [ 3, 640, 640 ]
    }
]
output [
    {
        name: "output__0"
        data_type: TYPE_FP32
        dims: [ -1 ]
    },
    {
        name: "output__1"
        data_type: TYPE_FP32
        dims: [ -1, 4 ]
    },
    {
        name: "output__2"
        data_type: TYPE_FP32
        dims: [ -1, 5, 2 ]
    }
]
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
dynamic_batching {
    preferred_batch_size: [ 2, 4, 8, 16, 32, 64 ]
    max_queue_delay_microseconds: 100
}
version_policy: { all { } }

Expected behavior

It should work for any batch size, 1 <= batch_size <= max_batch_size, as long as the inputs fit in GPU memory.
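As a side note on the numbers above: a quick back-of-the-envelope check in Python (a minimal sketch; it assumes FP32 inputs and Triton's documented default --cuda-memory-pool-byte-size of 64 MiB, i.e. 67,108,864 bytes) shows why the failure starts right between batch 13 and batch 14, and why the failing allocation in the log is 78,643,200 bytes for a batch of 16:

# Input is FP32 (4 bytes per element) with shape [batch, 3, 640, 640].
BYTES_PER_FP32 = 4
per_image_bytes = 3 * 640 * 640 * BYTES_PER_FP32   # 4,915,200 bytes per image
default_pool_bytes = 64 * 1024 * 1024               # 67,108,864 bytes (assumed default pool size)

for batch in (13, 14, 16):
    total = batch * per_image_bytes
    verdict = "fits" if total <= default_pool_bytes else "does NOT fit"
    print(f"batch {batch:2d}: {total:>10,} bytes -> {verdict} in a 64 MiB pool")

# Output:
# batch 13:  63,897,600 bytes -> fits in a 64 MiB pool
# batch 14:  68,812,800 bytes -> does NOT fit in a 64 MiB pool
# batch 16:  78,643,200 bytes -> does NOT fit in a 64 MiB pool

The 64 MiB figure lines up exactly with the threshold observed in the description, which supports the maintainer's answer below about growing the CUDA memory pool.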

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
deadeyegoodwin commented, Jul 8, 2020

It is a command-line flag for the tritonserver executable. See “tritonserver --help”.
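For example (an illustrative command, assuming a standard shell and grep are available inside the container), the relevant option can be located with:

tritonserver --help 2>&1 | grep -i cuda-memory-pool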

1 reaction
deadeyegoodwin commented, Jul 8, 2020

There is a CUDA memory pool from which Triton allocates CUDA memory when needed. You can increase the size of that pool using the --cuda-memory-pool-byte-size flag.
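For example, the launch command from this issue could be extended like this (the 268,435,456-byte / 256 MiB value is only an illustration; check tritonserver --help on your build for the exact argument form, and pick a size that covers your largest expected request, e.g. the 78,643,200-byte batch-16 input seen in the log):

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to//model_repository:/models tritonserver tritonserver --model-repository=/models --log-verbose=true --cuda-memory-pool-byte-size=268435456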

Read more comments on GitHub >

