
Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory

See original GitHub issue

Description

I am running a TorchScript model on Triton (tritis). The input is a tensor of shape [batch_size, 3, 640, 640]. Everything works while batch_size <= 13 (input size <= 63,897,600 bytes), but it crashes when batch_size >= 14 (the failing allocation in the log below, 78,643,200 bytes, corresponds to a batch of 16). It looks like Triton cannot allocate that memory on the GPU (Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory), so the input data ends up in pinned system memory while the model weights stay on the GPU, and inference then fails with RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same. But the GPU has 24 GB of memory, and ~70 MB of input data is tiny by comparison. It seems as if there is a 64 MB GPU memory limit for my model. Is that right? Triton log:

I0708 09:15:09.914852 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step START
I0708 09:15:09.914905 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.914928 1 model_repository_manager.cc:564] VersionStates() 'retinaface'
I0708 09:15:09.915108 1 grpc_server.cc:576] Process for ModelMetadata, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.915137 1 grpc_server.cc:549] Ready for RPC 'ModelMetadata', 2
I0708 09:15:09.915150 1 grpc_server.cc:672] Done for ModelMetadata, 1
I0708 09:15:09.915854 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step START
I0708 09:15:09.915886 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:09.916671 1 grpc_server.cc:576] Process for ModelConfig, rpc_ok=1, 1 step COMPLETE
I0708 09:15:09.916726 1 grpc_server.cc:549] Ready for RPC 'ModelConfig', 2
I0708 09:15:09.916740 1 grpc_server.cc:672] Done for ModelConfig, 1
I0708 09:15:10.723230 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0708 09:15:10.726620 1 grpc_server.cc:2623] New request handler for ModelInferHandler, 6
I0708 09:15:10.726638 1 model_repository_manager.cc:608] GetInferenceBackend() 'retinaface' version 2
I0708 09:15:10.726659 1 infer_request.cc:347] add original input: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: []
override inputs:
inputs:
original requested outputs:
requested outputs:

I0708 09:15:10.726691 1 infer_request.cc:480] prepared: [0x0x7fcb54005740] request id: 1, model: retinaface, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 16, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
override inputs:
inputs:
[0x0x7fcb540059c8] input: input__0, type: FP32, original shape: [16,3,640,640], shape: [3,640,640]
original requested outputs:
output__0
output__1
output__2
requested outputs:
output__0
output__1
output__2

I0708 09:15:10.726790 1 libtorch_backend.cc:550] Running retinaface_0_0_gpu0 with 1 requests
W0708 09:15:10.726861 1 memory.cc:135] Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
I0708 09:15:10.726878 1 pinned_memory_manager.cc:130] pinned memory allocation: size 78643200, addr 0x7fcd00000090
I0708 09:15:10.743076 1 libtorch_backend.cc:772] The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_283.py", line 17, in forward
    _7 = getattr(self.model.ctx_modules, "1")
    _8 = getattr(self.model.ctx_modules, "0")
    _9, _10, _11, _12, _13, = (self.model.fpn).forward(input, )
                               ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _14 = (_8).forward(_9, )
    _15 = (_7).forward(_10, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_180.py", line 34, in forward
    _11 = self.lateral6
    _12 = self.pyramid6
    _13, _14, _15, _16, = (self._backbone).forward(input, )
                           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _17 = (_11).forward((_12).forward(_13, ), )
    _18 = (_10).forward(_13, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 14, in forward
    _4 = (self._backbone.conv1).forward(input, )
    _5 = self._backbone.maxpool
    _6 = (self._backbone.relu).forward((_3).forward(_4, ), )
                                        ~~~~~~~~~~~ <--- HERE
    _7 = (self._backbone.layer1).forward((_5).forward(_6, ), )
    _8 = (_2).forward(_7, )
  File "code/__torch__/torch/nn/modules/module/___torch_mangle_151.py", line 12, in forward
    _2 = self._backbone.layer2
    _3 = self._backbone.bn1
    _4 = (self._backbone.conv1).forward(input, )
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _5 = self._backbone.maxpool
    _6 = (self._backbone.relu).forward((_3).forward(_4, ), )
  File "code/__torch__/torch/nn/modules/module.py", line 8, in forward
  def forward(self: __torch__.torch.nn.modules.module.Module,
    input: Tensor) -> Tensor:
    input0 = torch._convolution(input, self.weight, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1, False, False, True)
             ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input0

Traceback of TorchScript, original code (most recent call last):
<...>
RuntimeError: Input type (CPUFloatType) and weight type (CUDAFloatType) should be the same

I0708 09:15:10.743114 1 grpc_server.cc:2762] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0708 09:15:10.743247 1 grpc_server.cc:2408] ModelInferHandler::InferRequestComplete
I0708 09:15:10.743257 1 grpc_server.cc:2630] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0708 09:15:10.743261 1 pinned_memory_manager.cc:157] pinned memory deallocation: addr 0x7fcd00000090

I0708 09:15:10.743278 1 grpc_server.cc:510] Done for ModelInferHandler, 5

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:03:00.0 Off |                    0 |
| N/A   33C    P0    51W / 250W |   2139MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18382      C   tritonserver                                2129MiB |
+-----------------------------------------------------------------------------+

Triton Information

I built triton-inference-server myself from the GitHub tag v2.0.0.

Triton launch command:

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to//model_repository:/models tritonserver tritonserver --model-repository=/models --log-verbose=true

To Reproduce

Framework: PyTorch (TorchScript model.pt)

Model configuration:

name: "retinaface"
platform: "pytorch_libtorch"
max_batch_size: 64
input [
    {
        name: "input__0"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: [ 3, 640, 640 ]
    }
]
output [
    {
        name: "output__0"
        data_type: TYPE_FP32
        dims: [ -1 ]
    },
    {
        name: "output__1"
        data_type: TYPE_FP32
        dims: [ -1, 4 ]
    },
    {
        name: "output__2"
        data_type: TYPE_FP32
        dims: [ -1, 5, 2 ]
    }
]
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
dynamic_batching {
    preferred_batch_size: [ 2, 4, 8, 16, 32, 64 ]
    max_queue_delay_microseconds: 100
}
version_policy: { all { } }

Expected behavior

It should work for any batch size, 1 <= batch_size <= max_batch_size, as long as the inputs fit in GPU memory.
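As a side note on the numbers above: a quick back-of-the-envelope check in Python (a minimal sketch; it assumes FP32 inputs and Triton's documented default --cuda-memory-pool-byte-size of 64 MiB, i.e. 67,108,864 bytes) shows why the failure starts right between batch 13 and batch 14, and why the failing allocation in the log is 78,643,200 bytes for a batch of 16:

# Input is FP32 (4 bytes per element) with shape [batch, 3, 640, 640].
BYTES_PER_FP32 = 4
per_image_bytes = 3 * 640 * 640 * BYTES_PER_FP32   # 4,915,200 bytes per image
default_pool_bytes = 64 * 1024 * 1024               # 67,108,864 bytes (assumed default pool size)

for batch in (13, 14, 16):
    total = batch * per_image_bytes
    verdict = "fits" if total <= default_pool_bytes else "does NOT fit"
    print(f"batch {batch:2d}: {total:>10,} bytes -> {verdict} in a 64 MiB pool")

# Output:
# batch 13:  63,897,600 bytes -> fits in a 64 MiB pool
# batch 14:  68,812,800 bytes -> does NOT fit in a 64 MiB pool
# batch 16:  78,643,200 bytes -> does NOT fit in a 64 MiB pool

The 64 MiB figure lines up exactly with the threshold observed in the description, which supports the maintainer's answer below about growing the CUDA memory pool.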

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
deadeyegoodwin commented, Jul 8, 2020

It is a command-line flag for the tritonserver executable. See “tritonserver --help”.
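For example (an illustrative command, assuming a standard shell and grep are available inside the container), the relevant option can be located with:

tritonserver --help 2>&1 | grep -i cuda-memory-pool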

1 reaction
deadeyegoodwin commented, Jul 8, 2020

There is a CUDA memory pool from which Triton allocates CUDA memory when needed. You can increase the size of that pool using the --cuda-memory-pool-byte-size flag.
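For example, the launch command from this issue could be extended like this (the 268,435,456-byte / 256 MiB value is only an illustration; check tritonserver --help on your build for the exact argument form, and pick a size that covers your largest expected request, e.g. the 78,643,200-byte batch-16 input seen in the log):

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to//model_repository:/models tritonserver tritonserver --model-repository=/models --log-verbose=true --cuda-memory-pool-byte-size=268435456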

Read more comments on GitHub >

