
AssertionError in cupy.cuda.memory._Chunk.split

See original GitHub issue
  • Conditions
CuPy Version          : 5.2.0
CUDA Root             : /usr/local/cuda-10.0
CUDA Build Version    : 10000
CUDA Driver Version   : 10000
CUDA Runtime Version  : 10000
cuDNN Build Version   : 7301
cuDNN Version         : 7301
NCCL Build Version    : 2307
  • Code to reproduce: Sorry, I currently can't provide any code to reproduce the issue. The AssertionError happens only after a very complex loop has been running stably for 4 hours, and it appears with different tracebacks: sometimes it's tensordot, sometimes matrix division, and so on. But they all have the following frames in common:
   File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
   File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 915, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
   File "cupy/cuda/memory.pyx", line 667, in cupy.cuda.memory._get_chunk
   File "cupy/cuda/memory.pyx", line 223, in cupy.cuda.memory._Chunk.split

I don't expect the developers to find any meaningful clue from such limited information (I'd be happy if they could!). I'm merely asking for help: could this be related to some known issue? Can anyone tell me how to reproduce the error reliably? What are the possible causes?

Feel free to close this issue if the developers feel it isn't necessary to deal with such an uncertain bug here. I'm sorry for bothering you.

Thank you all in advance.

EDIT: just FYI, I reproduced the error on a different configuration:

CuPy Version          : 5.3.0
CUDA Root             : /opt/cuda8
CUDA Build Version    : 8000
CUDA Driver Version   : 8000
CUDA Runtime Version  : 8000
cuDNN Build Version   : 7102
cuDNN Version         : 7003
NCCL Build Version    : 2213
NCCL Runtime Version  : (unknown)

EDIT2: a similar issue is #317

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
liwt31 commented, Mar 9, 2019

I have found out what went wrong. In certain functions that modify arena, a_index and a_flag, unexpected recursion can happen if the Python GC is triggered, leading to undefined behavior. For example:
https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L700-L719
When the set() there is created, the Python GC may be triggered and call the destructor of Memory:
https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L80-L82
which in turn calls free and _append_to_free_list. Below is a typical traceback:

(gdb) bt
#0  0x00007f1b3b1fc957 in __pyx_f_4cupy_4cuda_6memory__append_to_free_list (__pyx_v_arena=__pyx_v_arena@entry=0x7f1b36ba3a88, __pyx_v_a_index=__pyx_v_a_index@entry=0x559d6f9417d8, 
    __pyx_v_a_flag=__pyx_v_a_flag@entry=0x559d6f911238, __pyx_v_chunk=__pyx_v_chunk@entry=0x7f1b15dee528) at cupy/cuda/memory.cpp:15413
#1  0x00007f1b3b1fdc0b in __pyx_f_4cupy_4cuda_6memory_22SingleDeviceMemoryPool_free (__pyx_v_self=__pyx_v_self@entry=0x7f1b36c0b2b8, __pyx_v_ptr=__pyx_v_ptr@entry=30141793792, __pyx_v_size=__pyx_v_size@entry=37376, 
    __pyx_skip_dispatch=__pyx_skip_dispatch@entry=0) at cupy/cuda/memory.cpp:19349
#2  0x00007f1b3b1ff6f3 in __pyx_f_4cupy_4cuda_6memory_12PooledMemory_free (__pyx_v_self=__pyx_v_self@entry=0x7f1b13b64880, __pyx_skip_dispatch=__pyx_skip_dispatch@entry=0) at cupy/cuda/memory.cpp:12025
#3  0x00007f1b3b20100a in __pyx_pf_4cupy_4cuda_6memory_12PooledMemory_4__dealloc__ (__pyx_v_self=0x7f1b13b64880) at cupy/cuda/memory.cpp:12482
#4  __pyx_pw_4cupy_4cuda_6memory_12PooledMemory_5__dealloc__ (__pyx_v_self=0x7f1b13b64880) at cupy/cuda/memory.cpp:12435
#5  __pyx_tp_dealloc_4cupy_4cuda_6memory_PooledMemory (o=0x7f1b13b64880) at cupy/cuda/memory.cpp:31215
#6  0x00007f1b3b1ccaf7 in __pyx_tp_dealloc_4cupy_4cuda_6memory_MemoryPointer (o=0x7f1b300bd2d0) at cupy/cuda/memory.cpp:30252
#7  0x00007f1b58d561b7 in __pyx_tp_dealloc_4cupy_4core_4core_ndarray (o=0x7f1b13baf0d8) at cupy/core/core.cpp:43849
#8  0x0000559d6e2cdcaa in dict_dealloc (mp=0x7f1b13d99ca8) at /tmp/build/80754af9/python_1546130271559/work/Objects/dictobject.c:2017
#9  subtype_clear (self=<optimized out>) at /tmp/build/80754af9/python_1546130271559/work/Objects/typeobject.c:1067
#10 delete_garbage (old=<optimized out>, collectable=<optimized out>) at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:864
#11 collect () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1016
#12 0x0000559d6e34237d in collect_with_callback () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1128
#13 0x0000559d6e342430 in collect_generations () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1151
#14 0x0000559d6e29f3a4 in _PyObject_GC_Alloc (basicsize=basicsize@entry=139754271587664, use_calloc=0) at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1729
#15 _PyObject_GC_Malloc () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1739
#16 0x0000559d6e2f42c5 in PyType_GenericAlloc.localalias.299 () at /tmp/build/80754af9/python_1546130271559/work/Objects/typeobject.c:936
#17 0x0000559d6e2bad68 in make_new_set (iterable=0x0, type=0x559d6e45bfe0 <PySet_Type>) at /tmp/build/80754af9/python_1546130271559/work/Objects/setobject.c:1046
#18 PySet_New () at /tmp/build/80754af9/python_1546130271559/work/Objects/setobject.c:2280
#19 0x00007f1b3b1fcc46 in __pyx_f_4cupy_4cuda_6memory__append_to_free_list (__pyx_v_arena=__pyx_v_arena@entry=0x7f1b36ba3a88, __pyx_v_a_index=__pyx_v_a_index@entry=0x559d6f9417d8, 
    __pyx_v_a_flag=__pyx_v_a_flag@entry=0x559d6f911238, __pyx_v_chunk=__pyx_v_chunk@entry=0x7f1b3011a948) at cupy/cuda/memory.cpp:15697
[ ... ]

The result is that the index computed in _append_to_free_list becomes outdated and the chunk is inserted at the wrong position, leading to the AssertionError.
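
To illustrate the hazard in isolation, here is a toy pure-Python sketch. It is not CuPy's actual code: the names PooledMemory and append_to_free_list only mirror the real ones, and an explicit gc.collect() stands in for an allocation that happens to trigger a collection.

    import gc
    from bisect import bisect_left

    # Toy "free list": a list of chunk sizes that must stay sorted.
    free_list = [16, 64, 256]

    def append_to_free_list(size):
        # 1) Compute the insertion index while the list is in a known state.
        index = bisect_left(free_list, size)
        # 2) The real code creates a set() at this point; that allocation can
        #    trigger a garbage collection. gc.collect() stands in for that here.
        gc.collect()
        # 3) Insert at the index computed in step 1. If the collection ran a
        #    destructor that also modified free_list, the index is stale and the
        #    list ends up unsorted, the analogue of the failing assertion.
        free_list.insert(index, size)

    class PooledMemory:
        """Toy stand-in for a pooled allocation whose destructor returns the
        chunk to the free list."""
        def __init__(self, size):
            self.size = size
            self._cycle = self   # reference cycle: only the cyclic GC frees it
        def __del__(self):
            append_to_free_list(self.size)

    # Leave an unreachable cycle behind; it will be collected *inside*
    # append_to_free_list(), i.e. while another insertion is half finished.
    PooledMemory(32)

    append_to_free_list(128)
    print(free_list)                        # [16, 32, 128, 64, 256], unsorted
    assert free_list == sorted(free_list)   # fails, mirroring the AssertionError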

My suggestion is to call gc.disable and gc.enable when entering and leaving these functions, much like acquiring and releasing a lock; a sketch of the idea follows below. I'm happy to prepare a PR for this.
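
For illustration only, here is a minimal sketch of such a guard written as a context manager. This is not the code of the eventual fix, just the gc.disable()/gc.enable() pattern described above; in Cython it would more likely be a plain pair of calls around the critical section, but the idea is the same.

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_disabled():
        """Keep the cyclic GC off for the duration of a critical section,
        restoring its previous state afterwards (like releasing a lock)."""
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()

    # Hypothetical usage around a section that mutates arena, a_index and
    # a_flag and must not be re-entered from a GC-triggered destructor:
    #
    #   with gc_disabled():
    #       ...compute the insertion index, create temporary Python objects,
    #       and update the free list...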

0 reactions
kmaehashi commented, Mar 26, 2019

Fixed via #2103. Thanks @liwt31!

