
AssertionError in cupy.cuda.memory._Chunk.split

See original GitHub issue
  • Conditions
CuPy Version          : 5.2.0
CUDA Root             : /usr/local/cuda-10.0
CUDA Build Version    : 10000
CUDA Driver Version   : 10000
CUDA Runtime Version  : 10000
cuDNN Build Version   : 7301
cuDNN Version         : 7301
NCCL Build Version    : 2307
  • Code to reproduce: Sorry, I currently can't provide any code to reproduce the issue. The AssertionError happens only after a very complex loop has been running stably for 4 hours, and it appears with different tracebacks: sometimes it's tensordot, sometimes matrix division, and so on. But they all have the following frames in common:
   File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
   File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
   File "cupy/cuda/memory.pyx", line 915, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
   File "cupy/cuda/memory.pyx", line 667, in cupy.cuda.memory._get_chunk
   File "cupy/cuda/memory.pyx", line 223, in cupy.cuda.memory._Chunk.split

I don't expect the developers to find any meaningful clue from such limited information (I'd be happy if they could!). I'm merely asking for help: could this be related to some known issue? Can anyone tell me how to reproduce the error reliably? What are the possible causes?

Feel free to close this issue if the developers feel it isn't necessary to deal with such an uncertain bug here. I'm sorry for bothering you.

Thank you all in advance.

EDIT: just FYI, I reproduced the error on a different configuration:

CuPy Version          : 5.3.0
CUDA Root             : /opt/cuda8
CUDA Build Version    : 8000
CUDA Driver Version   : 8000
CUDA Runtime Version  : 8000
cuDNN Build Version   : 7102
cuDNN Version         : 7003
NCCL Build Version    : 2213
NCCL Runtime Version  : (unknown)

EDIT2: a similar issue is #317

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
liwt31 commented, Mar 9, 2019

I have found out what went wrong. In certain functions that modify arena, a_index and a_flag, unexpected recursion can happen if the Python GC is triggered, leading to undefined behavior. For example:
https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L700-L719
When the set() there is created, the Python GC may be triggered and call the destructor of Memory:
https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L80-L82
which in turn calls free and _append_to_free_list. Below is a typical traceback:

(gdb) bt
#0  0x00007f1b3b1fc957 in __pyx_f_4cupy_4cuda_6memory__append_to_free_list (__pyx_v_arena=__pyx_v_arena@entry=0x7f1b36ba3a88, __pyx_v_a_index=__pyx_v_a_index@entry=0x559d6f9417d8, 
    __pyx_v_a_flag=__pyx_v_a_flag@entry=0x559d6f911238, __pyx_v_chunk=__pyx_v_chunk@entry=0x7f1b15dee528) at cupy/cuda/memory.cpp:15413
#1  0x00007f1b3b1fdc0b in __pyx_f_4cupy_4cuda_6memory_22SingleDeviceMemoryPool_free (__pyx_v_self=__pyx_v_self@entry=0x7f1b36c0b2b8, __pyx_v_ptr=__pyx_v_ptr@entry=30141793792, __pyx_v_size=__pyx_v_size@entry=37376, 
    __pyx_skip_dispatch=__pyx_skip_dispatch@entry=0) at cupy/cuda/memory.cpp:19349
#2  0x00007f1b3b1ff6f3 in __pyx_f_4cupy_4cuda_6memory_12PooledMemory_free (__pyx_v_self=__pyx_v_self@entry=0x7f1b13b64880, __pyx_skip_dispatch=__pyx_skip_dispatch@entry=0) at cupy/cuda/memory.cpp:12025
#3  0x00007f1b3b20100a in __pyx_pf_4cupy_4cuda_6memory_12PooledMemory_4__dealloc__ (__pyx_v_self=0x7f1b13b64880) at cupy/cuda/memory.cpp:12482
#4  __pyx_pw_4cupy_4cuda_6memory_12PooledMemory_5__dealloc__ (__pyx_v_self=0x7f1b13b64880) at cupy/cuda/memory.cpp:12435
#5  __pyx_tp_dealloc_4cupy_4cuda_6memory_PooledMemory (o=0x7f1b13b64880) at cupy/cuda/memory.cpp:31215
#6  0x00007f1b3b1ccaf7 in __pyx_tp_dealloc_4cupy_4cuda_6memory_MemoryPointer (o=0x7f1b300bd2d0) at cupy/cuda/memory.cpp:30252
#7  0x00007f1b58d561b7 in __pyx_tp_dealloc_4cupy_4core_4core_ndarray (o=0x7f1b13baf0d8) at cupy/core/core.cpp:43849
#8  0x0000559d6e2cdcaa in dict_dealloc (mp=0x7f1b13d99ca8) at /tmp/build/80754af9/python_1546130271559/work/Objects/dictobject.c:2017
#9  subtype_clear (self=<optimized out>) at /tmp/build/80754af9/python_1546130271559/work/Objects/typeobject.c:1067
#10 delete_garbage (old=<optimized out>, collectable=<optimized out>) at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:864
#11 collect () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1016
#12 0x0000559d6e34237d in collect_with_callback () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1128
#13 0x0000559d6e342430 in collect_generations () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1151
#14 0x0000559d6e29f3a4 in _PyObject_GC_Alloc (basicsize=basicsize@entry=139754271587664, use_calloc=0) at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1729
#15 _PyObject_GC_Malloc () at /tmp/build/80754af9/python_1546130271559/work/Modules/gcmodule.c:1739
#16 0x0000559d6e2f42c5 in PyType_GenericAlloc.localalias.299 () at /tmp/build/80754af9/python_1546130271559/work/Objects/typeobject.c:936
#17 0x0000559d6e2bad68 in make_new_set (iterable=0x0, type=0x559d6e45bfe0 <PySet_Type>) at /tmp/build/80754af9/python_1546130271559/work/Objects/setobject.c:1046
#18 PySet_New () at /tmp/build/80754af9/python_1546130271559/work/Objects/setobject.c:2280
#19 0x00007f1b3b1fcc46 in __pyx_f_4cupy_4cuda_6memory__append_to_free_list (__pyx_v_arena=__pyx_v_arena@entry=0x7f1b36ba3a88, __pyx_v_a_index=__pyx_v_a_index@entry=0x559d6f9417d8, 
    __pyx_v_a_flag=__pyx_v_a_flag@entry=0x559d6f911238, __pyx_v_chunk=__pyx_v_chunk@entry=0x7f1b3011a948) at cupy/cuda/memory.cpp:15697
[ ... ]

The result is that the index computed in _append_to_free_list becomes outdated and the chunk is inserted at the wrong position, leading to the AssertionError.
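
To illustrate the hazard in isolation, here is a toy pure-Python sketch. It is not CuPy's actual code: the names PooledMemory and append_to_free_list only mirror the real ones, and an explicit gc.collect() stands in for an allocation that happens to trigger a collection.

    import gc
    from bisect import bisect_left

    # Toy "free list": a list of chunk sizes that must stay sorted.
    free_list = [16, 64, 256]

    def append_to_free_list(size):
        # 1) Compute the insertion index while the list is in a known state.
        index = bisect_left(free_list, size)
        # 2) The real code creates a set() at this point; that allocation can
        #    trigger a garbage collection. gc.collect() stands in for that here.
        gc.collect()
        # 3) Insert at the index computed in step 1. If the collection ran a
        #    destructor that also modified free_list, the index is stale and the
        #    list ends up unsorted, the analogue of the failing assertion.
        free_list.insert(index, size)

    class PooledMemory:
        """Toy stand-in for a pooled allocation whose destructor returns the
        chunk to the free list."""
        def __init__(self, size):
            self.size = size
            self._cycle = self   # reference cycle: only the cyclic GC frees it
        def __del__(self):
            append_to_free_list(self.size)

    # Leave an unreachable cycle behind; it will be collected *inside*
    # append_to_free_list(), i.e. while another insertion is half finished.
    PooledMemory(32)

    append_to_free_list(128)
    print(free_list)                        # [16, 32, 128, 64, 256], unsorted
    assert free_list == sorted(free_list)   # fails, mirroring the AssertionError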

My suggestion is to call gc.disable and gc.enable when entering and leaving these functions, much like acquiring and releasing a lock; a sketch of the idea follows below. I'm happy to prepare a PR for this.
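
For illustration only, here is a minimal sketch of such a guard written as a context manager. This is not the code of the eventual fix, just the gc.disable()/gc.enable() pattern described above; in Cython it would more likely be a plain pair of calls around the critical section, but the idea is the same.

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_disabled():
        """Keep the cyclic GC off for the duration of a critical section,
        restoring its previous state afterwards (like releasing a lock)."""
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()

    # Hypothetical usage around a section that mutates arena, a_index and
    # a_flag and must not be re-entered from a GC-triggered destructor:
    #
    #   with gc_disabled():
    #       ...compute the insertion index, create temporary Python objects,
    #       and update the free list...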

0 reactions
kmaehashi commented, Mar 26, 2019

Fixed via #2103. Thanks @liwt31!

