AssertionError in cupy.cuda.memory._Chunk.split
- Conditions
CuPy Version : 5.2.0
CUDA Root : /usr/local/cuda-10.0
CUDA Build Version : 10000
CUDA Driver Version : 10000
CUDA Runtime Version : 10000
cuDNN Build Version : 7301
cuDNN Version : 7301
NCCL Build Version : 2307
- Code to reproduce: Sorry, currently I can't provide any code to reproduce the issue. The AssertionError happens only after a very complex loop has run stably for 4 hours, and it occurs with different tracebacks: sometimes it's tensordot, sometimes matrix division, and so on. But they all have the following traceback in common:
File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 915, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 667, in cupy.cuda.memory._get_chunk
File "cupy/cuda/memory.pyx", line 223, in cupy.cuda.memory._Chunk.split
I don't expect the developers can find any meaningful clue from this limited information (I'd be happy if they can!). I'm merely asking for help: could this be related to some known issue? Can anyone provide information on how to reproduce the error stably? What are the possible causes?
Feel free to close this issue if the developers feel it isn't worth pursuing such an uncertain bug here. I'm sorry for bothering you.
Thank you all in advance.
EDIT: just FYI, I reproduced the error on a different configuration:
CuPy Version : 5.3.0
CUDA Root : /opt/cuda8
CUDA Build Version : 8000
CUDA Driver Version : 8000
CUDA Runtime Version : 8000
cuDNN Build Version : 7102
cuDNN Version : 7003
NCCL Build Version : 2213
NCCL Runtime Version : (unknown)
EDIT2: a similar issue is #317
Issue Analytics
- Created: 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I found out what went wrong now. In certain functions that modify `arena`, `a_index`, and `a_flag`, unexpected recursion can happen if the Python GC is triggered, leading to undefined behavior. For example: https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L700-L719
When creating the `set()`, the Python GC could be triggered and call the destructor of `Memory`: https://github.com/cupy/cupy/blob/7590091c552f9a8471e01902f82f1737c58a8cc3/cupy/cuda/memory.pyx#L80-L82
which in turn calls `free` and `_append_to_free_list`. Below is a typical traceback: (traceback omitted). The result is that the `index` in `_append_to_free_list` becomes outdated and the `chunk` is inserted at the wrong position, leading to the `AssertionError`.
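The reentrancy described here can be reproduced with a few lines of plain Python. This is a self-contained model with made-up names (`Node`, `shared`), not CuPy code; `gc.collect()` stands in for the collection that an allocation such as `set()` can trigger:

```python
import gc

# A destructor run by the cyclic GC mutates shared state while ordinary
# code is holding an index into that same state.
shared = []

class Node:
    """Lives in a reference cycle; its finalizer appends to `shared`,
    much like a Memory destructor returning a chunk to a free list."""
    def __del__(self):
        shared.append("from __del__")

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # cycle: only the cyclic GC can reclaim these

make_cycle()
idx = len(shared)        # snapshot meaning "append at the end" (0 here)
gc.collect()             # stands in for an allocation that triggers GC;
                         # both finalizers run and grow `shared`
shared.insert(idx, "x")  # idx is now stale: "x" lands before the
                         # finalizer entries instead of at the end
```

The index computed before the collection no longer describes the list's contents afterwards, which is exactly how the free-list position in `_append_to_free_list` can go stale.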
My suggestion is to use `gc.disable` and `gc.enable` when entering and leaving these functions, just as one would use locks. I'm happy to prepare a PR on this.

Fixed via #2103. Thanks @liwt31!
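The `gc.disable`/`gc.enable` guard suggested above can be sketched as a context manager. This is a minimal illustration of the idea only; the actual fix in #2103 may be implemented differently:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    # Hypothetical guard in the spirit of the suggestion: pause the cyclic
    # GC so no destructor can interleave with the critical section, then
    # restore the previous state even if an exception escapes.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Usage sketch: any index computed inside the block stays valid, because
# no GC-run destructor can mutate the structure mid-update.
free_list = [1, 3, 5]
with gc_paused():
    free_list.insert(1, 2)
```

Restoring the prior state (rather than unconditionally calling `gc.enable`) keeps nested uses of the guard safe.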