
Excessive malloc time on Nvidia Ampere Arch


We recently upgraded our servers from V100s to A100s and encountered a lengthy initialization issue in cupy.cuda.runtime.malloc. Narrowing down the problem, we were able to reproduce it with:

docker run -it --rm cupy/cupy:v8.0.0 python3 -c 'import time, cupy; start=time.time(); ptr=cupy.cuda.runtime.malloc(1); end=time.time(); print(end-start); cupy.cuda.runtime.free(ptr)'

This took 67 seconds to malloc a single byte. However, successive mallocs within the same process took only a fraction of a second:

docker run -it --rm cupy/cupy:v8.0.0 python3 -c 'import time, cupy; start=time.time(); ptr=cupy.cuda.runtime.malloc(1); end=time.time(); print(end-start); cupy.cuda.runtime.free(ptr); start=time.time(); ptr=cupy.cuda.runtime.malloc(1); end=time.time(); print(end-start); cupy.cuda.runtime.free(ptr)'

This prints 67 seconds followed by 0.15 seconds. We put the V100s back into the same server, ran the same Docker command, and it consistently printed 0.15 s. We have also tried allocating different amounts, but whether it is 1 byte or multiple gigabytes, the first allocation always takes roughly 67 seconds. Most of our processes do not run that long, so the initial delay doubles or triples our runtimes.
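To help localize where the 67 seconds goes, here is a minimal sketch (not from the original report) that forces CUDA context creation first and then times the allocation separately; cudaFree(0), exposed as cupy.cuda.runtime.free(0), is a common idiom for initializing the context without allocating anything:

docker run -it --rm cupy/cupy:v8.0.0 python3 -c 'import time, cupy; start=time.time(); cupy.cuda.runtime.free(0); print("context init:", time.time()-start); start=time.time(); ptr=cupy.cuda.runtime.malloc(1); print("malloc:", time.time()-start); cupy.cuda.runtime.free(ptr)'

If the first number carries the delay and the second stays in the millisecond range, the cost is in device/context initialization rather than in the allocation itself.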

We contacted NVIDIA, and they were able to reproduce the same lengthy cupy.cuda.runtime.malloc on Ampere (A100 and RTX A6000), with no such delay on Volta or Turing (they tested CuPy on A100, RTX A6000, GV100, and T4).

We plan on additional testing when time permits, but we were wondering whether this is already known to CuPy (a quick search for Ampere and/or malloc delay returned nothing), and whether there are additional commands or configurations we can try to help debug this problem once we have the hardware set up again.
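One low-effort diagnostic (a sketch, not something requested in the thread) is to print the CUDA runtime and driver versions plus the device compute capability from inside the same container, to check whether the toolkit shipped in the image predates the GPU (A100 is compute capability 8.0):

docker run -it --rm cupy/cupy:v8.0.0 python3 -c 'import cupy; print("runtime:", cupy.cuda.runtime.runtimeGetVersion()); print("driver:", cupy.cuda.runtime.driverGetVersion()); print("compute capability:", cupy.cuda.Device(0).compute_capability)'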

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

1 reaction
kmaehashi commented, Feb 2, 2021

> I thought A100 is only supported since CUDA 11.0?

Right, but it still runs; it is just that the initialization takes a very long time.

1 reaction
kmaehashi commented, Feb 2, 2021

Thanks for reporting! Currently, the CuPy image on Docker Hub uses CUDA Toolkit 10.2 (Dockerfile), which does not support A100. We had a similar issue even in a bare-metal environment with CUDA 10.x + A100. We will consider upgrading the image to use CUDA 11 or later for the v9 releases.
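If the slow first call is the driver JIT-compiling PTX for the new sm_80 architecture (a plausible explanation for a CUDA 10.2 toolkit running on an A100, though not confirmed in this thread), persisting the driver's JIT cache between container runs may amortize the one-time cost. A hedged sketch using the documented CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE environment variables and a hypothetical host directory /var/cache/cuda-jit:

docker run -it --rm -e CUDA_CACHE_PATH=/cache -e CUDA_CACHE_MAXSIZE=2147483648 -v /var/cache/cuda-jit:/cache cupy/cupy:v8.0.0 python3 -c 'import time, cupy; start=time.time(); ptr=cupy.cuda.runtime.malloc(1); print(time.time()-start); cupy.cuda.runtime.free(ptr)'

The first run still pays the full cost, but later containers that mount the same cache directory should start quickly. Upgrading to a CUDA 11 based image, as suggested above, avoids the issue entirely.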


Top Results From Across the Web

  • CUDA C++ Programming Guide - NVIDIA Documentation Center: CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel...
  • Using the NVIDIA CUDA Stream-Ordered Memory Allocator...: This post introduces new API functions that enable memory allocation and deallocation to be stream-ordered operations.
  • CUDA runtime call after driver API call, excessive overhead: This is similar to the behavior of malloc() in the C/C++ runtime library: when it runs out of memory it needs to go...
  • NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog: This post gives you a look inside the new A100 GPU, and describes important new features of NVIDIA Ampere architecture GPUs.
  • Why does cudaMallocHost take so much time compared to...: Something is seriously wrong with either your tests or your CUDA system if that is the case. My testing with 64-bit Linux...
