Successive copies to the same device memory from pinned memory have no effect on AMD ROCm 4.0.0 + MI100
On AMD ROCm 4.0.0 with an MI100 GPU, I observed that successive copies from pinned host memory to the same device memory area have no effect: the second copy leaves the device contents unchanged. CuPy runs into this situation because its memory pool reuses a device memory area once it has been allocated. Some tests fail on MI100 because of this. I'll also report this case to AMD.
$ python test.py
use_pinned: True
[1. 1. 1.]
[1. 1. 1.] # should be filled with 2.0
use_pinned: False
[1. 1. 1.]
[2. 2. 2.]
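To illustrate why a pooling allocator keeps hitting the same device address, here is a toy model in pure Python. It is not CuPy's implementation: the `ToyPool` class, its fake addresses, and the 512-byte rounding unit are all assumptions made for illustration. It only shows the general idea that a freed block is cached by size class and handed back on the next request of that size.

```python
class ToyPool:
    """Toy model of a caching allocator; NOT CuPy's actual implementation."""

    def __init__(self, unit=512):
        self.unit = unit            # hypothetical rounding granularity
        self._next_addr = 0x1000    # fake "device" address counter
        self._free = {}             # rounded size -> list of cached addresses
        self._sizes = {}            # address -> rounded size of live blocks

    def _round(self, nbytes):
        # Round the request up to a multiple of the allocation unit.
        return -(-nbytes // self.unit) * self.unit

    def malloc(self, nbytes):
        size = self._round(nbytes)
        cached = self._free.get(size)
        if cached:
            addr = cached.pop()     # reuse a previously freed block
        else:
            addr = self._next_addr  # otherwise carve out a fresh block
            self._next_addr += size
        self._sizes[addr] = size
        return addr

    def free(self, addr):
        # Return the block to the cache instead of releasing it.
        size = self._sizes.pop(addr)
        self._free.setdefault(size, []).append(addr)


pool = ToyPool()
first = pool.malloc(24)    # 3 doubles, as in the reproducer
pool.free(first)
second = pool.malloc(24)   # same size class, so the cached block is reused
print(first == second)     # True
```

With this kind of reuse, two unrelated arrays created one after the other can end up being copied to the same device address, which is the access pattern described above.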
Environment
$ python -c 'import cupy; cupy.show_config()'
(pending)
Python reproducer
import numpy
import cupy

size = 3
nbytes = size * numpy.dtype('d').itemsize


def f(use_pinned):
    print()
    print(f'use_pinned: {use_pinned}')

    # Allocate the host memory to copy from
    if use_pinned:
        mem = cupy.cuda.pinned_memory.alloc_pinned_memory(nbytes)
        cpu1 = numpy.frombuffer(mem, 'd', size)
    else:
        cpu1 = numpy.zeros((size,), dtype='d')

    # Allocate GPU memory
    gpu = cupy.zeros((size,), dtype='d')

    # Allocate the host memory to copy to
    cpu2 = numpy.zeros((size,), dtype='d')

    # Fill the host memory with 1.0; copy from host; copy to host
    cpu1.fill(1)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)

    # Fill the host memory with 2.0; copy from host; copy to host
    cpu1.fill(2)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)  # FAILS on MI100, should be filled with 2.0


f(use_pinned=True)
f(use_pinned=False)
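The reproducer above relies on reading the printed arrays by eye. Below is a minimal sketch of how the same check could be turned into an automated comparison. The helper `find_stale_copies` and the `FakeDevice` stand-in are hypothetical names introduced here, not part of CuPy or HIP; the fake device only mimics the reported symptom so that the check can be exercised without a GPU.

```python
def find_stale_copies(upload, download, values):
    """Upload each value in turn, read it back, and collect mismatches.

    `upload(v)` and `download()` are hypothetical callables supplied by the
    caller. Returns a list of (expected, observed) pairs.
    """
    stale = []
    for expected in values:
        upload(expected)
        observed = download()
        if observed != expected:
            stale.append((expected, observed))
    return stale


class FakeDevice:
    """Stand-in for a device buffer; drop_repeats=True mimics the symptom."""

    def __init__(self, drop_repeats):
        self.drop_repeats = drop_repeats
        self.value = None
        self.uploads = 0

    def upload(self, v):
        self.uploads += 1
        # A faulty device keeps the first upload and silently ignores the rest.
        if self.uploads == 1 or not self.drop_repeats:
            self.value = v

    def download(self):
        return self.value


healthy = FakeDevice(drop_repeats=False)
faulty = FakeDevice(drop_repeats=True)
print(find_stale_copies(healthy.upload, healthy.download, [1.0, 2.0]))  # []
print(find_stale_copies(faulty.upload, faulty.download, [1.0, 2.0]))    # [(2.0, 1.0)]
```

To run it against a real device, `upload` could fill `cpu1` and call `gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)`, and `download` could call `gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)` and return an element of `cpu2`, reusing the buffers from the reproducer.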
HIP reproducer
// hipcc -o test test.cc
#include <cassert>
#include <cstdlib>
#include <iostream>
#include "hip/hip_runtime.h"

using namespace std;

constexpr size_t size = 3;
constexpr size_t nbytes = size * sizeof(double);

void print(double* mem) {
    for (size_t i = 0; i < size; ++i) {
        if (i > 0) cout << ", ";
        cout << mem[i];
    }
    cout << endl;
}

void f(bool use_pinned)
{
    char *mem_cpu1, *mem_gpu, *mem_cpu2;

    cout << endl;
    cout << "use_pinned: " << boolalpha << use_pinned << endl;

    // Allocate the host memory to copy from
    if (use_pinned) {
        hipHostMalloc((void**)&mem_cpu1, nbytes, hipHostMallocPortable);
    } else {
        mem_cpu1 = (char*)malloc(nbytes);
    }
    assert(mem_cpu1);

    // Allocate GPU memory
    hipMalloc((void**)&mem_gpu, nbytes);
    assert(mem_gpu);

    // Allocate the host memory to copy to
    mem_cpu2 = (char*)malloc(nbytes);
    assert(mem_cpu2);

    // Fill the host memory with 1.0; copy from host; copy to host
    for (size_t i = 0; i < size; ++i)
        ((double*)mem_cpu1)[i] = 1.0;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
    hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
    print((double*)mem_cpu2);

    // Fill the host memory with 2.0; copy from host; copy to host
    for (size_t i = 0; i < size; ++i)
        ((double*)mem_cpu1)[i] = 2.0;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
    hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
    print((double*)mem_cpu2);  // FAILS on MI100, should be filled with 2.0

    free(mem_cpu2);
    hipFree(mem_gpu);
    // Release the source buffer with the deallocator that matches its allocator
    if (use_pinned) {
        hipHostFree(mem_cpu1);
    } else {
        free(mem_cpu1);
    }
}

int main(int argc, char* argv[])
{
    f(true);   // use pinned memory
    f(false);  // don't use pinned memory
    return 0;
}
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 3
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Thanks, @takagi. I suspect this bug has to do with numerous weird failures we saw earlier (ex: https://github.com/cupy/cupy/pull/4653#issuecomment-778724972). Let me report to upstream and get their confirmation for the fix.
😅 Anyway, I also checked it works!