ROCm handles pinned memory incorrectly for a specific access pattern
See the original GitHub issue. Related: #4923.
We are facing some weird test failures with ROCm 4.0.1 (on MI50), as described in the ROCm limitations. Among them, I’ve confirmed that the failures of __getitem__, ix_, broadcast, and einsum go away if I disable CuPy’s pinned memory pool.
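For reference, disabling the pinned memory pool from Python can be sketched as below. This is only a sketch assuming a reasonably recent CuPy; the import guard is just so the snippet is harmless on a machine without CuPy installed.

```python
# Sketch: disable CuPy's pinned memory pool so host<->device copies fall back
# to pageable host memory. Guarded import: a no-op where CuPy is unavailable.
try:
    import cupy
except ImportError:
    cupy = None

if cupy is not None:
    # Passing None removes the pinned-memory allocator, disabling the pool.
    cupy.cuda.set_pinned_memory_allocator(None)
```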
I tried to recreate a similar situation in plain HIP and successfully reproduced the problem with the specific access pattern below (ROCm 4.0.1, MI50/MI100):
source
// hipcc -o test test.cc
#include <cassert>
#include <cstdlib>
#include <iostream>

#include "hip/hip_runtime.h"

using namespace std;

constexpr size_t size = 12;
constexpr size_t nbytes = size * sizeof(double);

void f(bool use_pinned, unsigned int coherence_flag = 0)
{
    char *mem_cpu1, *mem_gpu, *mem_cpu2;

    cout << endl;
    cout << "use_pinned: " << boolalpha << use_pinned << endl;
    if (use_pinned && coherence_flag) {
        cout << "coherent: " << boolalpha << (coherence_flag == hipHostMallocCoherent) << endl;
    }

    // Allocate the host memory to copy from
    if (use_pinned) {
        hipHostMalloc((void**)&mem_cpu1, nbytes,
                      hipHostMallocPortable | hipHostMallocMapped | coherence_flag);
    } else {
        mem_cpu1 = (char*)malloc(nbytes);
    }
    assert(mem_cpu1);

    // Allocate GPU memory
    hipMalloc((void**)&mem_gpu, nbytes);
    assert(mem_gpu);

    // Allocate the host memory to copy to (holds a single element)
    mem_cpu2 = (char*)malloc(sizeof(double));
    assert(mem_cpu2);

    for (size_t n = 0; n < size; ++n) {
        // Fill the memory with doubles; copy to device; copy the n-th element back
        for (size_t i = 0; i < size; ++i)
            ((double*)mem_cpu1)[i] = i + 1;
        hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(double), hipMemcpyDefault);
        hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(double)), sizeof(double), hipMemcpyDefault);
        cout << ((double*)mem_cpu2)[0] << " == ";

        // Fill the memory with floats; copy to device; copy the n-th element back
        for (size_t i = 0; i < size; ++i)
            ((float*)mem_cpu1)[i] = i + 1;
        hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(float), hipMemcpyDefault);
        hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(float)), sizeof(float), hipMemcpyDefault);
        cout << ((float*)mem_cpu2)[0] << endl;
    }

    free(mem_cpu2);
    hipFree(mem_gpu);
    if (use_pinned)
        hipHostFree(mem_cpu1);
    else
        free(mem_cpu1);
}

int main(int argc, char* argv[])
{
    f(true);   // uses pinned memory
    //f(true, hipHostMallocCoherent);
    //f(true, hipHostMallocNonCoherent);
    f(false);  // doesn't use pinned memory
    return 0;
}
result
use_pinned: true
1 == 1
2 == 1.875
3 == 0
4 == 2
5 == 0
6 == 2.125
7 == 0
8 == 2.25
9 == 0
10 == 2.3125
11 == 0
12 == 2.375
use_pinned: false
1 == 1
2 == 2
3 == 3
4 == 4
5 == 5
6 == 6
7 == 7
8 == 8
9 == 9
10 == 10
11 == 11
12 == 12
I also checked the host-memory coherency options (hipHostMallocCoherent / hipHostMallocNonCoherent), but they made no difference.
With ROCm 4.1.1, I’ve observed that MI50 works fine with the same code even when pinned memory is used, but MI100 still produces the same wrong result.
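A CuPy-level probe for the same mixed-dtype staging pattern might look like the sketch below. It is hypothetical and untested on ROCm: it round-trips a float64 array and then a float32 array through the device and compares single elements, mirroring the HIP reproducer; on an affected stack the comparisons should fail, and the import guard makes it degrade gracefully where CuPy is not installed.

```python
# Hypothetical probe for the bug described above, mirroring the HIP reproducer:
# fill a host array, copy it to the device, then copy single elements back and
# compare, first with doubles and then with floats.
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

def probe(n=12):
    if cp is None:
        return None  # CuPy/ROCm not available; nothing to check
    ok = True
    for dtype in (np.float64, np.float32):
        host = np.arange(1, n + 1, dtype=dtype)
        dev = cp.asarray(host)  # host-to-device copy (may stage via pinned memory)
        for i in range(n):
            # device-to-host copy of the i-th element, as in the HIP reproducer
            ok = ok and float(dev[i].get()) == float(host[i])
    return ok

print(probe())
```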
Issue Analytics
- Created: 2 years ago
- Reactions: 2
- Comments: 8 (8 by maintainers)
Top GitHub Comments
Thanks for the reproducer. I tried it out; however, I could not replicate the issue on MI100 with ROCm 4.0.1, 4.1.0, or 4.1.1.
I will check internally with the team.
FYI, https://github.com/kmaehashi/cupy-rocm-ci-report/commits/gh-pages is now configured to use MI100 (gfx908). Jenkins can only cover MI50 / single-GPU unit tests, so I keep running this repo to cover MI100 / multi-GPU tests.