ROCm returns wrong results with pinned memory for a specific access pattern

See original GitHub issue

Related: #4923.

We are facing some weird test failures with ROCm 4.0.1 (on MI50), as described in the ROCm limitations. Among them, I've confirmed that the failures of __getitem__, ix_, broadcast, and einsum go away if I disable CuPy's pinned memory pool (a sketch of how to do that follows).
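For reference, here is a minimal sketch of disabling the pool from Python, using the documented CuPy pinned-memory allocator hook; passing None makes every pinned allocation go straight to the raw runtime calls instead of the pool:

import cupy

# Disable CuPy's pinned memory pool; pinned host memory is then
# allocated and freed directly by the low-level runtime instead of
# being cached and reused by the pool.
cupy.cuda.set_pinned_memory_allocator(None)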

I tried to recreate a similar situation in plain HIP and successfully reproduced the problem with the following specific access pattern (ROCm 4.0.1, on both MI50 and MI100):

Source:
// hipcc -o test test.cc

#include <cassert>
#include <cstdlib>
#include <iostream>
#include "hip/hip_runtime.h"

constexpr size_t size = 12;
constexpr size_t nbytes = size * sizeof(double);

void f(bool use_pinned, unsigned int coherence_flag = 0)
{
  char *mem_cpu1, *mem_gpu, *mem_cpu2;

  std::cout << std::endl;
  std::cout << "use_pinned: " << std::boolalpha << use_pinned << std::endl;
  if (use_pinned && coherence_flag) {
    std::cout << "coherent: " << std::boolalpha
              << (coherence_flag == hipHostMallocCoherent) << std::endl;
  }

  // Allocate the host source buffer, either pinned or pageable
  if (use_pinned) {
    hipHostMalloc((void**)&mem_cpu1, nbytes,
                  hipHostMallocPortable | hipHostMallocMapped | coherence_flag);
  } else {
    mem_cpu1 = (char*)malloc(nbytes);
  }
  assert(mem_cpu1);

  // Allocate GPU memory
  hipMalloc((void**)&mem_gpu, nbytes);
  assert(mem_gpu);

  // Allocate a small host buffer to copy single elements back into
  mem_cpu2 = (char*)malloc(sizeof(double));
  assert(mem_cpu2);

  for (size_t n = 0; n < size; ++n) {
    // Fill the buffer with doubles; copy to device; read the n-th element back
    for (size_t i = 0; i < size; ++i)
      ((double*)mem_cpu1)[i] = i + 1;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(double), hipMemcpyDefault);
    hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(double)), sizeof(double), hipMemcpyDefault);
    std::cout << ((double*)mem_cpu2)[0] << " == ";

    // Refill the same buffer with floats; copy to device; read the n-th element back
    for (size_t i = 0; i < size; ++i)
      ((float*)mem_cpu1)[i] = i + 1;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(float), hipMemcpyDefault);
    hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(float)), sizeof(float), hipMemcpyDefault);
    std::cout << ((float*)mem_cpu2)[0] << std::endl;
  }

  free(mem_cpu2);
  hipFree(mem_gpu);
  if (use_pinned)
    hipHostFree(mem_cpu1);
  else
    free(mem_cpu1);
}

int main(int argc, char* argv[])
{
  f(true);   // uses pinned memory
  //f(true, hipHostMallocCoherent);
  //f(true, hipHostMallocNonCoherent);
  f(false);  // uses pageable memory
  return 0;
}
Result:
use_pinned: true
1 == 1
2 == 1.875
3 == 0
4 == 2
5 == 0
6 == 2.125
7 == 0
8 == 2.25
9 == 0
10 == 2.3125
11 == 0
12 == 2.375

use_pinned: false
1 == 1
2 == 2
3 == 3
4 == 4
5 == 5
6 == 6
7 == 7
8 == 8
9 == 9
10 == 10
11 == 11
12 == 12

Note that, in the pinned case, from the second iteration onward the float values read back are exactly the stale double-filled bytes reinterpreted as float (e.g. 1.875 is bytes 4-7 of the little-endian double 1.0, and the zeros are the low halves of small doubles), as if the float host-to-device copies from the pinned buffer stopped reaching the device. I also tried the coherency options for the host memory (hipHostMallocCoherent / hipHostMallocNonCoherent), but they made no difference.

With ROCm 4.1.1, MI50 works fine with the same code even when pinned memory is used, but MI100 still produces the same wrong result.
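As a rough CuPy-side analogue of the reuse pattern above, the sketch below follows the pinned-memory helper from the CuPy documentation (alloc_pinned_memory plus numpy.frombuffer): it reuses one pinned host buffer first as float64 and then as float32. Whether it trips the same bug depends on how the allocator hands the buffer back, so treat it as an illustration rather than a confirmed reproducer:

import numpy as np
import cupy as cp

nbytes = 12 * np.dtype(np.float64).itemsize
pinned = cp.cuda.alloc_pinned_memory(nbytes)  # pinned host buffer

# Reuse the same pinned buffer, first as float64 and then as float32,
# copying to the device and reading the elements back each time.
for dtype in (np.float64, np.float32):
    host = np.frombuffer(pinned, dtype, 12)
    host[:] = np.arange(1, 13, dtype=dtype)
    dev = cp.empty(12, dtype)
    dev.set(host)  # host-to-device copy from the pinned buffer
    print(dtype.__name__, cp.asnumpy(dev))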

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

4 reactions
amathews-amd commented, Apr 23, 2021

Thanks for the reproducer. I tried it out; however, I could not replicate the issue on MI100 with ROCm 4.0.1, 4.1.0, or 4.1.1.

I will check internally with the team.

2 reactions
kmaehashi commented, Jul 15, 2021

FYI, https://github.com/kmaehashi/cupy-rocm-ci-report/commits/gh-pages is now configured to use MI100 (gfx908). Jenkins can only cover MI50 / single-GPU unit tests, so I keep running that repo to cover MI100 / multi-GPU tests.

Read more comments on GitHub.

