ROCm returns wrong results with pinned memory for a specific access pattern

See original GitHub issue

Related: #4923.

We are facing some weird test failures with ROCm 4.0.1 (on MI50), as described in the ROCm limitations. Among them, I've confirmed that the failures of __getitem__, ix_, broadcast, and einsum go away if I disable CuPy's pinned memory pool (a sketch of how to do that follows).
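For reference, here is a minimal sketch of disabling the pool from Python, using the documented CuPy pinned-memory allocator hook; passing None makes every pinned allocation go straight to the raw runtime calls instead of the pool:

import cupy

# Disable CuPy's pinned memory pool; pinned host memory is then
# allocated and freed directly by the low-level runtime instead of
# being cached and reused by the pool.
cupy.cuda.set_pinned_memory_allocator(None)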

I tried to recreate a similar situation in plain HIP and successfully reproduced the problem with the following specific access pattern (ROCm 4.0.1, on both MI50 and MI100):

Source:
// hipcc -o test test.cc

#include <cassert>
#include <cstdlib>
#include <iostream>
#include "hip/hip_runtime.h"

constexpr size_t size = 12;
constexpr size_t nbytes = size * sizeof(double);

void f(bool use_pinned, unsigned int coherence_flag = 0)
{
  char *mem_cpu1, *mem_gpu, *mem_cpu2;

  std::cout << std::endl;
  std::cout << "use_pinned: " << std::boolalpha << use_pinned << std::endl;
  if (use_pinned && coherence_flag) {
    std::cout << "coherent: " << std::boolalpha
              << (coherence_flag == hipHostMallocCoherent) << std::endl;
  }

  // Allocate the host source buffer, either pinned or pageable
  if (use_pinned) {
    hipHostMalloc((void**)&mem_cpu1, nbytes,
                  hipHostMallocPortable | hipHostMallocMapped | coherence_flag);
  } else {
    mem_cpu1 = (char*)malloc(nbytes);
  }
  assert(mem_cpu1);

  // Allocate GPU memory
  hipMalloc((void**)&mem_gpu, nbytes);
  assert(mem_gpu);

  // Allocate a small host buffer to copy single elements back into
  mem_cpu2 = (char*)malloc(sizeof(double));
  assert(mem_cpu2);

  for (size_t n = 0; n < size; ++n) {
    // Fill the buffer with doubles; copy to device; read the n-th element back
    for (size_t i = 0; i < size; ++i)
      ((double*)mem_cpu1)[i] = i + 1;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(double), hipMemcpyDefault);
    hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(double)), sizeof(double), hipMemcpyDefault);
    std::cout << ((double*)mem_cpu2)[0] << " == ";

    // Refill the same buffer with floats; copy to device; read the n-th element back
    for (size_t i = 0; i < size; ++i)
      ((float*)mem_cpu1)[i] = i + 1;
    hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, size * sizeof(float), hipMemcpyDefault);
    hipMemcpy((void*)mem_cpu2, (void*)(mem_gpu + n * sizeof(float)), sizeof(float), hipMemcpyDefault);
    std::cout << ((float*)mem_cpu2)[0] << std::endl;
  }

  free(mem_cpu2);
  hipFree(mem_gpu);
  if (use_pinned)
    hipHostFree(mem_cpu1);
  else
    free(mem_cpu1);
}

int main(int argc, char* argv[])
{
  f(true);   // uses pinned memory
  //f(true, hipHostMallocCoherent);
  //f(true, hipHostMallocNonCoherent);
  f(false);  // uses pageable memory
  return 0;
}
Result:
use_pinned: true
1 == 1
2 == 1.875
3 == 0
4 == 2
5 == 0
6 == 2.125
7 == 0
8 == 2.25
9 == 0
10 == 2.3125
11 == 0
12 == 2.375

use_pinned: false
1 == 1
2 == 2
3 == 3
4 == 4
5 == 5
6 == 6
7 == 7
8 == 8
9 == 9
10 == 10
11 == 11
12 == 12

Note that, in the pinned case, from the second iteration onward the float values read back are exactly the stale double-filled bytes reinterpreted as float (e.g. 1.875 is bytes 4-7 of the little-endian double 1.0, and the zeros are the low halves of small doubles), as if the float host-to-device copies from the pinned buffer stopped reaching the device. I also tried the coherency options for the host memory (hipHostMallocCoherent / hipHostMallocNonCoherent), but they made no difference.

With ROCm 4.1.1, MI50 works fine with the same code even when pinned memory is used, but MI100 still produces the same wrong result.
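As a rough CuPy-side analogue of the reuse pattern above, the sketch below follows the pinned-memory helper from the CuPy documentation (alloc_pinned_memory plus numpy.frombuffer): it reuses one pinned host buffer first as float64 and then as float32. Whether it trips the same bug depends on how the allocator hands the buffer back, so treat it as an illustration rather than a confirmed reproducer:

import numpy as np
import cupy as cp

nbytes = 12 * np.dtype(np.float64).itemsize
pinned = cp.cuda.alloc_pinned_memory(nbytes)  # pinned host buffer

# Reuse the same pinned buffer, first as float64 and then as float32,
# copying to the device and reading the elements back each time.
for dtype in (np.float64, np.float32):
    host = np.frombuffer(pinned, dtype, 12)
    host[:] = np.arange(1, 13, dtype=dtype)
    dev = cp.empty(12, dtype)
    dev.set(host)  # host-to-device copy from the pinned buffer
    print(dtype.__name__, cp.asnumpy(dev))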

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

4 reactions
amathews-amd commented, Apr 23, 2021

Thanks for the reproducer. I tried it out; however, I could not replicate the issue on MI100 with ROCm 4.0.1, 4.1.0, or 4.1.1.

I will check internally with the team.

2 reactions
kmaehashi commented, Jul 15, 2021

FYI, https://github.com/kmaehashi/cupy-rocm-ci-report/commits/gh-pages is now configured to use MI100 (gfx908). Jenkins can only cover MI50 / single-GPU unit tests, so I keep running that repo to cover MI100 / multi-GPU tests.

Read more comments on GitHub.

