
Successive copies to the same device memory from pinned memory have no effect on AMD ROCm 4.0.0 + MI100


On AMD ROCm 4.0.0 with an MI100 GPU, I observed that successively copying data to the same device memory region from pinned memory has no effect: the second copy silently leaves the previous contents in place. CuPy reuses a device memory region once allocated through its memory pool facility, so it runs into this situation, and some tests fail on MI100 because of it. I'll also report this case to AMD folks.

$ python test.py
use_pinned: True
[1. 1. 1.]
[1. 1. 1.]  # should be filled with 2.0
use_pinned: False
[1. 1. 1.]
[2. 2. 2.]
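
The reuse happens because of CuPy's caching allocator. As a minimal sketch of that behavior (my illustration, not part of the original report), the default memory pool typically hands back the same device pointer right after a free, so a fresh-looking array can land in an already-used region:

import cupy

a = cupy.zeros(3, dtype='d')
ptr1 = a.data.ptr
del a  # the block returns to CuPy's memory pool, not to the driver
b = cupy.zeros(3, dtype='d')  # same size, so the pooled block can be reused
ptr2 = b.data.ptr
print(ptr1 == ptr2)  # typically True with the default pool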

Environment

$ python -c 'import cupy; cupy.show_config()'
(pending)

Python reproducer

import numpy
import cupy

size = 3
nbytes = size * numpy.dtype('d').itemsize

def f(use_pinned):
    # Allocate host memory to copy from
    print()
    print(f'use_pinned: {use_pinned}')
    if use_pinned:
        mem = cupy.cuda.pinned_memory.alloc_pinned_memory(nbytes)
        cpu1 = numpy.frombuffer(mem, 'd', size)
    else:
        cpu1 = numpy.zeros((size,), dtype='d')

    # Allocate GPU memory
    gpu = cupy.zeros((size,), dtype='d')

    # Allocate host memory to copy to
    cpu2 = numpy.zeros((size,), dtype='d')

    # Fill the host memory with 1.0; copy from host; copy to host
    cpu1.fill(1)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)

    # Fill the host memory with 2.0; copy from host; copy to host
    cpu1.fill(2)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)  # FAILS on MI100, should be filled with 2.0

f(use_pinned=True)
f(use_pinned=False)
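
As a diagnostic variant (my sketch, not from the original report), the same copies can be issued with the asynchronous API on an explicit stream, synchronizing before each read-back; if the failure persists even with a hard synchronization point, that points at the copy itself being dropped rather than at stream ordering:

import numpy
import cupy

size = 3
nbytes = size * numpy.dtype('d').itemsize

mem = cupy.cuda.pinned_memory.alloc_pinned_memory(nbytes)
cpu1 = numpy.frombuffer(mem, 'd', size)  # pinned, as copy_from_host_async requires
cpu2 = numpy.zeros((size,), dtype='d')
gpu = cupy.zeros((size,), dtype='d')

stream = cupy.cuda.Stream(non_blocking=True)
for value in (1.0, 2.0):
    cpu1.fill(value)
    # Asynchronous H2D copy on the explicit stream, then a hard sync so the
    # copy must have completed before the read-back below.
    gpu.data.copy_from_host_async(cpu1.ctypes.data, nbytes, stream=stream)
    stream.synchronize()
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)  # expected: all 1.0 on the first pass, all 2.0 on the second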

HIP reproducer

// hipcc -o test test.cc

#include <cassert>
#include <cstdlib>
#include <iostream>
#include "hip/hip_runtime.h"

using namespace std;

constexpr size_t size = 3;
constexpr size_t nbytes = size * sizeof(double);

void print(double* mem) {
  for(size_t i = 0; i < size; ++i) {
    if (i > 0) cout << ", ";
    cout << mem[i];
  }
  cout << endl;
}

void f(bool use_pinned)
{
  char *mem_cpu1, *mem_gpu, *mem_cpu2;

  cout << endl;
  cout << "use_pinned: " << boolalpha << use_pinned << endl;

  // Allocate host memory to copy from
  if (use_pinned) {
    hipHostMalloc((void**)&mem_cpu1, nbytes, hipHostMallocPortable);
  } else {
    mem_cpu1 = (char*)malloc(nbytes);
  }
  assert(mem_cpu1);

  // Allocate GPU memory
  hipMalloc((void**)&mem_gpu, nbytes);
  assert(mem_gpu);

  // Allocate host memory to copy to
  mem_cpu2 = (char*)malloc(nbytes);
  assert(mem_cpu2);

  // Fill the host memory with 1.0; copy from host; copy to host
  for(size_t i = 0; i < size; ++i)
    ((double*)mem_cpu1)[i] = 1.0;
  hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
  hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
  print((double*)mem_cpu2);

  // Fill the host memory with 2.0; copy from host; copy to host
  for(size_t i = 0; i < size; ++i)
    ((double*)mem_cpu1)[i] = 2.0;
  hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
  hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
  print((double*)mem_cpu2);  // FAILS on MI100, should be filled with 2.0

  free(mem_cpu2);
  hipFree(mem_gpu);
  hipHostFree(mem_cpu1);
}

int main(int argc, char* argv[])
{
  f(true);   // use pinned memory
  f(false);  // doesn't use pinned memory
  return 0;
}

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
leofang commented, Mar 30, 2021

Thanks, @takagi. I suspect this bug has to do with the numerous weird failures we saw earlier (ex: https://github.com/cupy/cupy/pull/4653#issuecomment-778724972). Let me report it upstream and get their confirmation for the fix.

1 reaction
takagi commented, Mar 30, 2021

😅 Anyway, I also checked that it works!
