
Successive copies to the same device memory from pinned memory have no effect on AMD ROCm 4.0.0 + MI100


On AMD ROCm 4.0.0 with an MI100 GPU, I observed that successively copying data to the same device memory region from pinned memory has no effect: the second copy silently leaves the previous contents in place. CuPy reuses a device memory region once allocated through its memory pool facility, so it runs into this situation, and some tests fail on MI100 because of it. I'll also report this case to AMD folks.

$ python test.py
use_pinned: True
[1. 1. 1.]
[1. 1. 1.]  # should be filled with 2.0
use_pinned: False
[1. 1. 1.]
[2. 2. 2.]
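
The reuse happens because of CuPy's caching allocator. As a minimal sketch of that behavior (my illustration, not part of the original report), the default memory pool typically hands back the same device pointer right after a free, so a fresh-looking array can land in an already-used region:

import cupy

a = cupy.zeros(3, dtype='d')
ptr1 = a.data.ptr
del a  # the block returns to CuPy's memory pool, not to the driver
b = cupy.zeros(3, dtype='d')  # same size, so the pooled block can be reused
ptr2 = b.data.ptr
print(ptr1 == ptr2)  # typically True with the default pool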

Environment

$ python -c 'import cupy; cupy.show_config()'
(pending)

Python reproducer

import numpy
import cupy

size = 3
nbytes = size * numpy.dtype('d').itemsize

def f(use_pinned):
    # Allocate host memory to copy from
    print()
    print(f'use_pinned: {use_pinned}')
    if use_pinned:
        mem = cupy.cuda.pinned_memory.alloc_pinned_memory(nbytes)
        cpu1 = numpy.frombuffer(mem, 'd', size)
    else:
        cpu1 = numpy.zeros((size,), dtype='d')

    # Allocate GPU memory
    gpu = cupy.zeros((size,), dtype='d')

    # Allocate host memory to copy to
    cpu2 = numpy.zeros((size,), dtype='d')

    # Fill the host memory with 1.0; copy from host; copy to host
    cpu1.fill(1)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)

    # Fill the host memory with 2.0; copy from host; copy to host
    cpu1.fill(2)
    gpu.data.copy_from_host(cpu1.ctypes.data, nbytes)
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)  # FAILS on MI100, should be filled with 2.0

f(use_pinned=True)
f(use_pinned=False)
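
As a diagnostic variant (my sketch, not from the original report), the same copies can be issued with the asynchronous API on an explicit stream, synchronizing before each read-back; if the failure persists even with a hard synchronization point, that points at the copy itself being dropped rather than at stream ordering:

import numpy
import cupy

size = 3
nbytes = size * numpy.dtype('d').itemsize

mem = cupy.cuda.pinned_memory.alloc_pinned_memory(nbytes)
cpu1 = numpy.frombuffer(mem, 'd', size)  # pinned, as copy_from_host_async requires
cpu2 = numpy.zeros((size,), dtype='d')
gpu = cupy.zeros((size,), dtype='d')

stream = cupy.cuda.Stream(non_blocking=True)
for value in (1.0, 2.0):
    cpu1.fill(value)
    # Asynchronous H2D copy on the explicit stream, then a hard sync so the
    # copy must have completed before the read-back below.
    gpu.data.copy_from_host_async(cpu1.ctypes.data, nbytes, stream=stream)
    stream.synchronize()
    gpu.data.copy_to_host(cpu2.ctypes.data, nbytes)
    print(cpu2)  # expected: all 1.0 on the first pass, all 2.0 on the second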

HIP reproducer

// hipcc -o test test.cc

#include <cassert>
#include <cstdlib>
#include <iostream>
#include "hip/hip_runtime.h"

using namespace std;

constexpr size_t size = 3;
constexpr size_t nbytes = size * sizeof(double);

void print(double* mem) {
  for(size_t i = 0; i < size; ++i) {
    if (i > 0) cout << ", ";
    cout << mem[i];
  }
  cout << endl;
}

void f(bool use_pinned)
{
  char *mem_cpu1, *mem_gpu, *mem_cpu2;

  cout << endl;
  cout << "use_pinned: " << boolalpha << use_pinned << endl;

  // Allocate host memory to copy from
  if (use_pinned) {
    hipHostMalloc((void**)&mem_cpu1, nbytes, hipHostMallocPortable);
  } else {
    mem_cpu1 = (char*)malloc(nbytes);
  }
  assert(mem_cpu1);

  // Allocate GPU memory
  hipMalloc((void**)&mem_gpu, nbytes);
  assert(mem_gpu);

  // Allocate host memory to copy to
  mem_cpu2 = (char*)malloc(nbytes);
  assert(mem_cpu2);

  // Fill the host memory with 1.0; copy from host; copy to host
  for(size_t i = 0; i < size; ++i)
    ((double*)mem_cpu1)[i] = 1.0;
  hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
  hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
  print((double*)mem_cpu2);

  // Fill the host memory with 2.0; copy from host; copy to host
  for(size_t i = 0; i < size; ++i)
    ((double*)mem_cpu1)[i] = 2.0;
  hipMemcpy((void*)mem_gpu, (void*)mem_cpu1, nbytes, hipMemcpyHostToDevice);
  hipMemcpy((void*)mem_cpu2, (void*)mem_gpu, nbytes, hipMemcpyDeviceToHost);
  print((double*)mem_cpu2);  // FAILS on MI100, should be filled with 2.0

  free(mem_cpu2);
  hipFree(mem_gpu);
  hipHostFree(mem_cpu1);
}

int main(int argc, char* argv[])
{
  f(true);   // use pinned memory
  f(false);  // doesn't use pinned memory
  return 0;
}

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
leofang commented, Mar 30, 2021

Thanks, @takagi. I suspect this bug has to do with the numerous weird failures we saw earlier (ex: https://github.com/cupy/cupy/pull/4653#issuecomment-778724972). Let me report it upstream and get their confirmation for the fix.

1 reaction
takagi commented, Mar 30, 2021

😅 Anyway, I also checked that it works!
