Evaluate further serialization performance improvements
Device to host serialization currently runs at around 1.5-2.0 GB/s. The major bottleneck is host memory allocation.
All serialization is now done by copying a Numba device array back to host as a NumPy array. Recently, hugepage support was introduced to NumPy, and we should see benefits automatically if /sys/kernel/mm/transparent_hugepage/enabled is set to madvise or always, but I was only able to see benefits when it's set to the latter. Even with that, host memory is being allocated at about 5 GB/s on a DGX-1, and copying happens at about 10 GB/s; since both operations happen in sequence, the effective rate of copying data back to host is about 3 GB/s. Some details were discussed in #98, starting at https://github.com/rapidsai/dask-cuda/pull/98#issuecomment-517219368.
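For illustration, a minimal timing sketch of the two stages described above (host allocation, then device-to-host copy), assuming Numba and NumPy are available; the payload size and names are illustrative, and this is not the benchmark from #98:

```python
import time
import numpy as np
from numba import cuda

nbytes = 1 << 30                             # 1 GiB payload (illustrative)
d_ary = cuda.to_device(np.zeros(nbytes, dtype=np.uint8))

t0 = time.perf_counter()
h_ary = np.empty(nbytes, dtype=np.uint8)     # host allocation (hugepages may help here)
h_ary[:] = 0                                 # touch pages so the allocation cost is paid now
t1 = time.perf_counter()
d_ary.copy_to_host(ary=h_ary)                # device-to-host copy into that buffer
t2 = time.perf_counter()

gb = nbytes / 1e9
print(f"alloc: {gb / (t1 - t0):.1f} GB/s, "
      f"copy: {gb / (t2 - t1):.1f} GB/s, "
      f"combined: {gb / (t2 - t0):.1f} GB/s")  # the two stages run in sequence
```

Because the stages are sequential, the combined rate is roughly the harmonic mean of the two, which is why ~5 GB/s allocation and ~10 GB/s copy land at roughly 3 GB/s overall.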
One alternative is to hold a host memory pool that we can transfer data into. That would require a custom memory copying function, since Numba requires the destination NumPy array to have the same format (shape, dtype, etc.) as the device array, which makes it impossible to keep a single pool for arrays of arbitrary formats.
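As a rough sketch of how such a pool might be targeted, assuming a single contiguous byte pool is acceptable: Numba's copy_to_host(ary=...) requires the destination to match the device array's shape and dtype, but a NumPy view over a slice of the pool can satisfy that without a fresh allocation. The function name copy_to_pooled_host and pool size are hypothetical:

```python
import numpy as np
from numba import cuda

POOL_NBYTES = 1 << 30
_host_pool = np.empty(POOL_NBYTES, dtype=np.uint8)   # allocated once, reused for every copy

def copy_to_pooled_host(d_ary):
    """Copy a contiguous device array into a view of the host pool (illustrative only)."""
    nbytes = d_ary.size * d_ary.dtype.itemsize
    if nbytes > POOL_NBYTES:
        raise ValueError("payload larger than the host pool")
    # Reinterpret the first `nbytes` of the pool with the source's dtype and shape,
    # so the destination format matches what Numba expects.
    view = _host_pool[:nbytes].view(d_ary.dtype).reshape(d_ary.shape)
    d_ary.copy_to_host(ary=view)
    return view

d = cuda.to_device(np.arange(12, dtype=np.float32).reshape(3, 4))
print(copy_to_pooled_host(d))
```

A real pool would additionally have to manage multiple outstanding payloads, alignment, and non-contiguous layouts, which is where the custom memory copying function mentioned above would come in.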
Issue Analytics
- Created: 4 years ago
- Comments: 22 (19 by maintainers)
Top GitHub Comments
It would be good to reassess how long this is taking. While there haven't been a lot of low-level changes, there are some notable ones like ensuring hugepages are used ( https://github.com/numpy/numpy/pull/14216 ). There have been a lot of high-level changes in Dask, Dask-CUDA, and RAPIDS since this issue was raised. For example, CUDA objects have become "dask" serializable ( https://github.com/dask/distributed/pull/3482 ) ( https://github.com/rapidsai/cudf/pull/4153 ), which Dask-CUDA leverages for spilling ( https://github.com/rapidsai/dask-cuda/pull/256 ). Distributed learned how to serialize collections of CUDA objects regardless of size ( https://github.com/dask/distributed/pull/3689 ), which has simplified things in Dask-CUDA further ( https://github.com/rapidsai/dask-cuda/pull/307 ). A bug fix to Distributed's spilling logic ( https://github.com/dask/distributed/pull/3639 ) and better memoryview serialization ( https://github.com/dask/distributed/pull/3743 ) have allowed us to perform fewer serialization passes ( https://github.com/rapidsai/dask-cuda/pull/309 ). We've also generalized, streamlined, and improved the robustness of serialization in RAPIDS through multiple PRs, most recently https://github.com/rapidsai/cudf/pull/5139. I believe there are probably more high-level improvements we can make here. We can also still make low-level improvements, like using pinned memory ( https://github.com/rapidsai/rmm/issues/260 ) and/or using different RMM memory resources together (like UVM typically and device memory for UCX communication). Additionally, things like packing/unpacking ( https://github.com/rapidsai/cudf/pull/5025 ) would allow us to transfer a single buffer (instead of multiple) between host and device.
@jakirkham Out of curiosity I changed my benchmark script slightly (to not use %timeit) and enlarged the size to 4 GB. I was able to see better performance with pinned memory. Output:
But I am not always able to get this performance across runs; occasionally they are on par. (It could be that I'm not the only user on the system, but that alone would not explain the variation…)
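For context, a minimal sketch of the kind of pinned-versus-pageable comparison described above, assuming Numba's cuda.pinned_array is used for the page-locked destination; the size is illustrative and this is not the original benchmark script:

```python
import time
import numpy as np
from numba import cuda

shape, dtype = (1 << 30,), np.uint8            # ~1 GB payload (illustrative)
d_ary = cuda.to_device(np.zeros(shape, dtype=dtype))

pageable = np.zeros(shape, dtype=dtype)        # ordinary (pageable) host memory
pinned = cuda.pinned_array(shape, dtype=dtype) # page-locked host memory
pinned[:] = 0                                  # touch pages before timing

for name, dst in [("pageable", pageable), ("pinned", pinned)]:
    t0 = time.perf_counter()
    d_ary.copy_to_host(ary=dst)                # pinned memory allows direct DMA transfers
    t1 = time.perf_counter()
    print(f"{name}: {shape[0] / (t1 - t0) / 1e9:.1f} GB/s")
```

Pinned destinations avoid the staging copy through a driver-managed bounce buffer, which is the usual explanation for the speedup, though results can still vary with system load as noted above.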