Evaluate further serialization performance improvements
Device to host serialization currently runs at around 1.5-2.0 GB/s. The major bottleneck is host memory allocation.
All serialization is now done by copying a Numba device array back to host as a NumPy array. Recently, hugepage support was introduced to NumPy, and we should see benefits automatically if /sys/kernel/mm/transparent_hugepage/enabled is set to madvise or always, but I was only able to see benefits when it's set to the latter. Even with that, host memory is being allocated at about 5 GB/s on a DGX-1, and copying happens at about 10 GB/s; since both operations happen in sequence, the effective rate of copying data back to host is about 3 GB/s. Some details were discussed in #98, starting at https://github.com/rapidsai/dask-cuda/pull/98#issuecomment-517219368.
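For illustration, a minimal timing sketch of the two stages described above (host allocation, then device-to-host copy), assuming Numba and NumPy are available; the payload size and names are illustrative, and this is not the benchmark from #98:

```python
import time
import numpy as np
from numba import cuda

nbytes = 1 << 30                             # 1 GiB payload (illustrative)
d_ary = cuda.to_device(np.zeros(nbytes, dtype=np.uint8))

t0 = time.perf_counter()
h_ary = np.empty(nbytes, dtype=np.uint8)     # host allocation (hugepages may help here)
h_ary[:] = 0                                 # touch pages so the allocation cost is paid now
t1 = time.perf_counter()
d_ary.copy_to_host(ary=h_ary)                # device-to-host copy into that buffer
t2 = time.perf_counter()

gb = nbytes / 1e9
print(f"alloc: {gb / (t1 - t0):.1f} GB/s, "
      f"copy: {gb / (t2 - t1):.1f} GB/s, "
      f"combined: {gb / (t2 - t0):.1f} GB/s")  # the two stages run in sequence
```

Because the stages are sequential, the combined rate is roughly the harmonic mean of the two, which is why ~5 GB/s allocation and ~10 GB/s copy land at roughly 3 GB/s overall.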
One alternative is to hold a host memory pool that we can transfer data into. That would require a custom memory copying function, since Numba requires the destination NumPy array to have the same format (shape, dtype, etc.) as the device array, which makes it impossible to keep a single pool for arrays of arbitrary formats.
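As a rough sketch of how such a pool might be targeted, assuming a single contiguous byte pool is acceptable: Numba's copy_to_host(ary=...) requires the destination to match the device array's shape and dtype, but a NumPy view over a slice of the pool can satisfy that without a fresh allocation. The function name copy_to_pooled_host and pool size are hypothetical:

```python
import numpy as np
from numba import cuda

POOL_NBYTES = 1 << 30
_host_pool = np.empty(POOL_NBYTES, dtype=np.uint8)   # allocated once, reused for every copy

def copy_to_pooled_host(d_ary):
    """Copy a contiguous device array into a view of the host pool (illustrative only)."""
    nbytes = d_ary.size * d_ary.dtype.itemsize
    if nbytes > POOL_NBYTES:
        raise ValueError("payload larger than the host pool")
    # Reinterpret the first `nbytes` of the pool with the source's dtype and shape,
    # so the destination format matches what Numba expects.
    view = _host_pool[:nbytes].view(d_ary.dtype).reshape(d_ary.shape)
    d_ary.copy_to_host(ary=view)
    return view

d = cuda.to_device(np.arange(12, dtype=np.float32).reshape(3, 4))
print(copy_to_pooled_host(d))
```

A real pool would additionally have to manage multiple outstanding payloads, alignment, and non-contiguous layouts, which is where the custom memory copying function mentioned above would come in.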
Issue Analytics
- Created: 4 years ago
- Comments: 22 (19 by maintainers)
Top GitHub Comments
It would be good to reassess how long this is taking. While there haven't been a lot of low-level changes, there are some notable ones like ensuring hugepages are used ( https://github.com/numpy/numpy/pull/14216 ). There have been a lot of high-level changes in Dask, Dask-CUDA, and RAPIDS since this issue was raised. For example, CUDA objects have become "dask" serializable ( https://github.com/dask/distributed/pull/3482 ) ( https://github.com/rapidsai/cudf/pull/4153 ), which Dask-CUDA leverages for spilling ( https://github.com/rapidsai/dask-cuda/pull/256 ). Distributed learned how to serialize collections of CUDA objects regardless of size ( https://github.com/dask/distributed/pull/3689 ), which has simplified things in Dask-CUDA further ( https://github.com/rapidsai/dask-cuda/pull/307 ). A bug fix to Distributed's spilling logic ( https://github.com/dask/distributed/pull/3639 ) and better memoryview serialization ( https://github.com/dask/distributed/pull/3743 ) have allowed us to perform fewer serialization passes ( https://github.com/rapidsai/dask-cuda/pull/309 ). We've also generalized, streamlined, and improved the robustness of serialization in RAPIDS through multiple PRs, most recently https://github.com/rapidsai/cudf/pull/5139. I believe there are probably more high-level improvements we can make here. We can also still make low-level improvements, like using pinned memory ( https://github.com/rapidsai/rmm/issues/260 ) and/or using different RMM memory resources together (like UVM typically and device memory for UCX communication). Additionally, things like packing/unpacking ( https://github.com/rapidsai/cudf/pull/5025 ) would allow us to transfer a single buffer (instead of multiple) between host and device.
@jakirkham Out of curiosity I changed my benchmark script slightly (to not use %timeit) and enlarged the size to 4 GB. I was able to see better performance with pinned memory. Output:
But I am not always able to get this performance across runs; occasionally they are on par. (It could be that I'm not the only user on the system, but that alone would not explain the variation…)
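For context, a minimal sketch of the kind of pinned-versus-pageable comparison described above, assuming Numba's cuda.pinned_array is used for the page-locked destination; the size is illustrative and this is not the original benchmark script:

```python
import time
import numpy as np
from numba import cuda

shape, dtype = (1 << 30,), np.uint8            # ~1 GB payload (illustrative)
d_ary = cuda.to_device(np.zeros(shape, dtype=dtype))

pageable = np.zeros(shape, dtype=dtype)        # ordinary (pageable) host memory
pinned = cuda.pinned_array(shape, dtype=dtype) # page-locked host memory
pinned[:] = 0                                  # touch pages before timing

for name, dst in [("pageable", pageable), ("pinned", pinned)]:
    t0 = time.perf_counter()
    d_ary.copy_to_host(ary=dst)                # pinned memory allows direct DMA transfers
    t1 = time.perf_counter()
    print(f"{name}: {shape[0] / (t1 - t0) / 1e9:.1f} GB/s")
```

Pinned destinations avoid the staging copy through a driver-managed bounce buffer, which is the usual explanation for the speedup, though results can still vary with system load as noted above.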