
Slow loading of data with `Series.new()`


Loading 100M numbers as an `Int32Series` with `Series.new()` on an NVIDIA A10 (PCIe Gen4: 64 GB/s) takes approximately 5s. That works out to an effective bandwidth of 80MB/s (100M values × 4 bytes / 5s), 800 times less than PCIe Gen4's bandwidth. Is that to be expected? And if so, are there any workarounds?
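For reference, the arithmetic behind those figures can be checked with a quick sketch (plain Node, nothing from node-rapids; the numbers are the ones quoted above):

```javascript
// Back-of-the-envelope check of the reported Series.new() transfer rate.
const elements = 100e6;     // 100M Int32 values
const bytes = elements * 4; // Int32 = 4 bytes -> 400 MB total
const seconds = 5;          // observed load time

const effectiveMBps = bytes / seconds / 1e6;
console.log(effectiveMBps); // 80 MB/s

const pcieGen4Bps = 64e9;   // PCIe Gen4 x16: 64 GB/s
console.log(pcieGen4Bps / (bytes / seconds)); // 800x slower than the link
```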

As a point of comparison, a 93MB Parquet file on local NVMe storage with approximately 1GB of uncompressed data loads in about 320ms (3.1GB/s) with DataFrame.readParquet().

Another point of comparison would be the loading of 2 columns of 100M Int32 values from an Apache Arrow Table with DataFrame.fromArrow(Arrow.tableToIPC(arrowTable)), which takes 313ms (2.5GB/s). The IPC to DF loading takes 71ms out of these 313ms (11GB/s).

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10

Top GitHub Comments

1 reaction
trxcllnt commented, Jul 30, 2022

@ghalimi How are you constructing the Series? The slowest path is passing a JS Array of numbers, since we have to loop through and copy each value in C++. If you pass an Int32Array instead, it does a single block host-to-device (HtoD) copy, but the performance of that copy depends on the kind of host memory involved.
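The gap between those two paths can be illustrated host-side in plain Node, with no GPU involved (a rough sketch; absolute timings are machine-dependent, only the ratio matters):

```javascript
const N = 10 ** 7;

// A plain JS Array of numbers vs. a typed array holding the same data.
const jsArray = Array.from({ length: N }, (_, i) => i);
const typed = Int32Array.from(jsArray);

// Slow path: element-by-element copy, like looping over a JS Array in C++.
const dst1 = new Int32Array(N);
console.time('per-element copy');
for (let i = 0; i < N; i++) dst1[i] = jsArray[i];
console.timeEnd('per-element copy');

// Fast path: one contiguous block copy, like the HtoD copy of an Int32Array.
const dst2 = new Int32Array(N);
console.time('block copy');
dst2.set(typed);
console.timeEnd('block copy');
```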

Take the following example (run on my Turing RTX8000):

$ node -p <<EOF
const {Int32Buffer} = require('@rapidsai/cuda')
const {DeviceBuffer} = require('@rapidsai/rmm')
const {Series} = require('@rapidsai/cudf')

// 1
Series.new([0, 1, 2]).sum() // ensure CUDA driver is initialized

var i32 = new Int32Array(10**9)

console.time('new DeviceBuffer(i32)')
new DeviceBuffer(i32)
console.timeEnd('new DeviceBuffer(i32)')

// 2
var i32 = new Int32Array(10**9)

console.time('new Int32Buffer(new DeviceBuffer(i32))')
new Int32Buffer(new DeviceBuffer(i32))
console.timeEnd('new Int32Buffer(new DeviceBuffer(i32))')

// 3
// var i32 = new Int32Array(10**9)

console.time('Series.new(new Int32Buffer(new DeviceBuffer(i32)))')
Series.new(new Int32Buffer(new DeviceBuffer(i32)))
console.timeEnd('Series.new(new Int32Buffer(new DeviceBuffer(i32)))')
EOF
> new DeviceBuffer(i32): 1.363s
> new Int32Buffer(new DeviceBuffer(i32)): 1.370s
> Series.new(new Int32Buffer(new DeviceBuffer(i32))): 377.321ms
  1. Here I’m paying the fixed-cost driver initialization time. This can skew the results, so warming up the driver on startup is recommended before testing.
  2. After the first copy, I create a new slab of host memory and test again.
  3. Here I’m reusing the same host memory created at step 2, and the time is significantly lower. This is the driver optimizing the copy: it sees the same pointer and unchanged data, so it elides the HtoD copy. If you uncomment this line, the runtime of the last test jumps back up to ~1.37s.

~1.37s to copy 3.81GB is ~2.72GB/s, close to what you’re seeing with Parquet and Arrow. That’s because they’re all copying from pageable host memory, essentially the slowest kind of HtoD copy in CUDA.

If possible, one workaround is to use pinned host memory (the bindings for which are in @rapidsai/cuda):

$ node -p <<EOF
const {PinnedMemory, Int32Buffer} = require('@rapidsai/cuda')
const {DeviceBuffer} = require('@rapidsai/rmm')
const {Series} = require('@rapidsai/cudf')

Series.new([0, 1, 2]).sum() // ensure CUDA driver is initialized

var i32 = new Int32Buffer(new PinnedMemory(10**9))

console.time('new DeviceBuffer(i32)')
new DeviceBuffer(i32)
console.timeEnd('new DeviceBuffer(i32)')

console.time('Series.new(i32)')
Series.new(i32)
console.timeEnd('Series.new(i32)')
EOF
> new DeviceBuffer(i32): 83.094ms
> Series.new(i32): 86.523ms

The above equates to 43-45GB/s, saturating the full PCI-E 3 bandwidth of my Turing cards.
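To turn `console.time()` readings like these into a bandwidth figure, here is a small hypothetical helper (the function name is mine, not part of `@rapidsai/cuda`):

```javascript
// Effective bandwidth in GB/s from bytes transferred and elapsed milliseconds.
function bandwidthGBps(bytes, ms) {
  return bytes / (ms / 1000) / 1e9;
}

// The question's original case: 100M Int32 (400 MB) in ~5000 ms.
console.log(bandwidthGBps(400e6, 5000)); // 0.08 GB/s, i.e. 80 MB/s
```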

0 reactions
ghalimi commented, Jul 30, 2022

Great! That’s going to save us a ton of headaches…

