
Slow loading of data with `Series.new()`


Loading 100M numbers as an `Int32Series` with `Series.new()` on an NVIDIA A10 (PCIe Gen4: 64 GB/s) takes approximately 5s. That works out to an effective bandwidth of 80MB/s (100M values × 4 bytes / 5s), 800 times less than PCIe Gen4's bandwidth. Is that to be expected? And if so, are there any workarounds?
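For reference, the arithmetic behind those figures can be checked with a quick sketch (plain Node, nothing from node-rapids; the numbers are the ones quoted above):

```javascript
// Back-of-the-envelope check of the reported Series.new() transfer rate.
const elements = 100e6;     // 100M Int32 values
const bytes = elements * 4; // Int32 = 4 bytes -> 400 MB total
const seconds = 5;          // observed load time

const effectiveMBps = bytes / seconds / 1e6;
console.log(effectiveMBps); // 80 MB/s

const pcieGen4Bps = 64e9;   // PCIe Gen4 x16: 64 GB/s
console.log(pcieGen4Bps / (bytes / seconds)); // 800x slower than the link
```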

As a point of comparison, a 93MB Parquet file on local NVMe storage with approximately 1GB of uncompressed data loads in about 320ms (3.1GB/s) with DataFrame.readParquet().

Another point of comparison would be the loading of 2 columns of 100M Int32 values from an Apache Arrow Table with DataFrame.fromArrow(Arrow.tableToIPC(arrowTable)), which takes 313ms (2.5GB/s). The IPC to DF loading takes 71ms out of these 313ms (11GB/s).

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10

Top GitHub Comments

1 reaction
trxcllnt commented, Jul 30, 2022

@ghalimi How are you constructing the Series? The slowest path is passing a JS Array of numbers, since we have to loop through and copy each value in C++. If you pass an Int32Array instead, it does a single block host-to-device (HtoD) copy, but the performance of that copy depends on the kind of host memory involved.
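The gap between those two paths can be illustrated host-side in plain Node, with no GPU involved (a rough sketch; absolute timings are machine-dependent, only the ratio matters):

```javascript
const N = 10 ** 7;

// A plain JS Array of numbers vs. a typed array holding the same data.
const jsArray = Array.from({ length: N }, (_, i) => i);
const typed = Int32Array.from(jsArray);

// Slow path: element-by-element copy, like looping over a JS Array in C++.
const dst1 = new Int32Array(N);
console.time('per-element copy');
for (let i = 0; i < N; i++) dst1[i] = jsArray[i];
console.timeEnd('per-element copy');

// Fast path: one contiguous block copy, like the HtoD copy of an Int32Array.
const dst2 = new Int32Array(N);
console.time('block copy');
dst2.set(typed);
console.timeEnd('block copy');
```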

Take the following example (run on my Turing RTX8000):

$ node -p <<EOF
const {Int32Buffer} = require('@rapidsai/cuda')
const {DeviceBuffer} = require('@rapidsai/rmm')
const {Series} = require('@rapidsai/cudf')

// 1
Series.new([0, 1, 2]).sum() // ensure CUDA driver is initialized

var i32 = new Int32Array(10**9)

console.time('new DeviceBuffer(i32)')
new DeviceBuffer(i32)
console.timeEnd('new DeviceBuffer(i32)')

// 2
var i32 = new Int32Array(10**9)

console.time('new Int32Buffer(new DeviceBuffer(i32))')
new Int32Buffer(new DeviceBuffer(i32))
console.timeEnd('new Int32Buffer(new DeviceBuffer(i32))')

// 3
// var i32 = new Int32Array(10**9)

console.time('Series.new(new Int32Buffer(new DeviceBuffer(i32)))')
Series.new(new Int32Buffer(new DeviceBuffer(i32)))
console.timeEnd('Series.new(new Int32Buffer(new DeviceBuffer(i32)))')
EOF
> new DeviceBuffer(i32): 1.363s
> new Int32Buffer(new DeviceBuffer(i32)): 1.370s
> Series.new(new Int32Buffer(new DeviceBuffer(i32))): 377.321ms
  1. Here I’m paying the fixed-cost driver initialization time. This can skew the results, so warming up the driver on startup is recommended before testing.
  2. After the first copy, I create a new slab of host memory and test again.
  3. Here I’m reusing the same host memory created at step 2, and the time is significantly lower. This is the driver optimizing the copy: it sees the same pointer and unchanged data, so it elides the HtoD copy. If you uncomment this line, the runtime of the last test jumps back up to ~1.37s.

~1.37s to copy 3.81GB is ~2.72GB/s, close to what you’re seeing with Parquet and Arrow. That’s because they’re all copying from pageable host memory, essentially the slowest kind of HtoD copy in CUDA.

If possible, one workaround is to use pinned host memory (the bindings for which are in @rapidsai/cuda):

$ node -p <<EOF
const {PinnedMemory, Int32Buffer} = require('@rapidsai/cuda')
const {DeviceBuffer} = require('@rapidsai/rmm')
const {Series} = require('@rapidsai/cudf')

Series.new([0, 1, 2]).sum() // ensure CUDA driver is initialized

var i32 = new Int32Buffer(new PinnedMemory(10**9))

console.time('new DeviceBuffer(i32)')
new DeviceBuffer(i32)
console.timeEnd('new DeviceBuffer(i32)')

console.time('Series.new(i32)')
Series.new(i32)
console.timeEnd('Series.new(i32)')
EOF
> new DeviceBuffer(i32): 83.094ms
> Series.new(i32): 86.523ms

The above equates to 43-45GB/s, saturating the full PCI-E 3 bandwidth of my Turing cards.
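To turn `console.time()` readings like these into a bandwidth figure, here is a small hypothetical helper (the function name is mine, not part of `@rapidsai/cuda`):

```javascript
// Effective bandwidth in GB/s from bytes transferred and elapsed milliseconds.
function bandwidthGBps(bytes, ms) {
  return bytes / (ms / 1000) / 1e9;
}

// The question's original case: 100M Int32 (400 MB) in ~5000 ms.
console.log(bandwidthGBps(400e6, 5000)); // 0.08 GB/s, i.e. 80 MB/s
```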

0 reactions
ghalimi commented, Jul 30, 2022

Great! That’s going to save us a ton of headaches…

