
Dask's Arrow serialization slow & memory intensive

See original GitHub issue

I’m creating a dummy 80MB single-partition Dask distributed DataFrame, and attempting to convert it to a PyArrow Table.

Doing so causes a notebook to throw GC warnings, and takes consistently over 20 seconds.

Versions: PyArrow 0.12.0, Dask 1.1.1

Repro:

from dask.distributed import Client, LocalCluster
import dask.array as da
import dask.dataframe as dd
import pyarrow as pa

# Start a local cluster and attach a client to it.
ip = '0.0.0.0'
cluster = LocalCluster(ip=ip)
client = Client(cluster)

n_rows = 5000000
n_keys = 5000000

# Build a single ~80MB DataFrame: a random float column 'x' and an integer key column 'id'.
ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

# Convert each partition to a PyArrow Table.
def get_arrow(df):
    return pa.Table.from_pandas(df)

%time arrow_tables = ddf.map_partitions(get_arrow).compute()

Result:

distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 26% CPU time recently (threshold: 10%)
CPU times: user 20.6 s, sys: 1.17 s, total: 21.7 s
Wall time: 22.5 s

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Feb 11, 2019

Yup, totally agree. My example above shows that this might be because we’re putting these things into Pandas series, and apparently putting an Arrow Table into a Pandas Series takes several seconds. Probably this isn’t a problem with serialization or communication on the Dask end, it’s rather that by using Dask dataframe you’re trying to keep these things around in Pandas, which is currently borking. (At least until @TomAugspurger refactors the Pandas constructor logic).

Short term the solution here is probably to avoid calling map_partitions(pa.Table.from_pandas), which will try to form another dataframe object, and instead use Dask Delayed as @TomAugspurger suggests above:

tables = [dask.delayed(pa.Table.from_pandas)(x) for x in ddf.to_delayed()]

This will avoid trying to put PyArrow table objects in Pandas series, which seems to be the fundamental bug here.

0 reactions
d6tdev commented, Oct 20, 2019

Got this when calling .to_parquet() without fastparquet installed, so Dask used pyarrow to write and then had trouble reading the files back. Installing fastparquet fixed the problem.

Read more comments on GitHub >

Top Results From Across the Web

  • Apache Spark Performance Boosting | by Halil Ertan — Several storage levels are available in Spark; they can be set according to serialization, memory, and data size factors.
  • Streaming, Serialization, and IPC — Apache Arrow v10.0.1 — File or Random Access format: for serializing a fixed number of record batches. Supports random access, and thus is very useful when used...
  • Apache Arrow 3.0 - Hacker News — One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde...
  • Optimizing performance of GATK workflows using Apache ... — We integrate Apache Arrow in-memory based Sequence Alignment/Map (SAM) ... Third is to avoid (de)serialization of data when processing in ...
  • Supercharging Visualization with Apache Arrow - KDnuggets — Apache Arrow provides a new way to exchange and visualize data at unprecedented ... and 2) CPU and memory-intensive data serialization.
