Dask's Arrow serialization slow & memory intensive
I'm creating a dummy 80 MB single-partition Dask distributed DataFrame and attempting to convert it to a PyArrow Table. Doing so causes the notebook to emit garbage-collection warnings and consistently takes over 20 seconds.
Versions: PyArrow 0.12.0, Dask 1.1.1
Repro:
from dask.distributed import Client, wait, LocalCluster
import pyarrow as pa
ip = '0.0.0.0'
cluster = LocalCluster(ip=ip)
client = Client(cluster)
import dask.array as da
import dask.dataframe as dd
n_rows = 5000000
n_keys = 5000000
ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

def get_arrow(df):
    return pa.Table.from_pandas(df)
%time arrow_tables = ddf.map_partitions(get_arrow).compute()
Result:
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 26% CPU time recently (threshold: 10%)
CPU times: user 20.6 s, sys: 1.17 s, total: 21.7 s
Wall time: 22.5 s
Issue Analytics
- State:
- Created 5 years ago
- Comments: 16 (13 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yup, totally agree. My example above shows that this might be because we’re putting these things into Pandas series, and apparently putting an Arrow Table into a Pandas Series takes several seconds. Probably this isn’t a problem with serialization or communication on the Dask end, it’s rather that by using Dask dataframe you’re trying to keep these things around in Pandas, which is currently borking. (At least until @TomAugspurger refactors the Pandas constructor logic).
Short term, the solution here is probably to avoid calling map_partitions(pa.Table.from_pandas), which will try to form another dataframe object, and instead use Dask Delayed as @TomAugspurger suggests above. This will avoid trying to put PyArrow Table objects in pandas Series, which seems to be the fundamental bug here.
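A minimal sketch of that workaround (not the exact snippet from the thread; the small stand-in frame is made up for illustration). Converting each partition via dask.delayed keeps the resulting Arrow tables as plain Python objects rather than packing them into a pandas Series:

    import pandas as pd
    import pyarrow as pa
    import dask
    import dask.dataframe as dd

    # Small stand-in for the 80 MB frame in the repro above.
    pdf = pd.DataFrame({'x': [0.1, 0.2, 0.3, 0.4], 'id': [1, 2, 3, 4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # to_delayed() yields one Delayed object per partition; wrapping
    # pa.Table.from_pandas with dask.delayed means compute() returns
    # the Arrow tables directly, never wrapping them in a pandas Series.
    tables = [dask.delayed(pa.Table.from_pandas)(part) for part in ddf.to_delayed()]
    arrow_tables = dask.compute(*tables)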
Got this when calling .to_parquet() without having fastparquet installed, so it used pyarrow to write and then had trouble reading the result. The problem was fixed by installing fastparquet.
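One way to sidestep that engine mismatch is to pin the Parquet engine explicitly on both the write and the read, instead of relying on whichever of pyarrow/fastparquet happens to be installed. An illustrative sketch (the temp path and tiny frame are made up):

    import os
    import tempfile
    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(
        pd.DataFrame({'x': [1.0, 2.0], 'id': [1, 2]}), npartitions=1
    )

    # Pin the engine on both sides so the writer and reader agree.
    path = os.path.join(tempfile.mkdtemp(), 'out.parquet')
    ddf.to_parquet(path, engine='pyarrow')
    roundtrip = dd.read_parquet(path, engine='pyarrow').compute()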