Processing large remote Parquet datasets
Recently I ran into some issues processing large Parquet datasets stored in S3 (with the code snippet below) and raised a few corresponding issues:
- https://github.com/dask/dask/issues/8026
- https://github.com/dask/dask/issues/8027
- https://github.com/dask/dask/issues/8030
- https://github.com/dask/dask/issues/8028
- https://github.com/dask/dask/issues/8056
I’m wondering whether other folks have run into issues of their own. I’d like to use this issue to solicit and track issues related to processing large, remote Parquet datasets.
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
# Use Coiled to create a large cluster.
# Though any Dask cluster would work just as well.
import coiled
cluster = coiled.Cluster(n_workers=200, worker_memory="16 GiB")
client = Client(cluster)
# Generate a large synthetic dataset and write it to S3
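# (Roughly 21 years of rows at 1-second frequency with 1,000 integer columns,
# i.e. several terabytes of data once materialized.)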
df = dask.datasets.timeseries(
    dtypes={str(i): int for i in range(1_000)},
    freq="1s",
    start="2000-01-01",
    end="2021-01-01",
)
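# Filesystem options forwarded to s3fs (here just pinning the S3 region)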
storage_options = {"config_kwargs": {"region_name": "us-east-1"}}
df.to_parquet(
    "s3://<bucket>/timeseries.parquet",
    engine="pyarrow-dataset",
    overwrite=True,
    storage_options=storage_options,
)
# Read in dataset and do some basic processing
df = dd.read_parquet(
"s3://<bucket>/timeseries.parquet",
engine="pyarrow-dataset",
storage_options=storage_options,
)
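# Row-wise helper: joins the first three column values into a single string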
def func(row: pd.Series) -> str:
    return f"{row[0]}-{row[1]}-{row[2]}"
df["new"] = df.apply(func, axis=1)
# Write processed dataset back to S3
df.to_parquet(
"s3://<bucket>/timeseries-processed.parquet",
engine="pyarrow-dataset",
overwrite=True,
storage_options=storage_options,
)
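As a quick sanity check, here is a minimal sketch (assuming the same bucket and storage options as above, and not part of the original snippet) that reads the processed dataset back and inspects a few rows:

# Read the processed dataset back and look at a handful of rows
processed = dd.read_parquet(
    "s3://<bucket>/timeseries-processed.parquet",
    engine="pyarrow-dataset",
    storage_options=storage_options,
)
print(processed.head())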
All the linked issues have been resolved, closing this in favor of more specific issues.
Woo!