Processing large remote Parquet datasets
Recently I ran into some issues processing large Parquet datasets stored in S3 (with the code snippet below) and raised a few corresponding issues:
- https://github.com/dask/dask/issues/8026
- https://github.com/dask/dask/issues/8027
- https://github.com/dask/dask/issues/8030
- https://github.com/dask/dask/issues/8028
- https://github.com/dask/dask/issues/8056
I’m wondering whether other folks have run into issues of their own. I’d like to use this issue to solicit and track issues related to processing large, remote Parquet datasets.
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
# Use Coiled to create a large cluster.
# Though any Dask cluster would work just as well.
import coiled
cluster = coiled.Cluster(n_workers=200, worker_memory="16 GiB")
client = Client(cluster)
# Generate a large synthetic dataset and write it to S3
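# (Roughly 21 years of rows at 1-second frequency with 1,000 integer columns,
# i.e. several terabytes of data once materialized.)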
df = dask.datasets.timeseries(
    dtypes={str(i): int for i in range(1_000)},
    freq="1s",
    start="2000-01-01",
    end="2021-01-01",
)
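# Filesystem options forwarded to s3fs (here just pinning the S3 region)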
storage_options = {"config_kwargs": {"region_name": "us-east-1"}}
df.to_parquet(
    "s3://<bucket>/timeseries.parquet",
    engine="pyarrow-dataset",
    overwrite=True,
    storage_options=storage_options,
)
# Read in dataset and do some basic processing
df = dd.read_parquet(
"s3://<bucket>/timeseries.parquet",
engine="pyarrow-dataset",
storage_options=storage_options,
)
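# Row-wise helper: joins the first three column values into a single string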
def func(row: pd.Series) -> str:
    return f"{row[0]}-{row[1]}-{row[2]}"
df["new"] = df.apply(func, axis=1)
# Write processed dataset back to S3
df.to_parquet(
"s3://<bucket>/timeseries-processed.parquet",
engine="pyarrow-dataset",
overwrite=True,
storage_options=storage_options,
)
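As a quick sanity check, here is a minimal sketch (assuming the same bucket and storage options as above, and not part of the original snippet) that reads the processed dataset back and inspects a few rows:

# Read the processed dataset back and look at a handful of rows
processed = dd.read_parquet(
    "s3://<bucket>/timeseries-processed.parquet",
    engine="pyarrow-dataset",
    storage_options=storage_options,
)
print(processed.head())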
All the linked issues have been resolved, closing this in favor of more specific issues.
Woo!