
When overwriting partitioned parquet files, Dask may leave old partitions present.


What happened: Let’s define a pyarrow Dataset as a collection of parquet files.

When overwriting a pyarrow Dataset created by Dask, where the new Dataset has fewer files than the original, only that smaller number of partition files is written, and the old “dangling” files from the previous write are left in the directory.

What you expected to happen: A pyarrow Dataset, when written to disk, should contain the same number of files as the number of partitions in Dask. During an overwrite, I would expect any old files to be removed as part of the write operation. In the example below, I would expect len(files0) == len(files1) to be True.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask.distributed import Client
import numpy as np
import pandas as pd
from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem()

# Create a Dask DataFrame of size (10000, 10) with 5 partitions and write it to local disk
ddf = dd.from_pandas(pd.DataFrame(np.random.randint(low=0, high=100, size=(10000, 10)),
                                 columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]),
                     npartitions=5
                    )
ddf = ddf.reset_index(drop=True)
dd.to_parquet(ddf, "data0.parquet", engine="pyarrow")

# Repartition the DataFrame down to 3 partitions
ddf2 = ddf.repartition(npartitions=3)

# Write the repartitioned DataFrame to a new location, then overwrite the existing location,
# and read the new location back
dd.to_parquet(ddf2, "data1.parquet", engine="pyarrow")
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")
ddf3 = dd.read_parquet("data1.parquet", engine="pyarrow")

# Assert that the number of files written to each location is identical
files0 = fs.ls("data0.parquet")
files1 = fs.ls("data1.parquet")
assert len(files0) == len(files1)

Anything else we need to know?:
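A possible workaround for now (a sketch, not something the report itself proposes) is to clear the target directory before writing, so that no stale partition files can survive the overwrite. Continuing from the example above:

# Remove the existing dataset directory before rewriting it, so a write with
# fewer partitions cannot leave stale "dangling" part files behind.
if fs.exists("data0.parquet"):
    fs.rm("data0.parquet", recursive=True)
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")

This relies only on fsspec’s generic exists/rm methods, so the same approach should work for remote filesystems as well.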

Environment:

  • Dask version: 2.30.0
  • Python version: 3.6.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip
  • Pyarrow version:

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

jrbourbeau commented, Nov 20, 2020

I believe this was closed by https://github.com/dask/dask/pull/6825, though feel free to re-open if that’s not the case

hayesgb commented, Nov 10, 2020

A very common ML engineering use case is to overwrite files on a scheduled basis. The fact that Dask silently allows the write to occur, leaving the dangling parquet files, seems problematic. That is why I included the proposal for a way to explicitly request an overwrite.

By analogy, Spark has a similar overwrite mode: it can remove the existing data and rewrite it, or, with the “dynamic” setting, overwrite only the changed partitions.

It looks like appending is only permitted for newly added partitions (in different directories). Is this correct?
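For illustration, the kind of API the proposal above describes might look like the sketch below. The overwrite=True keyword is assumed here to show the intent; the exact name and semantics are whatever ended up in the pull request referenced in the other comment.

import dask.dataframe as dd
import numpy as np
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(np.random.randint(0, 100, size=(1000, 3)), columns=["A", "B", "C"]),
    npartitions=3,
)

# Hypothetical explicit-overwrite flag: the target directory would be cleared
# before writing, so writing fewer partitions leaves no dangling files.
dd.to_parquet(ddf, "data0.parquet", engine="pyarrow", overwrite=True)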
