
When overwriting partitioned parquet files, Dask may leave old partitions present.


What happened: Let’s define a pyarrow Dataset as a collection of parquet files.

When overwriting a pyarrow Dataset created by Dask, where the new Dataset has fewer files than the original, only that smaller number of partition files is written, and the old “dangling” files from the previous write are left in the directory.

What you expected to happen: A pyarrow Dataset, when written to disk, should contain the same number of files as the number of partitions in Dask. During an overwrite, I would expect any old files to be removed as part of the write operation. In the example below, I would expect len(files0) == len(files1) to be True.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask.distributed import Client
import numpy as np
import pandas as pd
from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem()

# Create a Dask DataFrame of size (10000, 10) with 5 partitions and write it to local disk
ddf = dd.from_pandas(pd.DataFrame(np.random.randint(low=0, high=100, size=(10000, 10)),
                                 columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]),
                     npartitions=5
                    )
ddf = ddf.reset_index(drop=True)
dd.to_parquet(ddf, "data0.parquet", engine="pyarrow")

# Repartition the DataFrame down to 3 partitions
ddf2 = ddf.repartition(npartitions=3)

# Write the repartitioned DataFrame to a new location, then overwrite the existing location,
# and read the new location back
dd.to_parquet(ddf2, "data1.parquet", engine="pyarrow")
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")
ddf3 = dd.read_parquet("data1.parquet", engine="pyarrow")

# Assert that the number of files written to each location is identical
files0 = fs.ls("data0.parquet")
files1 = fs.ls("data1.parquet")
assert len(files0) == len(files1)

Anything else we need to know?:
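A possible workaround for now (a sketch, not something the report itself proposes) is to clear the target directory before writing, so that no stale partition files can survive the overwrite. Continuing from the example above:

# Remove the existing dataset directory before rewriting it, so a write with
# fewer partitions cannot leave stale "dangling" part files behind.
if fs.exists("data0.parquet"):
    fs.rm("data0.parquet", recursive=True)
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")

This relies only on fsspec’s generic exists/rm methods, so the same approach should work for remote filesystems as well.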

Environment:

  • Dask version: 2.30.0
  • Python version: 3.6.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip
  • Pyarrow version:

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

jrbourbeau commented, Nov 20, 2020

I believe this was closed by https://github.com/dask/dask/pull/6825, though feel free to re-open if that’s not the case

hayesgb commented, Nov 10, 2020

A very common ML engineering use case is to overwrite files on a scheduled basis. The fact that Dask silently allows the write to occur, leaving the dangling parquet files, seems problematic. That is why I included the proposal for a way to explicitly request an overwrite.

By analogy, Spark has a similar overwrite mode: it can remove the existing data and rewrite it, or, with the “dynamic” setting, overwrite only the changed partitions.

It looks like appending is only permitted for newly added partitions (in different directories). Is this correct?
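For illustration, the kind of API the proposal above describes might look like the sketch below. The overwrite=True keyword is assumed here to show the intent; the exact name and semantics are whatever ended up in the pull request referenced in the other comment.

import dask.dataframe as dd
import numpy as np
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(np.random.randint(0, 100, size=(1000, 3)), columns=["A", "B", "C"]),
    npartitions=3,
)

# Hypothetical explicit-overwrite flag: the target directory would be cleared
# before writing, so writing fewer partitions leaves no dangling files.
dd.to_parquet(ddf, "data0.parquet", engine="pyarrow", overwrite=True)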
