When overwriting partitioned parquet files, Dask may leave old partitions present.
What happened: Let's define a pyarrow Dataset as a collection of parquet files.
When overwriting a pyarrow Dataset created by Dask with a new Dataset that contains fewer files than the original, only that smaller number of partitioned files is overwritten; the old "dangling" files are left in the directory.
What you expected to happen: A pyarrow Dataset, when written to disk, should contain the same number of files as the number of partitions in Dask. During an overwrite, I would expect any old files to be removed. In the example below, I would expect len(files0) == len(files1) to be True.
Minimal Complete Verifiable Example:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from fsspec.implementations.local import LocalFileSystem
fs = LocalFileSystem()
# Create a Dask DataFrame of size (10000, 10) with 5 partitions and write to local
ddf = dd.from_pandas(
    pd.DataFrame(np.random.randint(low=0, high=100, size=(10000, 10)),
                 columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]),
    npartitions=5,
)
ddf = ddf.reset_index(drop=True)
dd.to_parquet(ddf, "data0.parquet", engine="pyarrow")
# Repartition the DataFrame down to 3 partitions
ddf2 = ddf.repartition(npartitions=3)
# Write the repartitioned frame to a new location, then overwrite the existing location
dd.to_parquet(ddf2, "data1.parquet", engine="pyarrow")
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")
# Read the freshly written dataset back
ddf3 = dd.read_parquet("data1.parquet", engine="pyarrow")
# Assert that the number of files in each directory is identical
files0 = fs.ls("data0.parquet")
files1 = fs.ls("data1.parquet")
assert len(files0) == len(files1)  # fails: data0.parquet still holds files from the 5-partition write
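One workaround (a minimal sketch reusing fs and ddf2 from the example above) is to clear the target directory explicitly before rewriting it, so no files from a previous, larger write can survive:

# Remove the stale dataset directory before writing the new one
if fs.exists("data0.parquet"):
    fs.rm("data0.parquet", recursive=True)
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow")
# Both directories now contain the same number of files
assert len(fs.ls("data0.parquet")) == len(fs.ls("data1.parquet"))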
Anything else we need to know?:
Environment:
- Dask version: 2.30.0
- Python version: 3.6.9
- Operating System: Linux
- Install method (conda, pip, source): pip
- Pyarrow version:
I believe this was closed by https://github.com/dask/dask/pull/6825, though feel free to re-open if that's not the case.
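Newer Dask releases expose an explicit overwrite= keyword on dd.to_parquet; a minimal sketch, assuming your installed version supports that keyword:

# overwrite=True removes the target directory before writing,
# so the dangling files from the earlier 5-partition write disappear
dd.to_parquet(ddf2, "data0.parquet", engine="pyarrow", overwrite=True)
assert len(fs.ls("data0.parquet")) == len(fs.ls("data1.parquet"))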
A very common ML engineering use case is overwriting files on a scheduled basis. The fact that Dask silently allows the write to occur, leaving the dangling parquet files behind, seems problematic. I included the proposal for a way to explicitly request an overwrite.
By analogy, Spark has a similar overwrite mode, which can either remove the directory and rewrite it or, with "dynamic" partition overwrite, replace only the partitions being written.
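For comparison, a minimal PySpark sketch of those two Spark behaviors (the DataFrame contents and paths here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x"), (2, "y")], ["A", "B"])  # illustrative frame

# Static overwrite: the whole target directory is removed and rewritten
df.write.mode("overwrite").parquet("data0.parquet")

# Dynamic partition overwrite: only the partitions present in df are replaced;
# partition directories not touched by this write are left in place
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").partitionBy("A").parquet("data0.parquet")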
It looks like appending is only permitted for newly added partitions (in different directories). Is this correct?
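If that is the case, a hedged sketch of the append path (assuming to_parquet's append= and partition_on= keywords are supported by the pyarrow engine in your version; the data2.parquet path is illustrative, and ignore_divisions is needed because the frames have overlapping indexes):

# partition_on gives each distinct value of "A" its own subdirectory;
# appending rows whose "A" values are new only creates new directories
dd.to_parquet(ddf, "data2.parquet", engine="pyarrow", partition_on=["A"])
new = ddf.assign(A=ddf["A"] + 100)  # partition values that cannot collide
dd.to_parquet(new, "data2.parquet", engine="pyarrow",
              partition_on=["A"], append=True, ignore_divisions=True)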