`to_parquet` fails with dataframe built from futures
What happened:
Writing a dask dataframe built from pandas futures with the `to_parquet` method fails.
What you expected to happen:
The dataframe is written successfully (the `to_csv` method works; see the sketch after the stacktrace).
Minimal Complete Verifiable Example:
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

def create_partition(i):
    return pd.DataFrame({'x': [i]})

client = Client()
# Futures are accepted by dd.from_delayed, but writing the result to parquet fails.
parts = [client.submit(create_partition, i) for i in range(2)]
dd.from_delayed(parts).to_parquet('output')
Stacktrace
distributed.worker - WARNING - Compute Failed
Function: check_meta
args: ('create_partition-d32679d122607182bc4efd4c9af07f3a', Empty DataFrame
Columns: [x]
Index: [], 'from_delayed')
kwargs: {}
Exception: ValueError('Metadata mismatch found in `from_delayed`.\n\nExpected partition of type `pandas.core.frame.DataFrame` but got `str`')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_30101/3007309697.py in <module>
9 client = Client()
10 parts = [client.submit(create_partition, i) for i in range(2)]
---> 11 dd.from_delayed(parts).to_parquet('output')
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
4538 from .io import to_parquet
4539
-> 4540 return to_parquet(self, path, *args, **kwargs)
4541
4542 def to_orc(self, path, *args, **kwargs):
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
723 if compute:
724 if write_metadata_file:
--> 725 return compute_as_if_collection(
726 DataFrame, graph, (final_name, 0), **compute_kwargs
727 )
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/base.py in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
313 schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
314 dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 315 return schedule(dsk2, keys, **kwargs)
316
317
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2687 should_rejoin = False
2688 try:
-> 2689 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2690 finally:
2691 for f in futures.values():
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1964 else:
1965 local_worker = None
-> 1966 return self.sync(
1967 self._gather,
1968 futures,
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
858 return future
859 else:
--> 860 return sync(
861 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
862 )
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
324 if error[0]:
325 typ, exc, tb = error[0]
--> 326 raise exc.with_traceback(tb)
327 else:
328 return result[0]
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/utils.py in f()
307 if callback_timeout is not None:
308 future = asyncio.wait_for(future, callback_timeout)
--> 309 result[0] = yield future
310 except Exception:
311 error[0] = sys.exc_info()
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1829 exc = CancelledError(key)
1830 else:
-> 1831 raise exception.with_traceback(traceback)
1832 raise exc
1833 if errors == "skip":
~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/utils.py in check_meta()
404 )
405
--> 406 raise ValueError(
407 "Metadata mismatch found%s.\n\n"
408 "%s" % ((" in `%s`" % funcname if funcname else ""), errmsg)
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `pandas.core.frame.DataFrame` but got `str`
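For comparison, a minimal sketch of the `to_csv` path mentioned above, which succeeds on the same dataframe (the 'output-*.csv' glob is a hypothetical naming scheme; '*' is replaced by the partition index):

# Continuing from the reproducer: writing the same dataframe to CSV works.
dd.from_delayed(parts).to_csv('output-*.csv')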
Anything else we need to know?: I believe this was introduced in #7968.
Environment:
- Dask version: 2021.9.1
- Python version: 3.9.7
- Operating System: Ubuntu
- Install method (conda, pip, source): conda
Found the bug; a fix is in #8199.
Ah, sorry, I had assumed that since the `from_delayed` call worked the df was fine; I hadn't actually tested the `to_parquet` call. I think this is related to #8173, and I'll start on a fix today. For now, though, you can manually disable graph optimizations for the `to_parquet` call and that should fix things (a sketch follows below). Note that wrapping futures in `delayed` isn't necessary; that happens automatically in `dd.from_delayed`. I've checked, and this does work (at least on main).