`to_parquet` fails with dataframe built from futures

What happened: Trying to write a dask dataframe built from pandas futures using the to_parquet method fails.

What you expected to happen: Writing the dataframe successfully (the to_csv method works).

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client


def create_partition(i):
    return pd.DataFrame({'x': [i]})

client = Client()
parts = [client.submit(create_partition, i) for i in range(2)]
dd.from_delayed(parts).to_parquet('output')

Stacktrace:

distributed.worker - WARNING - Compute Failed
Function:  check_meta
args:      ('create_partition-d32679d122607182bc4efd4c9af07f3a', Empty DataFrame
Columns: [x]
Index: [], 'from_delayed')
kwargs:    {}
Exception: ValueError('Metadata mismatch found in `from_delayed`.\n\nExpected partition of type `pandas.core.frame.DataFrame` but got `str`')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_30101/3007309697.py in <module>
      9 client = Client()
     10 parts = [client.submit(create_partition, i) for i in range(2)]
---> 11 dd.from_delayed(parts).to_parquet('output')

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
   4538         from .io import to_parquet
   4539 
-> 4540         return to_parquet(self, path, *args, **kwargs)
   4541 
   4542     def to_orc(self, path, *args, **kwargs):

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
    723     if compute:
    724         if write_metadata_file:
--> 725             return compute_as_if_collection(
    726                 DataFrame, graph, (final_name, 0), **compute_kwargs
    727             )

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/base.py in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
    313     schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
    314     dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 315     return schedule(dsk2, keys, **kwargs)
    316 
    317 

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2687                     should_rejoin = False
   2688             try:
-> 2689                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2690             finally:
   2691                 for f in futures.values():

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1964             else:
   1965                 local_worker = None
-> 1966             return self.sync(
   1967                 self._gather,
   1968                 futures,

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    858             return future
    859         else:
--> 860             return sync(
    861                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    862             )

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    324     if error[0]:
    325         typ, exc, tb = error[0]
--> 326         raise exc.with_traceback(tb)
    327     else:
    328         return result[0]

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/utils.py in f()
    307             if callback_timeout is not None:
    308                 future = asyncio.wait_for(future, callback_timeout)
--> 309             result[0] = yield future
    310         except Exception:
    311             error[0] = sys.exc_info()

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1829                             exc = CancelledError(key)
   1830                         else:
-> 1831                             raise exception.with_traceback(traceback)
   1832                         raise exc
   1833                     if errors == "skip":

~/miniconda3/envs/mlforecast/lib/python3.9/site-packages/dask/dataframe/utils.py in check_meta()
    404         )
    405 
--> 406     raise ValueError(
    407         "Metadata mismatch found%s.\n\n"
    408         "%s" % ((" in `%s`" % funcname if funcname else ""), errmsg)

ValueError: Metadata mismatch found in `from_delayed`.

Expected partition of type `pandas.core.frame.DataFrame` but got `str`
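
The message above shows `check_meta` receiving the string `'create_partition-d32679d122607182bc4efd4c9af07f3a'`, which is the future's key rather than its result. As a rough illustration of how that symptom can arise, here is a toy sketch in plain Python. It is not dask's scheduler, and the names `toy_get`, `check_is_dict`, and `make_partition` are hypothetical. It assumes a dict-based graph in which an argument is replaced by a computed value only when it names a key present in the graph, and is otherwise passed through as a literal.

```python
# Toy model of a dict-based task graph; NOT dask's real scheduler.
# Assumption: a task is a tuple (func, *args). Any string argument that is a
# key in the graph is replaced by that key's computed value; anything else
# is passed to the function unchanged.

def toy_get(graph, key):
    """Evaluate `key` in `graph` recursively."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            toy_get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # plain data stored in the graph


def check_is_dict(part):
    """Stand-in for a metadata check: expects a dict, not a key string."""
    if not isinstance(part, dict):
        raise ValueError(
            f"Expected partition of type `dict` but got `{type(part).__name__}`"
        )
    return part


def make_partition(i):
    """Stand-in for a function that builds one partition."""
    return {"x": [i]}


# Complete graph: "part-0" is defined, so the check receives its value.
good = {
    "part-0": (make_partition, 0),
    "checked-0": (check_is_dict, "part-0"),
}

# Broken graph: the entry for "part-0" is missing, so the check receives
# the bare string "part-0" instead of the partition.
bad = {"checked-0": (check_is_dict, "part-0")}
```

Evaluating `toy_get(good, "checked-0")` returns the partition, while the same call on `bad` raises a `ValueError` reporting `str`, which mirrors the shape of the message above. Whether this is the exact mechanism behind the reported failure is not established here.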

Anything else we need to know?: I believe this was introduced in #7968

Environment:

  • Dask version: 2021.9.1
  • Python version: 3.9.7
  • Operating System: Ubuntu
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
jcrist commented, Sep 29, 2021

Found the bug, a fix is in #8199.

1 reaction
jcrist commented, Sep 29, 2021

Ah, sorry, I had assumed that since the from_delayed call worked the df was fine; I hadn’t actually tested the to_parquet call. I think this is related to #8173, and I’ll start on a fix today. For now, you can manually disable graph optimizations for the to_parquet call, which should fix things. Note that wrapping the futures in delayed isn’t necessary; dd.from_delayed does that automatically.

import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

def create_partition(i):
    return pd.DataFrame({'x': [i]})


if __name__ == "__main__":
    client = Client()
    parts = [client.submit(create_partition, i) for i in range(2)]

    df = dd.from_delayed(parts, meta={"x": int})
    # Skip computing automatically, then call `compute` manually with optimizations disabled
    df.to_parquet("output", compute=False).compute(optimize_graph=False)

I’ve checked, and this does work (at least on main).
