
ValueError: Metadata mismatch: Expected partition of type `DataFrame` but got `DataFrame`

See original GitHub issue

I’m getting an error that suggests a possible bug in dask’s validation code:

ValueError: Metadata mismatch found in from_delayed. Expected partition of type DataFrame but got DataFrame

Since the expected type reads exactly the same as the actual type, yet the check still fails, I suspect the dask validation routine is not behaving correctly.
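The message looks self-contradictory because the check appears to print only the class name, and both pandas and dask name their class `DataFrame`: the partition expected here is a `pandas.core.frame.DataFrame`, while the one received is a `dask.dataframe.core.DataFrame`. A minimal, hypothetical illustration (stand-in classes, not dask’s or pandas’ actual ones):

```python
# Two distinct classes may share the same __name__, so an error message
# that prints only type(x).__name__ cannot tell them apart.
# (Hypothetical stand-ins; not dask's or pandas' real classes.)
PandasLikeDF = type("DataFrame", (), {})
DaskLikeDF = type("DataFrame", (), {})

assert PandasLikeDF.__name__ == DaskLikeDF.__name__ == "DataFrame"
assert PandasLikeDF is not DaskLikeDF  # yet they are different types

print(f"Expected partition of type `{PandasLikeDF.__name__}` "
      f"but got `{DaskLikeDF.__name__}`")
```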

This happens when trying to combine several futures for csv loads. The code to reproduce this error is as follows:

from dask.distributed import Client, progress, get_worker
import dask.dataframe as dd
from dask import delayed

client = Client(<scheduling server host:port>)
load_tasks = {}
task_worker_map = {
    "yellow_tripdata_2009-01.csv": "<worker 1 server host:port>",
    "yellow_tripdata_2017-01.csv": "<worker 1 server host:port>",
    "yellow_tripdata_2017-02.csv": "<worker 2 server host:port>",
}

for file_to_load, worker_to_do in task_worker_map.items():
    # note: dd.read_csv is lazy, so each future resolves to a dask
    # DataFrame rather than a pandas DataFrame
    fut = client.submit(dd.read_csv, f"/home/saif/{file_to_load}",
                        workers={worker_to_do})
    load_tasks[file_to_load] = {
        "file": file_to_load,
        "submitted-to": worker_to_do,
        "future": fut,
    }

futurez = [load_tasks[atask]["future"] for atask in load_tasks.keys()]
df_overarching = dd.from_delayed([delayed(f) for f in futurez])
df_overarching.head()

The full stack trace is:


> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-17-59d0937d69ae> in <module>
> ----> 1 df_overarching.head()
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/dataframe/core.py in head(self, n, npartitions, compute)
>     898 
>     899         if compute:
> --> 900             result = result.compute()
>     901         return result
>     902 
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
>     154         dask.base.compute
>     155         """
> --> 156         (result,) = compute(self, traverse=False, **kwargs)
>     157         return result
>     158 
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
>     396     keys = [x.__dask_keys__() for x in collections]
>     397     postcomputes = [x.__dask_postcompute__() for x in collections]
> --> 398     results = schedule(dsk, keys, **kwargs)
>     399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
>     400 
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
>    2566                     should_rejoin = False
>    2567             try:
> -> 2568                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
>    2569             finally:
>    2570                 for f in futures.values():
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
>    1820                 direct=direct,
>    1821                 local_worker=local_worker,
> -> 1822                 asynchronous=asynchronous,
>    1823             )
>    1824 
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
>     751             return future
>     752         else:
> --> 753             return sync(self.loop, func, *args, **kwargs)
>     754 
>     755     def __repr__(self):
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
>     329             e.wait(10)
>     330     if error[0]:
> --> 331         six.reraise(*error[0])
>     332     else:
>     333         return result[0]
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
>     691             if value.__traceback__ is not tb:
>     692                 raise value.with_traceback(tb)
> --> 693             raise value
>     694         finally:
>     695             value = None
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/utils.py in f()
>     314             if timeout is not None:
>     315                 future = gen.with_timeout(timedelta(seconds=timeout), future)
> --> 316             result[0] = yield future
>     317         except Exception as exc:
>     318             error[0] = sys.exc_info()
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/tornado/gen.py in run(self)
>     727 
>     728                     try:
> --> 729                         value = future.result()
>     730                     except Exception:
>     731                         exc_info = sys.exc_info()
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/tornado/gen.py in run(self)
>     734                     if exc_info is not None:
>     735                         try:
> --> 736                             yielded = self.gen.throw(*exc_info)  # type: ignore
>     737                         finally:
>     738                             # Break up a reference to itself
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
>    1651                             six.reraise(CancelledError, CancelledError(key), None)
>    1652                         else:
> -> 1653                             six.reraise(type(exception), exception, traceback)
>    1654                     if errors == "skip":
>    1655                         bad_keys.add(key)
> 
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
>     690                 value = tp()
>     691             if value.__traceback__ is not tb:
> --> 692                 raise value.with_traceback(tb)
>     693             raise value
>     694         finally:
> 
> /home/saif/miniconda3/envs/churn/lib/python3.6/site-packages/dask/dataframe/utils.py in check_meta()
> 
> ValueError: Metadata mismatch found in `from_delayed`.
> 
> Expected partition of type `DataFrame` but got `DataFrame`

This was run on Ubuntu 16 with Python 3.6, using dask==1.2.2 and distributed==1.28.0.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
jrbourbeau commented, May 18, 2019

#4819 improves the error message that is raised in the originally posted example

0 reactions
TomAugspurger commented, May 21, 2019

Closing. I don’t immediately see an easy way to error / warn on this behavior earlier. And hopefully the improved error message would lead the user to see that passing dd.read_csv there is inappropriate.


Top Results From Across the Web

  • DASK Metadata mismatch found in 'from_delayed' JSON file
    I have a dataset in the json format. I loaded the data via dd.read_json to dataframe and everything goes well. The problem occurred...
  • Metadata mismatch for dask dataframe after using filter()
    I noticed weird behaviour when filtering an azureml TabularDataset instance using filter() and converting it to a dask dataframe afterwards.
  • Dask DataFrame - parallelized pandas
    dataframe module implements a “blocked parallel” DataFrame object that looks and feels like the pandas API, but for parallel and distributed workflows. One...
  • Reduce memory usage with Dask dtypes - Coiled
    This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column...
  • Dealing with the hack for dta - Dask Forum - Discourse
    Metadata mismatch found in from_delayed. Partition type: pandas.core.frame.DataFrame. Is there a way to read the dta without iterating?
