ValueError: Metadata mismatch: Expected partition of type `DataFrame` but got `DataFrame`
I’m getting an error that suggests a possible bug in dask’s validation code:
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
The expected type and the actual type appear identical, yet the error is still raised, which leads me to believe that the dask validation routine may not be working correctly.
This happens when trying to combine several futures from CSV loads. The code to reproduce the error is as follows:
from dask.distributed import Client, progress, get_worker
import dask.dataframe as dd
from dask import delayed

client = Client("<scheduling server host:port>")

load_tasks = {}
task_worker_map = {
    "yellow_tripdata_2009-01.csv": "<worker 1 server host:port>",
    "yellow_tripdata_2017-01.csv": "<worker 1 server host:port>",
    "yellow_tripdata_2017-02.csv": "<worker 2 server host:port>",
}

for file_to_load, worker_to_do in task_worker_map.items():
    fut = client.submit(dd.read_csv, f"/home/saif/{file_to_load}", workers={worker_to_do})
    load_tasks[file_to_load] = {
        "file": file_to_load,
        "submitted-to": worker_to_do,
        "future": fut,
    }

futurez = [load_tasks[atask]["future"] for atask in load_tasks]
df_overarching = dd.from_delayed([delayed(f) for f in futurez])
df_overarching.head()
The full stack trace is:
>
> -----------------------
> ValueError Traceback (most recent call last)
> <ipython-input-17-59d0937d69ae> in <module>
> ----> 1 df_overarching.head()
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/dataframe/core.py in head(self, n, npartitions, compute)
> 898
> 899 if compute:
> --> 900 result = result.compute()
> 901 return result
> 902
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
> 154 dask.base.compute
> 155 """
> --> 156 (result,) = compute(self, traverse=False, **kwargs)
> 157 return result
> 158
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
> 396 keys = [x.__dask_keys__() for x in collections]
> 397 postcomputes = [x.__dask_postcompute__() for x in collections]
> --> 398 results = schedule(dsk, keys, **kwargs)
> 399 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
> 400
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
> 2566 should_rejoin = False
> 2567 try:
> -> 2568 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
> 2569 finally:
> 2570 for f in futures.values():
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
> 1820 direct=direct,
> 1821 local_worker=local_worker,
> -> 1822 asynchronous=asynchronous,
> 1823 )
> 1824
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
> 751 return future
> 752 else:
> --> 753 return sync(self.loop, func, *args, **kwargs)
> 754
> 755 def __repr__(self):
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
> 329 e.wait(10)
> 330 if error[0]:
> --> 331 six.reraise(*error[0])
> 332 else:
> 333 return result[0]
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
> 691 if value.__traceback__ is not tb:
> 692 raise value.with_traceback(tb)
> --> 693 raise value
> 694 finally:
> 695 value = None
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/utils.py in f()
> 314 if timeout is not None:
> 315 future = gen.with_timeout(timedelta(seconds=timeout), future)
> --> 316 result[0] = yield future
> 317 except Exception as exc:
> 318 error[0] = sys.exc_info()
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/tornado/gen.py in run(self)
> 727
> 728 try:
> --> 729 value = future.result()
> 730 except Exception:
> 731 exc_info = sys.exc_info()
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/tornado/gen.py in run(self)
> 734 if exc_info is not None:
> 735 try:
> --> 736 yielded = self.gen.throw(*exc_info) # type: ignore
> 737 finally:
> 738 # Break up a reference to itself
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
> 1651 six.reraise(CancelledError, CancelledError(key), None)
> 1652 else:
> -> 1653 six.reraise(type(exception), exception, traceback)
> 1654 if errors == "skip":
> 1655 bad_keys.add(key)
>
> ~/miniconda3/envs/churn/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
> 690 value = tp()
> 691 if value.__traceback__ is not tb:
> --> 692 raise value.with_traceback(tb)
> 693 raise value
> 694 finally:
>
> /home/saif/miniconda3/envs/churn/lib/python3.6/site-packages/dask/dataframe/utils.py in check_meta()
>
> ValueError: Metadata mismatch found in `from_delayed`.
>
> Expected partition of type `DataFrame` but got `DataFrame`
This was run on Ubuntu 16 with Python 3.6, using dask==1.2.2 and distributed==1.28.0.
Issue Analytics
- State: closed
- Created: 4 years ago
- Reactions: 1
- Comments: 8 (7 by maintainers)
#4819 improves the error message that is raised in the originally posted example.

Closing. I don’t immediately see an easy way to error / warn on this behavior earlier, and hopefully the improved error message will lead the user to see that passing `dd.read_csv` there is inappropriate.