Unable to read multiple HDF5 files after dask update
What happened: after upgrading dask to >= 2021.6.1, reading multiple files with dd.read_hdf("*.hdf5") fails with:
5 hdf5files="/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
----> 6 ddf=dd.read_hdf(hdf5files,key="G18")
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in read_hdf(pattern, key, start, stop, columns, chunksize, sorted_index, lock, mode)
423
424 # Build parts
--> 425 parts, divisions = _build_parts(
426 paths, key, start, stop, chunksize, sorted_index, mode
427 )
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _build_parts(paths, key, start, stop, chunksize, sorted_index, mode)
449 for path in paths:
450
--> 451 keys, stops, divisions = _get_keys_stops_divisions(
452 path, key, stop, sorted_index, chunksize, mode
453 )
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _get_keys_stops_divisions(path, key, stop, sorted_index, chunksize, mode)
520 stops.append(storer.nrows)
521 elif stop > storer.nrows:
--> 522 raise ValueError(
523 "Stop keyword exceeds dataset number "
524 "of rows ({})".format(storer.nrows)
ValueError: Stop keyword exceeds dataset number of rows (3697)
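A plausible reading of this traceback (an assumption on my part, not confirmed against the dask source): a stop bound derived from one file is being checked against another, smaller file's row count. A minimal sketch of that pattern, with an invented first row count; 3697 matches the error above:

# Hypothetical sketch of the failing pattern, not the actual dask code.
nrows_per_file = [10000, 3697]
stop = nrows_per_file[0]      # bound inferred while scanning the first file
for nrows in nrows_per_file:
    if stop > nrows:          # the second, smaller file trips the check
        raise ValueError(
            "Stop keyword exceeds dataset number of rows ({})".format(nrows)
        )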
What you expected to happen: all matching files are read, as in previous versions.
Minimal Complete Verifiable Example:

pip install dask==2021.06.1 distributed==2021.06.1
# Put your MCVE code here
import dask.dataframe as dd

# only 2 files match this pattern
hdf5files = "/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
ddf = dd.read_hdf(hdf5files, key="G18")
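The paths above are site-specific, so here is a self-contained sketch of the reproducer (file names, row counts, and column layout are illustrative; writing HDF5 this way requires the PyTables package):

import pandas as pd
import dask.dataframe as dd

# Two HDF5 files with different row counts under the same key.
pd.DataFrame({"x": range(10000)}).to_hdf("part1.hdf5", key="G18", format="table")
pd.DataFrame({"x": range(3697)}).to_hdf("part2.hdf5", key="G18", format="table")

# Reportedly works on dask 2021.6.0; raises
# "ValueError: Stop keyword exceeds dataset number of rows (3697)" on 2021.6.1+.
ddf = dd.read_hdf("part*.hdf5", key="G18")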
Anything else we need to know?:
pip install dask==2021.06.1 distributed==2021.06.1
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting dask==2021.06.1
Downloading dask-2021.6.1-py3-none-any.whl (973 kB)
|████████████████████████████████| 973 kB 4.5 MB/s
Collecting distributed==2021.06.1
Downloading distributed-2021.6.1-py3-none-any.whl (722 kB)
|████████████████████████████████| 722 kB 113.7 MB/s
Requirement already satisfied: fsspec>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (2021.7.0)
Requirement already satisfied: toolz>=0.8.2 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (0.11.1)
Requirement already satisfied: cloudpickle>=1.1.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.6.0)
Requirement already satisfied: pyyaml in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (5.4.1)
Requirement already satisfied: partd>=0.3.10 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.2.0)
Requirement already satisfied: msgpack>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.0.2)
Requirement already satisfied: tornado>=6.0.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (6.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.4.0)
Requirement already satisfied: setuptools in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (52.0.0.post20210125)
Requirement already satisfied: click>=6.6 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (8.0.1)
Requirement already satisfied: tblib>=1.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.7.0)
Requirement already satisfied: zict>=0.1.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.0.0)
Requirement already satisfied: psutil>=5.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (5.8.0)
Requirement already satisfied: locket in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from partd>=0.3.10->dask==2021.06.1) (0.2.1)
Requirement already satisfied: heapdict in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from zict>=0.1.3->distributed==2021.06.1) (1.0.1)
Installing collected packages: dask, distributed
Attempting uninstall: dask
Found existing installation: dask 2021.6.0
Uninstalling dask-2021.6.0:
Successfully uninstalled dask-2021.6.0
Attempting uninstall: distributed
Found existing installation: distributed 2021.6.0
Uninstalling distributed-2021.6.0:
Successfully uninstalled distributed-2021.6.0
Successfully installed dask-2021.6.1 distributed-2021.6.1
Environment:
- Dask version: dask 2021.6.1 / distributed 2021.6.1 and above fail (2021.6.0 worked)
- Python version: 3.8.x
- Operating System: Ubuntu 18.x
- Install method (conda, pip, source):
pip install dask==2021.06.1 distributed==2021.06.1

It will be one partition per file, so I think it should be fine. But the fix was also pretty trivial, so I'm hoping we can get it into this release.
Interesting, thanks for the workaround, but wouldn't it explode my memory? And how does the concat manage the partitions, assuming you have 100K files and only 16 workers?
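On the memory question: both dd.read_hdf and dd.concat are lazy, so a per-file read followed by a concat only stitches task graphs together; no data is loaded until compute time, when workers materialize a few partitions at a time. The workaround itself isn't quoted in this thread, but a sketch of that per-file pattern might look like:

from glob import glob
import dask.dataframe as dd

# Build one lazy dataframe per file, then concatenate the graphs.
# With 100K files this yields on the order of 100K partitions, but the
# 16 workers only hold a handful of partitions in memory during compute().
files = sorted(glob("/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"))
ddf = dd.concat([dd.read_hdf(f, key="G18") for f in files])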