Unable to read multiple HDF5 files after dask update
What happened: after upgrading dask to >= 2021.6.1, reading multiple files with dd.read_hdf("*.hdf5") fails with:
5 hdf5files="/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
----> 6 ddf=dd.read_hdf(hdf5files,key="G18")
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in read_hdf(pattern, key, start, stop, columns, chunksize, sorted_index, lock, mode)
423
424 # Build parts
--> 425 parts, divisions = _build_parts(
426 paths, key, start, stop, chunksize, sorted_index, mode
427 )
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _build_parts(paths, key, start, stop, chunksize, sorted_index, mode)
449 for path in paths:
450
--> 451 keys, stops, divisions = _get_keys_stops_divisions(
452 path, key, stop, sorted_index, chunksize, mode
453 )
~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _get_keys_stops_divisions(path, key, stop, sorted_index, chunksize, mode)
520 stops.append(storer.nrows)
521 elif stop > storer.nrows:
--> 522 raise ValueError(
523 "Stop keyword exceeds dataset number "
524 "of rows ({})".format(storer.nrows)
ValueError: Stop keyword exceeds dataset number of rows (3697)
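A plausible reading of this traceback (an assumption on my part, not confirmed against the dask source): a stop bound derived from one file is being checked against another, smaller file's row count. A minimal sketch of that pattern, with an invented first row count; 3697 matches the error above:

# Hypothetical sketch of the failing pattern, not the actual dask code.
nrows_per_file = [10000, 3697]
stop = nrows_per_file[0]      # bound inferred while scanning the first file
for nrows in nrows_per_file:
    if stop > nrows:          # the second, smaller file trips the check
        raise ValueError(
            "Stop keyword exceeds dataset number of rows ({})".format(nrows)
        )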
What you expected to happen: all matching files are read, as in previous versions.
Minimal Complete Verifiable Example:

pip install dask==2021.06.1 distributed==2021.06.1
# Put your MCVE code here
import dask.dataframe as dd

# only 2 files match this pattern
hdf5files = "/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
ddf = dd.read_hdf(hdf5files, key="G18")
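The paths above are site-specific, so here is a self-contained sketch of the reproducer (file names, row counts, and column layout are illustrative; writing HDF5 this way requires the PyTables package):

import pandas as pd
import dask.dataframe as dd

# Two HDF5 files with different row counts under the same key.
pd.DataFrame({"x": range(10000)}).to_hdf("part1.hdf5", key="G18", format="table")
pd.DataFrame({"x": range(3697)}).to_hdf("part2.hdf5", key="G18", format="table")

# Reportedly works on dask 2021.6.0; raises
# "ValueError: Stop keyword exceeds dataset number of rows (3697)" on 2021.6.1+.
ddf = dd.read_hdf("part*.hdf5", key="G18")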
Anything else we need to know?:
pip install dask==2021.06.1 distributed==2021.06.1
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting dask==2021.06.1
Downloading dask-2021.6.1-py3-none-any.whl (973 kB)
|████████████████████████████████| 973 kB 4.5 MB/s
Collecting distributed==2021.06.1
Downloading distributed-2021.6.1-py3-none-any.whl (722 kB)
|████████████████████████████████| 722 kB 113.7 MB/s
Requirement already satisfied: fsspec>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (2021.7.0)
Requirement already satisfied: toolz>=0.8.2 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (0.11.1)
Requirement already satisfied: cloudpickle>=1.1.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.6.0)
Requirement already satisfied: pyyaml in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (5.4.1)
Requirement already satisfied: partd>=0.3.10 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.2.0)
Requirement already satisfied: msgpack>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.0.2)
Requirement already satisfied: tornado>=6.0.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (6.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.4.0)
Requirement already satisfied: setuptools in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (52.0.0.post20210125)
Requirement already satisfied: click>=6.6 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (8.0.1)
Requirement already satisfied: tblib>=1.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.7.0)
Requirement already satisfied: zict>=0.1.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.0.0)
Requirement already satisfied: psutil>=5.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (5.8.0)
Requirement already satisfied: locket in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from partd>=0.3.10->dask==2021.06.1) (0.2.1)
Requirement already satisfied: heapdict in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from zict>=0.1.3->distributed==2021.06.1) (1.0.1)
Installing collected packages: dask, distributed
Attempting uninstall: dask
Found existing installation: dask 2021.6.0
Uninstalling dask-2021.6.0:
Successfully uninstalled dask-2021.6.0
Attempting uninstall: distributed
Found existing installation: distributed 2021.6.0
Uninstalling distributed-2021.6.0:
Successfully uninstalled distributed-2021.6.0
Successfully installed dask-2021.6.1 distributed-2021.6.1
Environment:
- Dask version: dask 2021.6.1 / distributed 2021.6.1 and above fail (2021.6.0 worked)
- Python version: 3.8.x
- Operating System: Ubuntu 18.x
- Install method (conda, pip, source):
pip install dask==2021.06.1 distributed==2021.06.1

It will be one partition per file, so I think it should be fine. But the fix was also pretty trivial, so I'm hoping we can get it into this release.
Interesting, thanks for the workaround, but wouldn't it explode my memory? And how does the concat manage the partitions, assuming you have 100K files and only 16 workers?
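On the memory question: both dd.read_hdf and dd.concat are lazy, so a per-file read followed by a concat only stitches task graphs together; no data is loaded until compute time, when workers materialize a few partitions at a time. The workaround itself isn't quoted in this thread, but a sketch of that per-file pattern might look like:

from glob import glob
import dask.dataframe as dd

# Build one lazy dataframe per file, then concatenate the graphs.
# With 100K files this yields on the order of 100K partitions, but the
# 16 workers only hold a handful of partitions in memory during compute().
files = sorted(glob("/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"))
ddf = dd.concat([dd.read_hdf(f, key="G18") for f in files])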