
unable to read multiple hdfs after dask update


What happened: after upgrading Dask to >=2021.6.1, reading with dd.read_hdf("*.hdf5") fails with:

      5 hdf5files="/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
----> 6 ddf=dd.read_hdf(hdf5files,key="G18")

~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in read_hdf(pattern, key, start, stop, columns, chunksize, sorted_index, lock, mode)
    423 
    424     # Build parts
--> 425     parts, divisions = _build_parts(
    426         paths, key, start, stop, chunksize, sorted_index, mode
    427     )

~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _build_parts(paths, key, start, stop, chunksize, sorted_index, mode)
    449     for path in paths:
    450 
--> 451         keys, stops, divisions = _get_keys_stops_divisions(
    452             path, key, stop, sorted_index, chunksize, mode
    453         )

~/.conda/envs/py38dask2/lib/python3.8/site-packages/dask/dataframe/io/hdf.py in _get_keys_stops_divisions(path, key, stop, sorted_index, chunksize, mode)
    520                 stops.append(storer.nrows)
    521             elif stop > storer.nrows:
--> 522                 raise ValueError(
    523                     "Stop keyword exceeds dataset number "
    524                     "of rows ({})".format(storer.nrows)

ValueError: Stop keyword exceeds dataset number of rows (3697)

What you expected to happen: all data from the matching files is read, as in earlier versions.

Minimal Complete Verifiable Example (after pip install dask==2021.06.1 distributed==2021.06.1):

import dask.dataframe as dd

# only 2 files match this pattern
hdf5files = "/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"
ddf = dd.read_hdf(hdf5files, key="G18")
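
The traceback suggests that read_hdf derives a stop value from the first file's row count and then applies it to every file matched by the glob, so the call fails as soon as a later file has fewer rows (3697 here). A self-contained reproduction under that assumption (the paths, column name, and row counts below are made up for illustration):

import pandas as pd
import dask.dataframe as dd

# Two table-format HDF5 files under the same key, with different row counts;
# the second file is shorter than the first.
pd.DataFrame({"x": range(5000)}).to_hdf("/tmp/repro_0.hdf5", key="G18", format="table")
pd.DataFrame({"x": range(3697)}).to_hdf("/tmp/repro_1.hdf5", key="G18", format="table")

# On dask 2021.6.1 this raises:
#   ValueError: Stop keyword exceeds dataset number of rows (3697)
ddf = dd.read_hdf("/tmp/repro_*.hdf5", key="G18")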

Anything else we need to know?:

pip install dask==2021.06.1 distributed==2021.06.1
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting dask==2021.06.1
  Downloading dask-2021.6.1-py3-none-any.whl (973 kB)
     |████████████████████████████████| 973 kB 4.5 MB/s
Collecting distributed==2021.06.1
  Downloading distributed-2021.6.1-py3-none-any.whl (722 kB)
     |████████████████████████████████| 722 kB 113.7 MB/s
Requirement already satisfied: fsspec>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (2021.7.0)
Requirement already satisfied: toolz>=0.8.2 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (0.11.1)
Requirement already satisfied: cloudpickle>=1.1.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.6.0)
Requirement already satisfied: pyyaml in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (5.4.1)
Requirement already satisfied: partd>=0.3.10 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from dask==2021.06.1) (1.2.0)
Requirement already satisfied: msgpack>=0.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.0.2)
Requirement already satisfied: tornado>=6.0.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (6.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.4.0)
Requirement already satisfied: setuptools in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (52.0.0.post20210125)
Requirement already satisfied: click>=6.6 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (8.0.1)
Requirement already satisfied: tblib>=1.6.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (1.7.0)
Requirement already satisfied: zict>=0.1.3 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (2.0.0)
Requirement already satisfied: psutil>=5.0 in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from distributed==2021.06.1) (5.8.0)
Requirement already satisfied: locket in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from partd>=0.3.10->dask==2021.06.1) (0.2.1)
Requirement already satisfied: heapdict in ./.conda/envs/py38dask2/lib/python3.8/site-packages (from zict>=0.1.3->distributed==2021.06.1) (1.0.1)
Installing collected packages: dask, distributed
  Attempting uninstall: dask
    Found existing installation: dask 2021.6.0
    Uninstalling dask-2021.6.0:
      Successfully uninstalled dask-2021.6.0
  Attempting uninstall: distributed
    Found existing installation: distributed 2021.6.0
    Uninstalling distributed-2021.6.0:
      Successfully uninstalled distributed-2021.6.0
Successfully installed dask-2021.6.1 distributed-2021.6.1
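
The log above shows that 2021.6.0 was the previously installed version, and the report says only 2021.6.1 and above fail, so pinning back to the prior release is a simple temporary mitigation until a fix is released:

pip install dask==2021.6.0 distributed==2021.6.0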

Environment:

  • Dask version: 2021.6.1 (with distributed 2021.6.1); this version and above fail
  • Python version: 3.8.x
  • Operating System: Ubuntu 18.x
  • Install method (conda, pip, source): pip install dask==2021.06.1 distributed==2021.06.1

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
jsignell commented, Aug 13, 2021

It will be one partition per file. I think it should be fine. But also, the fix was pretty trivial, so I’m hoping we can get it into this release.
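
The workaround under discussion is presumably along these lines: expand the glob yourself, read each file into its own lazy frame, and concatenate (a sketch, not the exact snippet from the thread):

import glob
import dask.dataframe as dd

# Expand the pattern manually, then build one lazy dataframe per file.
paths = sorted(glob.glob("/lustre/arm2arm/ipython/GaiaEDR3/tmp/*.hdf5"))
ddf = dd.concat([dd.read_hdf(p, key="G18") for p in paths])

Nothing is read until the result is computed; each partition is loaded by whichever worker executes that task, so memory stays bounded even with many files.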

0 reactions
arm2arm commented, Aug 13, 2021

Interesting, thanks for the workaround, but wouldn’t that explode my memory? Or how does the concat manage the partitions, assuming you have 100K files and only 16 workers?

