read_hdf fails in dask even though it works in pandas

See original GitHub issue

dask version 0.9.0, pandas 0.18.1 (most recent from conda as of posting)

Grab a 1 MB fake TAQ file from here. (Aside: the same data is ~1/4 MB in zipped fixed-width format; the chunk sizes are probably poorly chosen for this data.)

pd.read_hdf('small_test_data_public.h5', '/IXQAJE/no_suffix') works, but dask.dataframe.read_hdf('small_test_data_public.h5', '/IXQAJE/no_suffix') fails with the stack trace below. I think the failure comes from the attempt to read an empty, 0-length dataframe in order to get at the column metadata. If I read the intent correctly, it would probably make more sense to retrieve a PyTables or h5py object, which would provide the desired metadata without the weirdness around a 0-length read.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-8ad7ff0d4733> in <module>()
----> 1 spy_dd = dd.read_hdf(fname, max_sym)

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/dask/dataframe/io.py in read_hdf(pattern, key, start, stop, columns, chunksize, lock)
    559                                     columns=columns, chunksize=chunksize,
    560                                     lock=lock)
--> 561                    for path in paths])
    562 
    563 

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/dask/dataframe/io.py in <listcomp>(.0)
    559                                     columns=columns, chunksize=chunksize,
    560                                     lock=lock)
--> 561                    for path in paths])
    562 
    563 

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/dask/dataframe/io.py in _read_single_hdf(path, key, start, stop, columns, chunksize, lock)
    499     from .multi import concat
    500     return concat([one_path_one_key(path, k, start, s, columns, chunksize, lock)
--> 501                    for k, s in zip(keys, stops)])
    502 
    503 

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/dask/dataframe/io.py in <listcomp>(.0)
    499     from .multi import concat
    500     return concat([one_path_one_key(path, k, start, s, columns, chunksize, lock)
--> 501                    for k, s in zip(keys, stops)])
    502 
    503 

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/dask/dataframe/io.py in one_path_one_key(path, key, start, stop, columns, chunksize, lock)
    474         not contain any wildcards).
    475         """
--> 476         empty = pd.read_hdf(path, key, stop=0)
    477         if columns is not None:
    478             empty = empty[columns]

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
    328                                  'multiple datasets.')
    329             key = keys[0]
--> 330         return store.select(key, auto_close=auto_close, **kwargs)
    331     except:
    332         # if there is an error, close the store

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    678                            chunksize=chunksize, auto_close=auto_close)
    679 
--> 680         return it.get_result()
    681 
    682     def select_as_coordinates(

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
   1362 
   1363         # directly return the result
-> 1364         results = self.func(self.start, self.stop, where)
   1365         self.close()
   1366         return results

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
    671             return s.read(start=_start, stop=_stop,
    672                           where=_where,
--> 673                           columns=columns, **kwargs)
    674 
    675         # create the iterator

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/io/pytables.py in read(self, where, columns, **kwargs)
   4052 
   4053             block = make_block(values, placement=np.arange(len(cols_)))
-> 4054             mgr = BlockManager([block], [cols_, index_])
   4055             frames.append(DataFrame(mgr))
   4056 

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   2592 
   2593         if do_integrity_check:
-> 2594             self._verify_integrity()
   2595 
   2596         self._consolidate_check()

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/core/internals.py in _verify_integrity(self)
   2802         for block in self.blocks:
   2803             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
-> 2804                 construction_error(tot_items, block.shape[1:], self.axes)
   2805         if len(self.items) != tot_items:
   2806             raise AssertionError('Number of manager items must equal union of '

/home/dav/miniconda3/envs/TAQ/lib/python3.5/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   3966         raise e
   3967     if block_shape[0] == 0:
-> 3968         raise ValueError("Empty data passed with indices specified.")
   3969     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
   3970         passed, implied))

ValueError: Empty data passed with indices specified.
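
The ValueError at the bottom of this traceback is pandas' internal integrity check firing: zero rows of data were paired with a non-empty index. A minimal, HDF5-free illustration of the same check (a sketch of the failure mode, not of dask's code):

```python
import numpy as np
import pandas as pd

# Zero rows of data, but an index of length 1: pandas' construction
# integrity check rejects this combination, just as in the traceback above.
try:
    pd.DataFrame(np.empty((0, 0)), index=[0], columns=["a"])
    caught = None
except ValueError as err:
    caught = err

print(caught)
```

This is why the `pd.read_hdf(path, key, stop=0)` call inside dask's `one_path_one_key` can blow up: for this file, the 0-length slice of values arrives alongside non-empty index metadata.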

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
DavidLP commented, Oct 5, 2017

The error is still there in dask 0.15.2 and pandas 0.20.3. This issue should be reopened. Minimal example:

import urllib.request
import dask.dataframe as dd
import pandas as pd

urllib.request.urlretrieve("https://github.com/dlab-projects/marketflow/raw/master/test-data/small_test_data_public.h5",
                           "small_test_data_public.h5")
pd.read_hdf('small_test_data_public.h5',
            '/IXQAJE/no_suffix')
dd.read_hdf('small_test_data_public.h5',
            '/IXQAJE/no_suffix')

still gives: ValueError: Empty data passed with indices specified.

1 reaction
jackdbd commented, Jan 5, 2018

I have the same issue.

Here is my environment: Python 3.5.2, pandas 0.21.1, dask 0.16.0.

Note: I replaced urllib.urlretrieve with urllib.request.urlretrieve.
