Rewrite dd.read_hdf
The HDF format is probably the most mature cross-platform format in common use today for pandas-like data. Dask.dataframe supports this format through the dd.to_hdf and dd.read_hdf functions, which are decently full featured.
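For context, a minimal round trip through these two functions looks roughly like the following sketch (the file names and key here are illustrative):

```python
import pandas as pd
import dask.dataframe as dd

# Build a small dask dataframe and write it out, one HDF file per partition;
# the '*' in the path is replaced by the partition number.
df = pd.DataFrame({"x": range(100), "y": range(100)})
ddf = dd.from_pandas(df, npartitions=4)
ddf.to_hdf("data.*.hdf", "/records")

# Read it back; glob patterns over files (and keys) are supported.
ddf2 = dd.read_hdf("data.*.hdf", "/records")
```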
However, as various features have glommed onto read_hdf, it has become fairly complex, causing a significant maintenance burden and additional friction when adding new features. As we think about naturally supporting intelligent partitioning (see #1407), a new and important feature, we might also consider how read_hdf might be rewritten from scratch, both to improve the existing situation and to ease future development.
It would be nice to now collect both some current frustrations and some long-term desires for dealing with dask.dataframe and HDF files together so that future designs can have these in mind.
Issue Analytics
- Created: 7 years ago
- Comments: 22 (20 by maintainers)
Top GitHub Comments
Hi there, is there a plan for dd.read_hdf to support the ‘where’ and ‘columns’ keywords by exploiting what’s originally done in pd.read_hdf?
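As a workaround today, one can push that selection down into pandas itself via dask.delayed, since pd.read_hdf already accepts where and columns for table-format stores. A sketch, with made-up file names, key, and query:

```python
import pandas as pd
import dask
import dask.dataframe as dd

files = ["data.0.hdf", "data.1.hdf"]  # hypothetical per-partition files

# pd.read_hdf honors 'where' and 'columns' for format='table' stores
# (with 'where' restricted to indexed/data columns), so each delayed
# task performs the filtered read itself.
parts = [
    dask.delayed(pd.read_hdf)(fn, "/records", where="x > 10", columns=["x"])
    for fn in files
]
ddf = dd.from_delayed(parts)
```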
My pseudocode above fails to properly divide single large files, because we still have the problem of “how do I find rows to cleanly divide up this single file”.
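For table-format stores, one plausible answer is to ask the store for its row count up front, without reading any data, and carve that into ranges (a sketch; the file name, key, and chunk size are illustrative):

```python
import pandas as pd

# Table-format storers expose a row count without loading the data,
# which gives clean start/stop boundaries for each partition.
with pd.HDFStore("big.hdf", mode="r") as store:
    nrows = store.get_storer("/records").nrows

chunksize = 1_000_000
starts = list(range(0, nrows, chunksize))
stops = [min(s + chunksize, nrows) for s in starts]
```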
This could be resolved if, instead of these lines, we had lines where kwargs_list might optionally include start/stop entries for the pd.read_hdf call. That is, we would construct a new dask.dataframe partition for every entry, even for partitions within a single file/datapath.
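Concretely, that might look something like the sketch below, where kwargs_list is a hypothetical list holding one keyword dict per output partition (the file name, key, and chunk boundaries are made up):

```python
import pandas as pd
import dask
import dask.dataframe as dd

# Hypothetical: several entries may point at the same file/key but
# select disjoint row ranges via the start/stop keywords of pd.read_hdf.
kwargs_list = [
    {"path_or_buf": "big.hdf", "key": "/records", "start": 0, "stop": 1_000_000},
    {"path_or_buf": "big.hdf", "key": "/records", "start": 1_000_000, "stop": 2_000_000},
]

parts = [dask.delayed(pd.read_hdf)(**kwargs) for kwargs in kwargs_list]
ddf = dd.from_delayed(parts)
```

Each pd.read_hdf call backs exactly one partition, so a single large file can still produce many partitions.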