
Rewrite dd.read_hdf

See original GitHub issue

The HDF format is probably the most mature cross-platform format in common use today for pandas-like data. Dask.dataframe supports this format through the `dd.to_hdf` and `dd.read_hdf` functions, which are decently full featured.
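For context, a minimal round-trip sketch of that API (the filenames and key below are illustrative):

import pandas as pd
import dask.dataframe as dd

# Build a small dask dataframe and round-trip it through HDF.
pdf = pd.DataFrame({'x': range(100), 'y': range(100, 200)})
df = dd.from_pandas(pdf, npartitions=4)

# A '*' in the filename makes to_hdf write one file per partition.
df.to_hdf('data-*.h5', '/records')

# read_hdf accepts the same glob pattern and stitches the files
# back together into a single dask dataframe.
df2 = dd.read_hdf('data-*.h5', '/records')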

However, as various features have glommed onto read_hdf, it has become fairly complex, causing a significant maintenance burden and additional friction when adding new features. As we think about naturally supporting intelligent partitioning (see #1407), a new and important feature, we might also consider how read_hdf could be rewritten from scratch, both to improve the existing situation and to ease future development.

It would be nice now to collect both current frustrations and long-term desires for working with dask.dataframe and HDF files together, so that future designs can keep them in mind.

cc @nirizr @alberthdev

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 22 (20 by maintainers)

Top GitHub Comments

2 reactions
ankravch commented, Aug 19, 2016

Hi there, is there a plan for dd.read_hdf to support the ‘where’ and ‘columns’ keywords by exploiting what’s already done in pd.read_hdf?
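For reference, a minimal sketch of the pandas behaviour the question refers to (the filename and key are illustrative): pd.read_hdf can push ‘where’ and ‘columns’ down into PyTables, but only for table-format stores.

import pandas as pd

pdf = pd.DataFrame({'x': range(100), 'y': range(100)})

# Pushdown requires a table-format store; data_columns makes 'x'
# queryable in 'where' expressions.
pdf.to_hdf('store.h5', 'records', format='table', data_columns=['x'])

# PyTables evaluates the query inside the file, so only the matching
# rows and the requested columns are read from disk.
subset = pd.read_hdf('store.h5', 'records', where='x > 90', columns=['y'])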

1 reaction
mrocklin commented, Aug 13, 2016

My pseudocode above fails to properly divide single large files:

if '*' in filename or '*' in datapath:
    filenames, datapaths = find_all_matching_datapaths(filename, datapath)
    filenames, datapaths = sort_by_name(filenames, datapaths)
    dfs = [dd.read_hdf(fn, datapath) for fn, datapath in zip(filenames, datapaths)]
    if not mostly_sorted_divisions(dfs):
        dfs = [df.clear_divisions() for df in dfs]
    return dd.concat(dfs)  # use logic in concat to realign slightly overlapping partitions
else:
    return old_single_datapath_version_of_read_hdf(filename, datapath)

This is because we still have the problem of “how do I find rows to cleanly divide up this single file?”

This could be resolved if, instead of these lines:

filenames, datapaths = find_all_matching_datapaths(filename, datapath)
filenames, datapaths = sort_by_name(filenames, datapaths)
dfs = [dd.read_hdf(fn, datapath) for fn, datapath in zip(filenames, datapaths)]

we had these lines:

filenames, datapaths, kwargs_list = find_all_pd_read_hdf_calls(filename, datapath)
filenames, datapaths, kwargs_list = sort_by_name(filenames, datapaths, kwargs_list)
dfs = [dd.read_hdf(fn, datapath, **kwargs) for fn, datapath, kwargs in zip(filenames, datapaths, kwargs_list)]

where kwargs_list might now optionally include start/stop entries for the pd.read_hdf call. That is, we make a new dask.dataframe partition for every call, even for partitions within a single file/datapath.
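A hypothetical sketch of that idea in terms of today’s public APIs (read_hdf_chunked and its chunksize default are invented for illustration, not the proposed implementation):

import pandas as pd
from dask import delayed
import dask.dataframe as dd

def read_hdf_chunked(path, key, chunksize=100_000):
    # One pd.read_hdf call, and hence one dask partition, per row
    # range, even though everything lives in a single file/datapath.
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows  # table format only
    parts = [
        delayed(pd.read_hdf)(path, key, start=start,
                             stop=min(start + chunksize, nrows))
        for start in range(0, nrows, chunksize)
    ]
    return dd.from_delayed(parts)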

Read more comments on GitHub >
