
Rewrite dd.read_hdf

See original GitHub issue

The HDF format is probably the most mature cross-platform format in common use today for pandas-like data. Dask.dataframe supports this format through the `dd.to_hdf` and `dd.read_hdf` functions, which are decently full featured.
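For context, a minimal round-trip sketch of that API (the filenames and key below are illustrative):

import pandas as pd
import dask.dataframe as dd

# Build a small dask dataframe and round-trip it through HDF.
pdf = pd.DataFrame({'x': range(100), 'y': range(100, 200)})
df = dd.from_pandas(pdf, npartitions=4)

# A '*' in the filename makes to_hdf write one file per partition.
df.to_hdf('data-*.h5', '/records')

# read_hdf accepts the same glob pattern and stitches the files
# back together into a single dask dataframe.
df2 = dd.read_hdf('data-*.h5', '/records')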

However, as various features have glommed onto read_hdf, it has become fairly complex, causing a significant maintenance burden and additional friction when adding new features. As we think about naturally supporting intelligent partitioning (see #1407), a new and important feature, we might also consider how read_hdf could be rewritten from scratch, both to improve the existing situation and to ease future development.

It would be nice now to collect both current frustrations and long-term desires for working with dask.dataframe and HDF files together, so that future designs can keep them in mind.

cc @nirizr @alberthdev

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 22 (20 by maintainers)

Top GitHub Comments

2 reactions
ankravch commented, Aug 19, 2016

Hi there, is there a plan for dd.read_hdf to support the ‘where’ and ‘columns’ keywords by exploiting what’s already done in pd.read_hdf?
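For reference, a minimal sketch of the pandas behaviour the question refers to (the filename and key are illustrative): pd.read_hdf can push ‘where’ and ‘columns’ down into PyTables, but only for table-format stores.

import pandas as pd

pdf = pd.DataFrame({'x': range(100), 'y': range(100)})

# Pushdown requires a table-format store; data_columns makes 'x'
# queryable in 'where' expressions.
pdf.to_hdf('store.h5', 'records', format='table', data_columns=['x'])

# PyTables evaluates the query inside the file, so only the matching
# rows and the requested columns are read from disk.
subset = pd.read_hdf('store.h5', 'records', where='x > 90', columns=['y'])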

1 reaction
mrocklin commented, Aug 13, 2016

My pseudocode above fails to properly divide single large files:

if '*' in filename or '*' in datapath:
    filenames, datapaths = find_all_matching_datapaths(filename, datapath)
    filenames, datapaths = sort_by_name(filenames, datapaths)
    dfs = [dd.read_hdf(fn, datapath) for fn, datapath in zip(filenames, datapaths)]
    if not mostly_sorted_divisions(dfs):
        dfs = [df.clear_divisions() for df in dfs]
    return dd.concat(dfs)  # use logic in concat to realign slightly overlapping partitions
else:
    return old_single_datapath_version_of_read_hdf(filename, datapath)

This is because we still have the problem of “how do I find rows to cleanly divide up this single file?”

This could be resolved if, instead of these lines:

filenames, datapaths = find_all_matching_datapaths(filename, datapath)
filenames, datapaths = sort_by_name(filenames, datapaths)
dfs = [dd.read_hdf(fn, datapath) for fn, datapath in zip(filenames, datapaths)]

we had these lines:

filenames, datapaths, kwargs_list = find_all_pd_read_hdf_calls(filename, datapath)
filenames, datapaths, kwargs_list = sort_by_name(filenames, datapaths, kwargs_list)
dfs = [dd.read_hdf(fn, datapath, **kwargs) for fn, datapath, kwargs in zip(filenames, datapaths, kwargs_list)]

where kwargs_list might now optionally include start/stop entries for the pd.read_hdf call. That is, we make a new dask.dataframe partition for every call, even for partitions within a single file/datapath.
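A hypothetical sketch of that idea in terms of today’s public APIs (read_hdf_chunked and its chunksize default are invented for illustration, not the proposed implementation):

import pandas as pd
from dask import delayed
import dask.dataframe as dd

def read_hdf_chunked(path, key, chunksize=100_000):
    # One pd.read_hdf call, and hence one dask partition, per row
    # range, even though everything lives in a single file/datapath.
    with pd.HDFStore(path, mode='r') as store:
        nrows = store.get_storer(key).nrows  # table format only
    parts = [
        delayed(pd.read_hdf)(path, key, start=start,
                             stop=min(start + chunksize, nrows))
        for start in range(0, nrows, chunksize)
    ]
    return dd.from_delayed(parts)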

Read more comments on GitHub >
