Add catalog dataset for HDF formats not supported by pandas
Description
HDF files which were not created by `pandas.to_hdf` sometimes cannot be read using `pandas.read_hdf`. Since Kedro's `HDFDataSet` depends on pandas, these files cannot be added to the catalog at all.
Context
Currently, the dynamic simulation software employed in my research group outputs exclusively `.h5` files, which contain information we wish to feed to ML models using Kedro. For now we use a Python script that converts these HDF files into CSVs so Kedro can track them, but this is an inefficient process: we are required to rewrite thousands of files just to process them.
Ideally, we would like to add our dataset to the data catalog just like we do with our CSV datasets, but in a way that can read any kind of HDF file, unlike `kedro.extras.datasets.pandas.HDFDataSet`.
Given that pandas cannot read HDF files which do not conform to its specs (apparently by design, according to this issue), this simple addition would benefit any user who stores information in HDF files, whether because it is their preferred storage method or because (as in our case) they use software which directly outputs HDF.
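To illustrate the gap: files like ours contain plain HDF5 groups and datasets with none of the metadata `pandas.to_hdf` writes, so `pandas.read_hdf` rejects them while `h5py` reads them directly. A minimal sketch (the file name, group names, and attribute are invented for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

# Create an HDF5 file the way third-party simulation software might:
# plain groups, datasets, and attributes, with none of the metadata
# that pandas.to_hdf writes.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sim_output.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("trajectory/positions", data=np.arange(6.0).reshape(3, 2))
    f["trajectory"].attrs["dt"] = 0.01

# pandas.read_hdf would refuse such a file; h5py opens it like any other.
with h5py.File(path, "r") as f:
    positions = f["trajectory/positions"][...]
    dt = f["trajectory"].attrs["dt"]
```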
Possible Implementation
My research colleague @Eric-OG and I believe we can implement this on our own. I think it’s worth noting it’d be our first contribution to an open source project, but we’ve read the guidelines and so forth.
It would basically involve copying one of the existing datasets (possibly even `pandas.HDFDataSet`) and adapting it to use another library. We planned on using `h5py`.
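A rough sketch of that adaptation might look like the following. The class name, method names, and the copy-based save are assumptions, not a finished design; a real Kedro dataset would subclass the same abstract base class `pandas.HDFDataSet` uses and add fsspec/versioning support, which is omitted here.

```python
from pathlib import Path

import h5py


class H5pyDataSet:
    """Standalone sketch of the proposed dataset (hypothetical name).

    Only the h5py-backed load/save core is shown; a real implementation
    would subclass Kedro's abstract dataset base class.
    """

    def __init__(self, filepath, load_args=None, save_args=None):
        self._filepath = Path(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _load(self):
        # Return a read-only h5py.File; groups/datasets are read lazily.
        return h5py.File(self._filepath, "r", **self._load_args)

    def _save(self, data):
        # `data` is assumed to be an open h5py.File (or group); its
        # top-level members are copied into a fresh file at the target path.
        with h5py.File(self._filepath, "w", **self._save_args) as f:
            for key in data:
                data.copy(key, f)

    def _exists(self):
        return self._filepath.is_file()


# Quick round trip using throwaway temp files:
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "src.h5")
with h5py.File(src_path, "w") as src:
    src.create_dataset("x", data=np.arange(3))

ds = H5pyDataSet(os.path.join(tmp, "copy.h5"))
with h5py.File(src_path, "r") as src:
    ds._save(src)
with ds._load() as copied:
    x = list(copied["x"][...])
```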
Possible Alternatives
- Performing part of our data processing pipelines without Kedro; this is cumbersome and can get harder to maintain, especially since our code will likely be used by new researchers next year;
- Converting the files to another type already implemented; this is what we do today but it’s simply inefficient.
Issue Analytics
- Created 2 years ago
- Comments: 14 (5 by maintainers)
I decided to come back to this issue and finally adapt the tests we needed so we could submit a PR.
All tests I’ve developed in our fork are currently passing, but I don’t know how to get to 100% coverage.
`pytest` complains that a couple of lines on a parameter dictionary are not covered, and I just don't know what to do about it. This is the snippet, in `kedro.extras.datasets.hdf5.h5py_dataset.__h5py_from_binary`:

Another issue I'm facing is that I can't make linting pass. I get this output from `pre-commit`:

I assume our fork being a bit over a year old might be to blame, but I couldn't find any issue in the GitHub repo related to changes to the configuration files mentioned in the error messages.
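For context on the helper mentioned above: judging by its name, `__h5py_from_binary` presumably turns raw bytes back into an `h5py.File`. A hedged sketch of that idea (the helper names and bodies are assumptions, not the fork's actual code) relies on h5py ≥ 2.9 accepting file-like objects:

```python
import io

import h5py
import numpy as np


def _h5py_to_binary(hdf_file):
    # Hypothetical helper: copy an open h5py.File into an in-memory HDF5
    # image backed by BytesIO and return the raw bytes.
    buf = io.BytesIO()
    with h5py.File(buf, "w") as mem:
        for key in hdf_file:
            hdf_file.copy(key, mem)
    return buf.getvalue()


def _h5py_from_binary(binary):
    # Hypothetical counterpart: h5py >= 2.9 accepts file-like objects,
    # so the bytes can be wrapped in BytesIO and opened like a file.
    return h5py.File(io.BytesIO(binary), "r")


# Round trip entirely in memory:
src_buf = io.BytesIO()
with h5py.File(src_buf, "w") as f:
    f.create_dataset("x", data=np.arange(4))
with h5py.File(src_buf, "r") as f:
    payload = _h5py_to_binary(f)
restored = _h5py_from_binary(payload)
x = list(restored["x"][...])
restored.close()
```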
cc @Eric-OG
For transparency’s sake, I’ll be upfront and say @Eric-OG and I won’t work on this issue for some time because of other obligations. I think we can be reasonably sure that we’ll have at least opened the PR before the end of the year.
Meanwhile, if anyone's interested, the fork I linked in this thread has an initial implementation, though saving hasn't been thoroughly tested manually and we don't know how to integrate the `save_args`.