Add catalog dataset for HDF formats not supported by pandas
Description
HDF files which were not created by `pandas.to_hdf` sometimes cannot be read using `pandas.read_hdf`. Since Kedro's `HDFDataSet` depends on pandas, these files cannot be added to the catalog at all.
Context
Currently, the dynamic simulation software employed in my research group outputs exclusively `.h5` files, which contain information we wish to feed to ML models using Kedro. For now we use a Python script that converts these HDF files into CSVs so Kedro can track them, but this is an inefficient process: we are required to rewrite thousands of files just to process them.
Ideally, we would like to add our dataset to the data catalog just like we do with our CSV datasets, but in a way that can read any kind of HDF file, unlike `kedro.extras.datasets.pandas.HDFDataSet`.
Given that pandas cannot read HDF files which do not conform to its specs (apparently by design, according to this issue), this simple addition would benefit any user who stores information in HDF files, whether because it is their preferred storage method or because (as in our case) they use software which directly outputs HDF.
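To illustrate the gap: files like ours contain plain HDF5 groups and datasets with none of the metadata `pandas.to_hdf` writes, so `pandas.read_hdf` rejects them while `h5py` reads them directly. A minimal sketch (the file name, group names, and attribute are invented for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

# Create an HDF5 file the way third-party simulation software might:
# plain groups, datasets, and attributes, with none of the metadata
# that pandas.to_hdf writes.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sim_output.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("trajectory/positions", data=np.arange(6.0).reshape(3, 2))
    f["trajectory"].attrs["dt"] = 0.01

# pandas.read_hdf would refuse such a file; h5py opens it like any other.
with h5py.File(path, "r") as f:
    positions = f["trajectory/positions"][...]
    dt = f["trajectory"].attrs["dt"]
```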
Possible Implementation
My research colleague @Eric-OG and I believe we can implement this on our own. I think it’s worth noting it’d be our first contribution to an open source project, but we’ve read the guidelines and so forth.
It would basically involve copying one of the existing datasets (possibly even `pandas.HDFDataSet`) and adapting it to use another library. We planned on using `h5py`.
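A rough sketch of that adaptation might look like the following. The class name, method names, and the copy-based save are assumptions, not a finished design; a real Kedro dataset would subclass the same abstract base class `pandas.HDFDataSet` uses and add fsspec/versioning support, which is omitted here.

```python
from pathlib import Path

import h5py


class H5pyDataSet:
    """Standalone sketch of the proposed dataset (hypothetical name).

    Only the h5py-backed load/save core is shown; a real implementation
    would subclass Kedro's abstract dataset base class.
    """

    def __init__(self, filepath, load_args=None, save_args=None):
        self._filepath = Path(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _load(self):
        # Return a read-only h5py.File; groups/datasets are read lazily.
        return h5py.File(self._filepath, "r", **self._load_args)

    def _save(self, data):
        # `data` is assumed to be an open h5py.File (or group); its
        # top-level members are copied into a fresh file at the target path.
        with h5py.File(self._filepath, "w", **self._save_args) as f:
            for key in data:
                data.copy(key, f)

    def _exists(self):
        return self._filepath.is_file()


# Quick round trip using throwaway temp files:
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "src.h5")
with h5py.File(src_path, "w") as src:
    src.create_dataset("x", data=np.arange(3))

ds = H5pyDataSet(os.path.join(tmp, "copy.h5"))
with h5py.File(src_path, "r") as src:
    ds._save(src)
with ds._load() as copied:
    x = list(copied["x"][...])
```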
Possible Alternatives
- Performing part of our data processing pipelines without Kedro; this is cumbersome and can get harder to maintain, especially since our code will likely be used by new researchers next year;
- Converting the files to another type already implemented; this is what we do today but it’s simply inefficient.
Issue Analytics
- Created 2 years ago
- Comments: 14 (5 by maintainers)
I decided to come back to this issue and finally adapt the tests we needed so we could submit a PR.
All tests I’ve developed in our fork are currently passing, but I don’t know how to get to 100% coverage.
`pytest` complains that a couple of lines on a parameter dictionary are not covered, and I just don't know what to do about it. This is the snippet, in `kedro.extras.datasets.hdf5.h5py_dataset.__h5py_from_binary`:

Another issue I'm facing is that I can't make linting pass. I get this output from `pre-commit`:

I assume our fork being a bit over a year old might be to blame, but I couldn't find any issue in the GitHub repo related to changes to the configuration files mentioned in the error messages.
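For context on the helper mentioned above: judging by its name, `__h5py_from_binary` presumably turns raw bytes back into an `h5py.File`. A hedged sketch of that idea (the helper names and bodies are assumptions, not the fork's actual code) relies on h5py ≥ 2.9 accepting file-like objects:

```python
import io

import h5py
import numpy as np


def _h5py_to_binary(hdf_file):
    # Hypothetical helper: copy an open h5py.File into an in-memory HDF5
    # image backed by BytesIO and return the raw bytes.
    buf = io.BytesIO()
    with h5py.File(buf, "w") as mem:
        for key in hdf_file:
            hdf_file.copy(key, mem)
    return buf.getvalue()


def _h5py_from_binary(binary):
    # Hypothetical counterpart: h5py >= 2.9 accepts file-like objects,
    # so the bytes can be wrapped in BytesIO and opened like a file.
    return h5py.File(io.BytesIO(binary), "r")


# Round trip entirely in memory:
src_buf = io.BytesIO()
with h5py.File(src_buf, "w") as f:
    f.create_dataset("x", data=np.arange(4))
with h5py.File(src_buf, "r") as f:
    payload = _h5py_to_binary(f)
restored = _h5py_from_binary(payload)
x = list(restored["x"][...])
restored.close()
```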
cc @Eric-OG
For transparency’s sake, I’ll be upfront and say @Eric-OG and I won’t work on this issue for some time because of other obligations. I think we can be reasonably sure that we’ll have at least opened the PR before the end of the year.
Meanwhile, if anyone's interested, the fork I linked in this thread has an initial implementation, though saving hasn't been thoroughly tested manually and we don't know how to integrate the `save_args`.