BUG: DataFrame.attrs are lost when writing to HDF5

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas. Version 1.0.3
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

In [9]: df = pd.DataFrame(index=[1, 2, 3], columns=list("abcde"), data=np.ones((3,5)))                                                  

In [10]: df.attrs["foo"] = "bar"  

In [11]: df.to_hdf("test_df.h5", key="key")                         

In [12]: df_from_h5 = pd.read_hdf("test_df.h5")                     

In [13]: assert df.attrs == df_from_h5.attrs, "attrs have gone"     
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-1fab1bc115de> in <module>
----> 1 assert df.attrs == df_from_h5.attrs, "attrs have gone"

AssertionError: attrs have gone

Problem description

The metadata stored in attributes is gone after the DataFrame was read back from disk. I understand the attrs dict is WIP. I hope this issue will help to move this forward! Thanks!

Related: #29062

Expected Output

The attrs should be the same as in the DataFrame written to disk.
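Until the attrs survive the round trip on their own, one possible workaround is to store them next to the frame as a PyTables node attribute. The sketch below is not how pandas handles attrs internally. It relies on `HDFStore.get_storer(key).attrs`, which the pandas cookbook uses for attaching arbitrary Python objects to a stored node. The helper names `write_hdf_with_attrs` and `read_hdf_with_attrs` and the attribute name `user_attrs` are made up for illustration, and PyTables must be installed.

```python
import pandas as pd


def write_hdf_with_attrs(df, path, key):
    """Write df to an HDF5 file and attach a copy of df.attrs to the stored node."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df)
        # "user_attrs" is a hypothetical attribute name, not a pandas convention.
        # PyTables pickles the dict and stores it as a node attribute.
        store.get_storer(key).attrs.user_attrs = dict(df.attrs)


def read_hdf_with_attrs(path, key):
    """Read the frame back and restore the attrs saved by write_hdf_with_attrs."""
    with pd.HDFStore(path, mode="r") as store:
        df = store.get(key)
        # Fall back to None if the file was written without the extra attribute.
        saved = getattr(store.get_storer(key).attrs, "user_attrs", None)
    if saved:
        df.attrs.update(saved)
    return df
```

With these helpers, `write_hdf_with_attrs(df, "test_df.h5", "key")` followed by `read_hdf_with_attrs("test_df.h5", "key")` would take the place of the `to_hdf` / `read_hdf` pair in the example above. Files written this way are still readable with plain `pd.read_hdf`, which simply ignores the extra attribute.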

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.6.15-arch1-1
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : de_DE.utf8
LOCALE           : de_DE.UTF-8

pandas           : 1.0.3
numpy            : 1.18.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 47.1.1
Cython           : 0.29.19
pytest           : 5.4.2
hypothesis       : None
sphinx           : 3.0.4
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.15.0
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.1
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.2
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.17
tables           : 3.6.1
tabulate         : None
xarray           : 0.15.2.dev47+g33a66d63
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : 0.49.1

Top GitHub Comments

TomAugspurger commented, Jun 5, 2020

It might be worth looking at how xarray handles these. Ideally we would be compatible with how / where they store metadata.

janosh commented, Aug 25, 2022

The same is true for to_json btw (and probably all serialization methods?).


from datetime import datetime

import pandas as pd

df = pd.util.testing.makeMixedDataFrame()

today = f"{datetime.now():%Y-%m-%d}"
df.attrs["created_at"] = today
df.to_json("test.json")

df_from_json = pd.read_json("test.json")

assert df.attrs == df_from_json.attrs, f"{df_from_json.attrs = }"
# raises AssertionError: df_from_json.attrs = {}

attrs will be a tremendously useful feature once it gets better permanence.
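For the JSON case, a minimal sketch of one way to carry the attrs along is to wrap the frame and its attrs in a single JSON document. This is a workaround under stated assumptions, not pandas behaviour: the helper names `to_json_with_attrs` and `read_json_with_attrs` and the `"attrs"` / `"data"` layout are invented for illustration, and the attrs must themselves be JSON-serializable (strings, numbers, lists, dicts).

```python
import io
import json

import pandas as pd


def to_json_with_attrs(df, path):
    """Write df and df.attrs into one JSON document (hypothetical layout)."""
    payload = {
        "attrs": df.attrs,
        # Reuse pandas' own JSON encoding for the frame itself.
        "data": json.loads(df.to_json(orient="split")),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def read_json_with_attrs(path):
    """Rebuild the frame and re-attach the attrs written by to_json_with_attrs."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    # Hand the frame part back to pandas' JSON reader via an in-memory buffer.
    df = pd.read_json(io.StringIO(json.dumps(payload["data"])), orient="split")
    df.attrs.update(payload["attrs"])
    return df
```

The resulting file is no longer something plain `pd.read_json` can load directly, so this only suits cases where both the writer and the reader go through the two helpers.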
