BUG: DataFrame.attrs are lost when writing to HDF5

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas. Version 1.0.3

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

# Assumes earlier in the session: import numpy as np; import pandas as pd
In [9]: df = pd.DataFrame(index=[1, 2, 3], columns=list("abcde"), data=np.ones((3, 5)))

In [10]: df.attrs["foo"] = "bar"  

In [11]: df.to_hdf("test_df.h5", key="key")                         

In [12]: df_from_h5 = pd.read_hdf("test_df.h5")                     

In [13]: assert df.attrs == df_from_h5.attrs, "attrs have gone"     
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-1fab1bc115de> in <module>
----> 1 assert df.attrs == df_from_h5.attrs, "attrs have gone"

AssertionError: attrs have gone

Problem description

The metadata stored in attributes is gone after the DataFrame was read back from disk. I understand the attrs dict is WIP. I hope this issue will help to move this forward! Thanks!

Related: #29062

Expected Output

The attrs should be the same as in the DataFrame written to disk.
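Until `attrs` survive the round trip, one possible workaround is to store them by hand as an attribute on the HDF5 node, using `HDFStore.get_storer(key).attrs` as shown in the pandas cookbook. The sketch below assumes PyTables is installed; the helper names and the `user_attrs` attribute name are made up for illustration and are not part of pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd

ATTRS_NAME = "user_attrs"  # hypothetical node-attribute name, not a pandas convention


def to_hdf_with_attrs(df, path, key):
    """Write df under `key` and copy df.attrs onto the same HDF5 node."""
    with pd.HDFStore(path, mode="a") as store:
        store.put(key, df)
        # PyTables pickles plain Python objects such as dicts stored as node attributes
        setattr(store.get_storer(key).attrs, ATTRS_NAME, dict(df.attrs))


def read_hdf_with_attrs(path, key):
    """Read df from `key` and restore attrs if the node carries them."""
    with pd.HDFStore(path, mode="r") as store:
        df = store.get(key)
        stored = getattr(store.get_storer(key).attrs, ATTRS_NAME, {})
    df.attrs.update(stored)
    return df


# Round trip with the frame from the code sample above
df = pd.DataFrame(index=[1, 2, 3], columns=list("abcde"), data=np.ones((3, 5)))
df.attrs["foo"] = "bar"

with tempfile.TemporaryDirectory() as tmp:
    h5_path = os.path.join(tmp, "test_df.h5")
    to_hdf_with_attrs(df, h5_path, key="key")
    restored = read_hdf_with_attrs(h5_path, key="key")

assert restored.attrs == df.attrs
```

This only covers files written and read through these two helpers; a plain `pd.read_hdf` call still returns a frame with empty `attrs`.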

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.6.15-arch1-1
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : de_DE.utf8
LOCALE           : de_DE.UTF-8

pandas           : 1.0.3
numpy            : 1.18.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 47.1.1
Cython           : 0.29.19
pytest           : 5.4.2
hypothesis       : None
sphinx           : 3.0.4
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.15.0
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.1
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.2
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.17
tables           : 3.6.1
tabulate         : None
xarray           : 0.15.2.dev47+g33a66d63
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : 0.49.1

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 2
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Jun 5, 2020

It might be worth looking at how xarray handles these. Ideally we would be compatible with how / where they store metadata.

0 reactions
janosh commented, Aug 25, 2022

The same is true for to_json btw (and probably all serialization methods?).


from datetime import datetime

import pandas as pd

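# pd.util.testing.makeMixedDataFrame() was removed in pandas 2.0, so build a small mixed-dtype frame directly
df = pd.DataFrame({"A": [0.0, 1.0, 2.0], "B": [0, 1, 0], "C": ["foo1", "foo2", "foo3"]})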

today = f"{datetime.now():%Y-%m-%d}"
df.attrs["created_at"] = today
df.to_json("test.json")

df_from_json = pd.read_json("test.json")

assert df.attrs == df_from_json.attrs, f"{df_from_json.attrs = }"
>>> AssertionError: df_from_json.attrs = {}

attrs will be a tremendously useful feature once it gets better permanence.
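In the meantime, a rough workaround for JSON could be to wrap the frame and its `attrs` in one JSON document and unpack both on read. This is only a sketch: the two helper functions and the `attrs`/`data` layout are invented here, nothing in pandas reads or writes this format, and it assumes every value in `attrs` is JSON-serializable.

```python
import io
import json
import os
import tempfile

import pandas as pd


def to_json_with_attrs(df, path):
    """Write df plus df.attrs into one JSON file (hypothetical layout)."""
    payload = {
        "attrs": df.attrs,  # must contain only JSON-serializable values
        "data": json.loads(df.to_json(orient="split")),
    }
    with open(path, "w") as fh:
        json.dump(payload, fh)


def read_json_with_attrs(path):
    """Read a file written by to_json_with_attrs and restore attrs."""
    with open(path) as fh:
        payload = json.load(fh)
    df = pd.read_json(io.StringIO(json.dumps(payload["data"])), orient="split")
    df.attrs.update(payload["attrs"])
    return df


# Round trip with a small frame
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df.attrs["created_at"] = "2022-08-25"

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "test.json")
    to_json_with_attrs(df, json_path)
    restored = read_json_with_attrs(json_path)

assert restored.attrs == df.attrs
```

The file is no longer readable with a bare `pd.read_json`, which is the price for keeping data and metadata together.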
