read_feather method does not restore a dataframe's table-level user-defined properties (attributes or tags or metadata)

I create a property named someid (it is NOT a column) in a pandas dataframe named df and assign it a value:

    df.someid = 24

I make sure the property is there:

    print(df.someid)  # prints 24

I save the dataframe to a feather file:

    df.to_feather("C:/pandas/DataLoss.feather")

I read the feather file back into a dataframe:

    df = pd.read_feather("C:/pandas/DataLoss.feather")

I try to retrieve the property from the dataframe:

    print(df.someid)

And I do not get its value (24) back. Instead I get this error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-99-42d4540986ab> in <module>()
----> 1 print(df.someid)

~\Anaconda3\envs\env_feather\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'

This may well be by design, although afaik it is undocumented. Still, it seems to me that an error should be raised whenever user data is lost, at least if the feather format is intended to be the native storage format for a pandas dataframe. (The inability to store indexes, by contrast, is documented; fwiw, I wish that limitation weren't present either.)
Or there may be a newer, better way of specifying user properties (tags / metadata) that I don't know about, in which case perhaps the old method should give a warning and be deprecated.
Note that I could store the property as a column (as a repeating constant), but because the value repeats on every row, that uses significant space as the number of rows grows, even if it is stored as a categorical. That wouldn't be an issue if there were compression, but there appears to be none. I could use the parquet format instead, and perhaps in the latest release that would be recommended over feather. (I haven't seen a good comparison of the pros and cons of each.)

Thanks for any assistance or comments. (Apologies in advance if I missed something in the docs or online; I didn't see anything relevant after a reasonable search, except the unapproved use of the undocumented _metadata.)

I realize there are thorny, perhaps unresolvable, problems with dataframe table-level metadata propagation (e.g. how to handle vertical dataframe concatenations), but I think what happens (by design) should be documented. As it is, I don’t know what is supposed to be happening by design here.
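For readers looking for "a newer, better way", here is a minimal sketch of one option. It assumes pandas 1.0 or later, where DataFrame.attrs exists (it is marked experimental and is not mentioned in the original issue). The sketch only checks a copy and a pickle round trip; whether a particular file format such as feather or parquet keeps attrs depends on the pandas and pyarrow versions in use and is not tested here.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Assumption: pandas >= 1.0, where the experimental DataFrame.attrs dict exists.
# Table-level metadata goes into the dict instead of an ad-hoc attribute.
df.attrs["someid"] = 24

# Round-trip the frame through a copy and through pickle.
copied = df.copy()
restored = pickle.loads(pickle.dumps(df))

# Look the value up again; .get() returns None if the key was dropped.
copy_value = copied.attrs.get("someid")
pickle_value = restored.attrs.get("someid")
print(copy_value, pickle_value)
```

Because attrs is experimental, its propagation rules (for example across concatenation) may change between releases, so it is worth checking the behaviour on the version you actually run.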

Issue Analytics

Created 6 years ago
Reactions: 1
Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction

wesm commented, Nov 21, 2017

Not even pickle preserves these properties when using pandas. I don't think this is something we intend to fix in generality:

In [1]: import pandas as pd

In [2]: df = pd.util.testing.makeDataFrame()

In [3]: df.someid = 24

In [4]: import pickle

In [5]: pickle.loads(pickle.dumps(df)).someid
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-71102f1fb31f> in <module>()
----> 1 pickle.loads(pickle.dumps(df)).someid

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'
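The session above relies on pd.util.testing, which has been deprecated since pandas 1.0, so it may not run on a current install. A self-contained sketch of the same check follows; the small hand-built frame and the variable names are stand-ins, not part of the original comment.

```python
import pickle

import pandas as pd

# Small hand-built frame, standing in for the makeDataFrame() helper used above.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Ad-hoc attribute set on the instance; it is not a column.
df.someid = 24

# Round-trip the frame through pickle, as in the session above.
restored = pickle.loads(pickle.dumps(df))

# Compare the column data, and check whether the attribute came back.
# hasattr() is used so the missing attribute does not raise AttributeError.
data_survived = restored.equals(df)
attribute_survived = hasattr(restored, "someid")
print(data_survived, attribute_survived)
```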

0 reactions

techvslife commented, Nov 22, 2017

Just fyi, some useful cross-references:

Looks like the "lost attribute" issue popped up as early as 2012: https://stackoverflow.com/questions/13250499/attributes-to-a-subclass-of-pandas-dataframe-disappear-after-pickle

One of the more comprehensive discussions I've found: https://github.com/pandas-dev/pandas/issues/2485
