read_feather method does not restore a dataframe's table-level user-defined properties (attributes or tags or metadata)


I create a property named someid (it is NOT a column) in a pandas dataframe named df and assign it a value:

df.someid = 24

I make sure the property is there:

print(df.someid)  # prints 24

I save the dataframe to a feather file:

df.to_feather("C:/pandas/DataLoss.feather")

I read the feather file back into a dataframe:

df = pd.read_feather("C:/pandas/DataLoss.feather")

I try to retrieve the property from the dataframe:

print(df.someid)

And I do not get its value (24) back. Instead I get this error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-99-42d4540986ab> in <module>()
----> 1 print(df.someid)

~\Anaconda3\envs\env_feather\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'
  1. This may well be by design, although afaik undocumented, but it seems to me that an error should be raised whenever user data is lost, if the feather format is intended to be the native storage format for a pandas dataframe. (The inability to store indexes is documented, however; fwiw, I wish that limitation weren’t present either.)
  2. Or there may be a newer, better way of specifying user properties (tags / metadata) that I don’t know about, in which case perhaps the old method should give a warning and be deprecated.
  3. Note that I could store the property as a column (as a repeating constant), but that uses up significant space as the number of rows increases (because the value repeats on every row), even if stored as a categorical. That wouldn’t be an issue if there were compression, but there appears to be none. I could use parquet format instead, and perhaps in the latest release that would be recommended over feather format. (I haven’t seen a good comparison of the pros and cons of each.)

Thanks for any assistance or comments. (Apologies in advance if I missed something in the docs or online; I didn’t see anything relevant after a reasonable search, except the unapproved use of the undocumented _metadata.)

I realize there are thorny, perhaps unresolvable, problems with dataframe table-level metadata propagation (e.g. how to handle vertical dataframe concatenations), but I think what happens (by design) should be documented. As it is, I don’t know what is supposed to be happening by design here.
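
One possible workaround for the data loss described above, not taken from the thread, is to keep the table-level properties in a small JSON sidecar file next to the feather file instead of repeating them in a column. The sketch below uses only the standard library; the helper names (`sidecar_path`, `save_properties`, `load_properties`) and the `.meta.json` suffix are assumptions made for illustration, and it only handles JSON-serializable values.

```python
import json
from pathlib import Path


def sidecar_path(data_path):
    """Return the (hypothetical) sidecar location: '<data file name>.meta.json'."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".meta.json")


def save_properties(data_path, properties):
    """Write a dict of table-level properties next to the data file."""
    text = json.dumps(properties, indent=2)
    sidecar_path(data_path).write_text(text, encoding="utf-8")


def load_properties(data_path):
    """Read the properties back; return an empty dict if no sidecar exists."""
    path = sidecar_path(data_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

With the example from the report, this would mean calling `save_properties("C:/pandas/DataLoss.feather", {"someid": 24})` right after `df.to_feather(...)`, and `load_properties("C:/pandas/DataLoss.feather")` right after `pd.read_feather(...)`. The trade-off is that the two files have to be moved and versioned together.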

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

wesm commented, Nov 21, 2017 (1 reaction)

Not even pickle preserves these properties when using pandas. I don’t think this is something we intend to fix in generality:

In [1]: import pandas as pd

In [2]: df = pd.util.testing.makeDataFrame()

In [3]: df.someid = 24

In [4]: import pickle

In [5]: pickle.loads(pickle.dumps(df)).someid
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-71102f1fb31f> in <module>()
----> 1 pickle.loads(pickle.dumps(df)).someid

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'
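
The session above relies on `pd.util.testing.makeDataFrame()`, which may not be available in newer pandas releases. A minimal script version of the same check, assuming only that pandas is installed, could look like the following; the one-column frame and the variable names are arbitrary choices made for illustration.

```python
import pickle

import pandas as pd

# A tiny stand-in frame; the column name and values are arbitrary.
df = pd.DataFrame({"a": [1, 2, 3]})

# Ad hoc attribute set on the instance, as in the original report.
df.someid = 24

# Round-trip the frame through pickle in memory.
restored = pickle.loads(pickle.dumps(df))

# Compare what came back: the column data versus the ad hoc attribute.
data_survived = restored.equals(df)
attribute_survived = hasattr(restored, "someid")
print(data_survived, attribute_survived)
```

Using `hasattr` instead of direct attribute access keeps the script from stopping on the `AttributeError` shown in the traceback.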
techvslife commented, Nov 22, 2017 (0 reactions)

Just fyi, some useful cross-references. It looks like the “lost attribute” issue popped up as early as 2012: https://stackoverflow.com/questions/13250499/attributes-to-a-subclass-of-pandas-dataframe-disappear-after-pickle

One of the more comprehensive discussions I’ve found: https://github.com/pandas-dev/pandas/issues/2485

