read_feather method does not restore a dataframe's table-level user-defined properties (attributes or tags or metadata)
I create a property named someid (it is NOT a column) in a pandas dataframe named df and assign it a value:
df.someid = 24
I make sure the property is there:
print(df.someid) #prints 24
I save the dataframe to a feather file:
df.to_feather("C:/pandas/DataLoss.feather")
I read the feather file back into a dataframe:
df = pd.read_feather("C:/pandas/DataLoss.feather")
I try to retrieve the property from the dataframe:
print(df.someid)
And I do not get its value (24) back. Instead I get this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-99-42d4540986ab> in <module>()
----> 1 print(df.someid)
~\Anaconda3\envs\env_feather\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'someid'
- This may well be by design, although afaik it is undocumented, but it seems to me that an error should be raised whenever user data is lost, at least if the feather format is intended to be the native storage format for a pandas dataframe. (However, the inability to store indexes is documented; fwiw, I wish that limitation weren’t present either.)
- Or it may be there is a newer, better way of specifying user properties (tags / metadata) that I don’t know about, in which case perhaps the old method should give a warning and be deprecated.
- Note that I could store the property as a column (as a repeating constant), but that uses significant space as the number of rows increases, because the value repeats on every row, even if stored as a categorical. That wouldn’t be an issue if there were compression, but there appears not to be any. I could use parquet format instead, and perhaps in the latest release that would be recommended over feather format. (I haven’t seen a good comparison of the pros and cons of each.)
Thanks for any assistance or comments. (Apologies in advance if I missed something in the docs or online; I didn’t see anything relevant after a reasonable search, except the unapproved use of the undocumented _metadata.)
I realize there are thorny, perhaps unresolvable, problems with dataframe table-level metadata propagation (e.g. how to handle vertical dataframe concatenations), but I think what happens (by design) should be documented. As it is, I don’t know what is supposed to be happening by design here.
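One workaround that avoids both the repeated column and any reliance on the file format is to keep the table-level values in a small sidecar file next to the feather file. The following is only a sketch under assumptions: the helper names (sidecar_path, write_metadata, read_metadata) and the ".meta.json" suffix are made up for illustration and are not part of pandas or feather, and the values must be JSON-serializable.

```python
import json
from pathlib import Path


def sidecar_path(data_path):
    """Return the (hypothetical) sidecar location, e.g. DataLoss.feather.meta.json."""
    p = Path(data_path)
    return p.with_name(p.name + ".meta.json")


def write_metadata(data_path, metadata):
    """Write a JSON-serializable dict of table-level values next to the data file."""
    target = sidecar_path(data_path)
    target.write_text(json.dumps(metadata), encoding="utf-8")
    return target


def read_metadata(data_path):
    """Read the values back; return an empty dict if no sidecar file exists."""
    target = sidecar_path(data_path)
    if not target.exists():
        return {}
    return json.loads(target.read_text(encoding="utf-8"))


# Intended use with the example above (needs pandas and a feather backend):
#   df.to_feather("C:/pandas/DataLoss.feather")
#   write_metadata("C:/pandas/DataLoss.feather", {"someid": 24})
#   ...
#   df = pd.read_feather("C:/pandas/DataLoss.feather")
#   df.attrs.update(read_metadata("C:/pandas/DataLoss.feather"))
```

The commented usage puts the restored values into df.attrs, which exists in pandas 1.0 and later and is documented as experimental; a plain dict kept alongside the dataframe would work as well. The obvious cost is that the two files have to be moved and deleted together.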
Issue Analytics
- Created 6 years ago
- Reactions:1
- Comments:5 (2 by maintainers)
Top GitHub Comments
Not even pickle preserves these properties when using pandas. I don’t think this is something we intend to fix in generality.
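The snippet originally attached to this comment is not reproduced here. A minimal sketch of such a check might look like the following; the dataframe contents and the column name a are made up for illustration, and the behavior noted in the comments is what I would expect from pandas, not a captured output.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.someid = 24  # ad hoc attribute on the object, not a column

# Round-trip the dataframe through pickle in memory.
restored = pickle.loads(pickle.dumps(df))

print(hasattr(df, "someid"))        # the original object still carries it
print(hasattr(restored, "someid"))  # the restored copy is expected to have lost it
print(restored.equals(df))          # the column data itself round-trips
```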
(just fyi: some useful cross-references: Looks like the “lost attribute” issue popped up as early as 2012: https://stackoverflow.com/questions/13250499/attributes-to-a-subclass-of-pandas-dataframe-disappear-after-pickle
One of the more comprehensive discussions I’ve found: https://github.com/pandas-dev/pandas/issues/2485 )