BUG: pd.NA doesn't pickle/unpickle faithfully

Code Sample, a copy-pastable example if possible


In [5]: df['Gold Categories'].count()
Out[5]: 135218

In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0

In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>

In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False

In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType

In [10]: pd.__version__
Out[10]: '1.0.1'


Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.16-200.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

3 reactions
jorisvandenbossche commented, Feb 11, 2020

When pickling/unpickling, I can reproduce this:

In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})                                                                                                                                                             

In [41]: s.isna()                                                                                                                                                                                                  
Out[41]: 
268    False
269     True
dtype: bool

In [42]: s.to_pickle('test_na_pickle.pkl')                                                                                                                                                                         

In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')                                                                                                                                                                 

In [44]: s2.isna()                                                                                                                                                                                                 
Out[44]: 
268    False
269    False
dtype: bool

In [45]: type(s2.values[1])                                                                                                                                                                                        
Out[45]: pandas._libs.missing.NAType

In [46]: s2.values[1] is pd.NA                                                                                                                                                                                     
Out[46]: False

So apparently, when unpickling, it doesn’t return the same singleton.
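
A quick way to check whether a given pandas installation is affected is to round-trip a Series in memory and compare identities. This is only a sketch: the helper name `roundtrip` is made up here, and it assumes that `pickle.dumps`/`pickle.loads` exercise the same code path as `to_pickle`/`read_pickle`. On the 1.0.1 install reported above, both flags would be expected to come out False.

```python
import pickle

import pandas as pd


def roundtrip(obj):
    """Serialize and deserialize obj in memory with the standard pickle module."""
    return pickle.loads(pickle.dumps(obj))


restored = roundtrip(pd.Series({268: ["Fintech"], 269: pd.NA}))
value = restored.iloc[1]

# True only if unpickling handed back the pd.NA singleton itself.
same_object = value is pd.NA
# Whether pandas still recognises the restored value as missing.
detected = bool(pd.isna(value))

print(f"same object: {same_object}, detected as missing: {detected}")
```

If the two flags disagree with each other, something other than the identity problem described in this thread is going on.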

0 reactions
mephph commented, Feb 19, 2020

A simple example of the problem:

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')

In [3]: df = pd.read_pickle('na_problem.pkl')

In [4]: df.isna()

Out[4]: 
       0
0  False

In [5]: id(df.loc[0, 0]), id(pd.NA)

Out[5]: (140393643089760, 140393944655632)

This can also cause exceptions when working with dtypes other than object.

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')

In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
    168         """Validate that we only store NA or strings."""
    169         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170             raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    171         if self._ndarray.dtype != "object":
    172             raise ValueError(

ValueError: StringArray requires a sequence of strings or pandas.NA

The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.

import pandas as pd

def fix_wrong_na(df):
    for column in df.columns:
        isna_mask = df[column].apply(type) == type(pd.NA)
        # .loc avoids chained assignment, which may write to a copy.
        df.loc[isna_mask, column] = pd.NA
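
Where mutating the frame is not wanted, the same idea can be sketched with isinstance (the check flake8 prefers) so that it returns a repaired copy instead. The names `count_stray_na` and `replace_stray_na` are made up for this sketch. It assumes unique column labels and that only object-dtype columns can hold stray NA instances, which matches the examples in this thread but is not something the thread confirms in general.

```python
import pandas as pd

NA_TYPE = type(pd.NA)


def count_stray_na(series: pd.Series) -> int:
    """Count values that have NA's type but are not the pd.NA singleton."""
    return sum(1 for v in series if isinstance(v, NA_TYPE) and v is not pd.NA)


def replace_stray_na(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose object columns hold only the real pd.NA."""
    fixed = df.copy()
    for column in fixed.columns:
        if fixed[column].dtype != object:
            continue  # assumption: only object columns can hold stray instances
        mask = fixed[column].map(lambda v: isinstance(v, NA_TYPE)).to_numpy(dtype=bool)
        fixed.loc[mask, column] = pd.NA
    return fixed


# Usage: check a column after repairing a frame loaded from a pickle.
df = pd.DataFrame({"Gold Categories": pd.Series(["Fintech", pd.NA], dtype=object)})
repaired = replace_stray_na(df)
print(count_stray_na(repaired["Gold Categories"]))
```

Running `count_stray_na` on a column before and after the repair gives a cheap sanity check that nothing was missed.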