BUG: pd.NA doesn't pickle/unpickle faithfully

Code Sample, a copy-pastable example if possible


In [5]: df['Gold Categories'].count()
Out[5]: 135218

In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0

In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>

In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False

In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType

In [10]: pd.__version__
Out[10]: '1.0.1'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.16-200.fc30.x86_64
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : nb_NO.UTF-8
LOCALE           : nb_NO.UTF-8

pandas           : 1.0.1
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 41.6.0.post20191030
Cython           : 0.29.13
pytest           : 5.2.2
hypothesis       : None
sphinx           : 2.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.2.2
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.9.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : 1.2.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.1
matplotlib       : 2.2.3
numexpr          : 2.7.0
odfpy            : None
openpyxl         : 3.0.0
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
pytest           : 5.2.2
pyxlsb           : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : 1.3.10
tables           : 3.5.2
tabulate         : 0.8.5
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.2
numba            : 0.46.0

Issue Analytics

Created 4 years ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Feb 11, 2020

When pickling/unpickling, I can reproduce this:

In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})                                                                                                                                                             

In [41]: s.isna()                                                                                                                                                                                                  
Out[41]: 
268    False
269     True
dtype: bool

In [42]: s.to_pickle('test_na_pickle.pkl')                                                                                                                                                                         

In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')                                                                                                                                                                 

In [44]: s2.isna()                                                                                                                                                                                                 
Out[44]: 
268    False
269    False
dtype: bool

In [45]: type(s2.values[1])                                                                                                                                                                                        
Out[45]: pandas._libs.missing.NAType

In [46]: s2.values[1] is pd.NA                                                                                                                                                                                     
Out[46]: False

So apparently, when unpickling, it doesn’t return the same singleton.
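To illustrate the singleton point without pandas, here is a minimal sketch using only the standard library. The class names, the `is_na` helper, and the identity-based check are hypothetical and chosen for illustration; this is not pandas' implementation, and the assumption that a missing-value check compares by identity is only inferred from the output above.

```python
import pickle


class PlainNAType:
    """Intended singleton with default pickling; unpickling builds a new object."""

    def __repr__(self):
        return "<NA>"


class SingletonNAType:
    """Singleton whose pickle round trip returns the one shared instance."""

    _instance = None

    def __new__(cls):
        # Create the instance once, then always hand back the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __reduce__(self):
        # Tell pickle to rebuild by calling the class, which returns the singleton.
        return (type(self), ())

    def __repr__(self):
        return "<NA>"


PLAIN_NA = PlainNAType()
SINGLETON_NA = SingletonNAType()


def is_na(value, sentinel):
    """Identity-based check (an assumption made for this sketch)."""
    return value is sentinel


plain_copy = pickle.loads(pickle.dumps(PLAIN_NA))
singleton_copy = pickle.loads(pickle.dumps(SINGLETON_NA))

# The plain copy has the right type but is a different object.
print(type(plain_copy) is PlainNAType, is_na(plain_copy, PLAIN_NA))
# The singleton copy is the very same object as the original.
print(is_na(singleton_copy, SINGLETON_NA))
```

In this sketch the first round trip yields an object of the right type that fails the identity check, which mirrors the `type(...)` and `is pd.NA` results shown above, while the second round trip passes because unpickling goes back through the class.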

mephph commented, Feb 19, 2020

A simple example of the problem:

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')

In [3]: df = pd.read_pickle('na_problem.pkl')

In [4]: df.isna()

Out[4]: 
       0
0  False

In [5]: id(df.loc[0, 0]), id(pd.NA)

Out[5]: (140393643089760, 140393944655632)

This can also cause exceptions when working with dtypes other than object.

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')

In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
    168         """Validate that we only store NA or strings."""
    169         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170             raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    171         if self._ndarray.dtype != "object":
    172             raise ValueError(

ValueError: StringArray requires a sequence of strings or pandas.NA

The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.

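import pandas as pd

def fix_wrong_na(df):
    """Replace stray NAType instances with the pd.NA singleton, in place."""
    for column in df.columns:
        # Match on type: a stray NA is an NAType but is not the pd.NA object.
        isna_mask = df[column].apply(type) == type(pd.NA)
        df.loc[isna_mask, column] = pd.NA  # .loc avoids chained assignment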

