pd.NA TypeError in drop_duplicates with object dtype

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> pd.DataFrame([[1, pd.NA], [2, "a"]]).drop_duplicates()
Traceback (most recent call last):
   ...
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/frame.py", line 4859, in f
    labels, shape = algorithms.factorize(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 629, in factorize
    codes, uniques = _factorize_array(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 478, in _factorize_array
    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1806, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 1728, in pandas._libs.hashtable.PyObjectHashTable._unique
  File "pandas/_libs/missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

The same failure does not occur when the column uses an extension dtype such as "string":

>>> df = pd.DataFrame([[1, pd.NA], [2, "a"]], columns=list("ab"))
>>> df["b"] = df["b"].astype("string")
>>> df.drop_duplicates()
   a     b
0  1  <NA>
1  2     a
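
The traceback above ends in NAType.__bool__, which suggests the object-dtype hashtable tried to evaluate pd.NA as a boolean. As a minimal sketch (not the maintainers' fix), the snippet below shows that coercion raising on its own, then applies the same "string" cast from the example as a workaround. The extra duplicate row and the variable names na_error and deduped are assumptions added for illustration.

```python
import pandas as pd

# pd.NA refuses to be coerced to a plain bool; this is the error the
# traceback surfaces from inside the object hashtable.
try:
    bool(pd.NA)
    na_error = None
except TypeError as exc:
    na_error = str(exc)

print(na_error)

# Workaround mirroring the issue's example: cast the object column to the
# "string" extension dtype before dropping duplicates. A third row that
# duplicates the first is added here (an assumption) so a row is dropped.
df = pd.DataFrame([[1, pd.NA], [2, "a"], [1, pd.NA]], columns=list("ab"))
df["b"] = df["b"].astype("string")
deduped = df.drop_duplicates()
print(deduped)
```

The cast only makes sense when the object column really holds strings plus missing values, as it does here; a column of mixed types would need a different approach.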

Top GitHub Comments

AnnaDaglis commented, Mar 26, 2020

I cannot reproduce the error. Has it been fixed already?

simonjayhawkins commented, Apr 23, 2020

Thanks @AnnaDaglis and well spotted. @jbrockmendel any idea where this might have been fixed?

Fixed in #31939 (i.e. 1.0.2).

41bc226841eb59ccdfa279734dac98f7debc6249 is the first new commit
commit 41bc226841eb59ccdfa279734dac98f7debc6249
Author: Daniel Saxton 2658661+dsaxton@users.noreply.github.com
Date:   Sun Feb 23 08:57:07 2020 -0600

    BUG: Fix construction of Categorical from pd.NA (#31939)
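
Since the thread places the fix in 1.0.2, one way to check whether a given environment is still affected is to rerun the reproducer and catch the error. This is only a sketch: the helper name reproduces_na_error is hypothetical and is not part of pandas.

```python
import pandas as pd


def reproduces_na_error():
    """Return True if the issue's example still raises a TypeError here.

    Hypothetical helper for illustration; it simply reruns the reproducer
    from the issue against the installed pandas.
    """
    try:
        pd.DataFrame([[1, pd.NA], [2, "a"]]).drop_duplicates()
    except TypeError:
        return True
    return False


status = "affected" if reproduces_na_error() else "not affected"
print(f"pandas {pd.__version__}: {status}")
```

Running the reproducer directly avoids parsing version strings, which can carry development or release-candidate suffixes.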
