pd.NA TypeError in drop_duplicates with object dtype

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> pd.DataFrame([[1, pd.NA], [2, "a"]]).drop_duplicates()
Traceback (most recent call last):
   ...
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/frame.py", line 4859, in f
    labels, shape = algorithms.factorize(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 629, in factorize
    codes, uniques = _factorize_array(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 478, in _factorize_array
    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1806, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 1728, in pandas._libs.hashtable.PyObjectHashTable._unique
  File "pandas/_libs/missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

The same failure does not occur when the column uses an extension type:

>>> df = pd.DataFrame([[1, pd.NA], [2, "a"]], columns=list("ab"))
>>> df["b"] = df["b"].astype("string")
>>> df.drop_duplicates()
   a     b
0  1  <NA>
1  2     a
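
The workaround above can be sketched as a standalone script. This is only an illustration: the third row is added here (it is not in the original report) so that there is an actual duplicate to drop, and the variable name `deduped` is arbitrary.

```python
import pandas as pd

# Data from the report, plus one repeated row (an addition for illustration)
# so that drop_duplicates() has something to remove.
df = pd.DataFrame([[1, pd.NA], [2, "a"], [2, "a"]], columns=list("ab"))

# Cast the column holding pd.NA to the "string" extension dtype, as in the
# report, so it is no longer handled as a plain object column.
df["b"] = df["b"].astype("string")

# With the extension dtype, the missing value is handled without raising.
deduped = df.drop_duplicates()
print(deduped)
```

The row containing `pd.NA` is kept, and only the repeated `[2, "a"]` row is dropped.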

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
AnnaDaglis commented, Mar 26, 2020

I cannot reproduce the error. Has it been fixed already?

0 reactions
simonjayhawkins commented, Apr 23, 2020

Thanks @AnnaDaglis and well spotted. @jbrockmendel any idea where this might have been fixed?

fixed in #31939 (i.e. 1.0.2)

41bc226841eb59ccdfa279734dac98f7debc6249 is the first new commit
commit 41bc226841eb59ccdfa279734dac98f7debc6249
Author: Daniel Saxton 2658661+dsaxton@users.noreply.github.com
Date: Sun Feb 23 08:57:07 2020 -0600

BUG: Fix construction of Categorical from pd.NA (#31939)
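
Since the comment above places the fix in 1.0.2, one way to guard against the affected releases is to compare the installed version string against that release. The helper below is a hypothetical sketch, not part of pandas; the name `includes_na_fix` and the constant `FIXED_IN` are made up for illustration, and the 1.0.2 threshold is taken from the comment rather than verified independently.

```python
import re

# Release named in the thread as containing the fix (#31939).
FIXED_IN = (1, 0, 2)


def includes_na_fix(version: str) -> bool:
    """Return True if `version` is at or after FIXED_IN.

    Hypothetical helper: only the leading numeric "X.Y[.Z]" part is compared,
    so pre-release or local suffixes such as "rc0" or "+dev" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component (e.g. "1.1") is treated as 0.
    parts = tuple(int(p) if p is not None else 0 for p in match.groups())
    return parts >= FIXED_IN
```

Passing `pd.__version__` to `includes_na_fix` would then indicate whether the installed pandas is expected to include the fix, under the assumption that the version quoted in the thread is correct.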
