
BUG: DataFrame.drop_duplicates confuses NULL bytes


Code Sample, a copy-pastable example

import pandas as pd
import pandas.testing as pdt

df = pd.DataFrame({"col": ["", "\0"]})
ser = df["col"].copy()

ser_actual = ser.drop_duplicates()
ser_expected = pd.Series(["", "\0"], name="col")
pdt.assert_series_equal(ser_actual, ser_expected)  # passes

df_actual = df.drop_duplicates()
df_expected = pd.DataFrame({"col": ["", "\0"]})
pdt.assert_frame_equal(df_actual, df_expected)  # fails, only a single row left

Problem description

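The test fails. Note in particular the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates: the Series call keeps both values, while the DataFrame call collapses them into a single row.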
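Until the inconsistency is resolved, one possible workaround is to compute the duplicate mask on a copy in which NUL bytes have been escaped, and then apply that mask to the original frame. The sketch below is only an illustration: escape_nul and drop_duplicates_nul_safe are hypothetical helper names, not part of pandas, and the approach assumes the affected columns hold plain Python strings.

```python
import pandas as pd


def escape_nul(value):
    """Return a NUL-free stand-in for a string (hypothetical helper)."""
    if isinstance(value, str):
        # Escape backslashes first so distinct inputs stay distinct.
        return value.replace("\\", "\\\\").replace("\0", "\\0")
    return value


def drop_duplicates_nul_safe(df):
    """Drop duplicate rows, comparing NUL-escaped copies of the values."""
    escaped = df.apply(lambda col: col.map(escape_nul))
    # The mask is computed on the escaped copy; the rows returned are the originals.
    return df[~escaped.duplicated()]


df = pd.DataFrame({"col": ["", "\0", ""]})
result = drop_duplicates_nul_safe(df)
print(len(result))  # 2: the empty string and the NUL string are both kept
```

Because the comparison never sees a raw NUL byte, the result does not depend on which hash table pandas uses internally.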

Expected Output

Test passes.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

Created 3 years ago
Comments: 12 (9 by maintainers)

Top GitHub Comments

simonjayhawkins commented, Jun 13, 2022

And then, for factorize we actually have a string-specialized hash table (while duplicated only has a general object dtype specialization that is used). So for factorize, comparing those two hashtables:

There is an open issue (#14860) to use StringHashTable for value_counts / duplicated with strings, which should address this inconsistency.
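To see whether a given pandas installation treats the two code paths differently, the two values from the original report can be passed through pd.factorize and Series.duplicated side by side. This is a small diagnostic sketch rather than the comparison from the comment above. No expected output is shown because the result depends on the installed version.

```python
import numpy as np
import pandas as pd

# The two distinct strings from the original report.
values = np.array(["", "\0"], dtype=object)

# factorize assigns an integer code to each distinct value it recognizes.
codes, uniques = pd.factorize(values)

# duplicated flags values that were already seen earlier in the Series.
dup_mask = pd.Series(values).duplicated()

print("factorize codes:", codes.tolist())
print("duplicated mask:", dup_mask.tolist())
```

If both entries receive the same factorize code while the duplicated mask marks neither as a duplicate, the installation shows the inconsistency discussed here.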

cr-perry commented, Jan 16, 2022

I’m experiencing the same issue when creating a pd.MultiIndex. Given two distinct input values that are identical up to their null character, the index maps them to a single code value, and both end up assigned the first string value.

My research trail led me to factorize, StringHashTable and (the elusive) kh_get_str. Not sure how to proceed but happy to help (and of course, add my +1 for the issue).

My reproduction:

>>> a = pd.MultiIndex.from_tuples([('test\x001',), ('test\x002',)])
>>> print(a.levels, a.codes)
[['test1']] [[0, 0]]

note: pandas 1.0.1
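One plausible reading of the behavior above, offered as an assumption and not confirmed in this thread, is that a C-string-based hash table only sees each value up to its first NUL byte. The pure-Python sketch below simulates that view; as_c_string is a hypothetical helper for illustration and does not reflect the actual pandas or khash code.

```python
def as_c_string(value: str) -> str:
    """Return the prefix a NUL-terminated reader would see (illustrative only)."""
    return value.split("\0", 1)[0]


pairs = [("", "\0"), ("test\x001", "test\x002")]
for left, right in pairs:
    # The Python strings are distinct, but their NUL-terminated views collide.
    print(repr(left), repr(right), left == right, as_c_string(left) == as_c_string(right))
```

Under this assumption, both pairs from the reports collapse to a single key ("" and "test" respectively), which would match values being mapped to one code.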
