BUG: DataFrame.drop_duplicates confuses NULL bytes

Code Sample, a copy-pastable example

import pandas as pd
import pandas.testing as pdt

df = pd.DataFrame({"col": ["", "\0"]})
ser = df["col"].copy()

ser_actual = ser.drop_duplicates()
ser_expected = pd.Series(["", "\0"], name="col")
pdt.assert_series_equal(ser_actual, ser_expected)  # passes

df_actual = df.drop_duplicates()
df_expected = pd.DataFrame({"col": ["", "\0"]})
pdt.assert_frame_equal(df_actual, df_expected)  # fails, only a single row left

Problem description

The test fails. Note in particular the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates: the Series keeps both values, while the DataFrame is left with only a single row.
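Until the two methods agree, one possible workaround is to deduplicate at the Python level, where str hashing and equality take NUL characters into account. The following is only a minimal sketch: the helper name drop_duplicates_nul_safe is hypothetical (it is not part of pandas), it assumes every cell is hashable, it does not treat NaN values as equal the way pandas does, and it will be much slower than the built-in method on large frames.

```python
import numpy as np
import pandas as pd


def drop_duplicates_nul_safe(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each row, comparing rows as Python tuples.

    Hypothetical helper, not a pandas API. Rows are compared with Python's
    own tuple/str equality, so "" and "\0" count as different values.
    """
    seen = set()
    keep = []
    for row in df.itertuples(index=False, name=None):
        keep.append(row not in seen)  # True only for the first occurrence
        seen.add(row)
    # Boolean mask as an ndarray so that an empty frame is handled as well.
    return df.loc[np.asarray(keep, dtype=bool)]


df = pd.DataFrame({"col": ["", "\0", ""]})
result = drop_duplicates_nul_safe(df)
print(result["col"].tolist())  # ['', '\x00']
```

The original index labels of the kept rows are preserved, as with drop_duplicates without ignore_index.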

Expected Output

Test passes.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

1 reaction
simonjayhawkins commented, Jun 13, 2022

And then, for factorize, we actually have a string-specialized hash table (whereas duplicated only uses the general object-dtype specialization). So for factorize, comparing those two hashtables:

There is an open issue, #14860, to use StringHashTable for value_counts / duplicated with strings, which should address this inconsistency.

1 reaction
cr-perry commented, Jan 16, 2022

I’m experiencing the same issue when creating a pd.MultiIndex. Given two distinct input values that are identical up to their null character, the index maps them to a single code value, and both then end up assigned the first string value.

My research trail led me to factorize, StringHashTable and (the elusive) kh_get_str. Not sure how to proceed but happy to help (and of course, add my +1 for the issue).

My reproduction:

>>> a = pd.MultiIndex.from_tuples([('test\x001',), ('test\x002',)])
>>> print(a.levels, a.codes)
[['test1']] [[0, 0]]

note: pandas 1.0.1
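To check whether a dataset is exposed to this kind of collision, one option is to look for distinct strings that become equal once everything from the first NUL character onwards is cut off. The sketch below assumes that the string hash table behind kh_get_str compares values as C strings (which end at the first NUL); that is an assumption drawn from this thread, not a verified description of the pandas internals, and the helper name nul_truncation_collisions is hypothetical.

```python
from collections import defaultdict


def nul_truncation_collisions(values):
    """Group distinct strings that share the same prefix before the first NUL.

    Hypothetical diagnostic helper: a group with more than one member marks
    values that a C-string based comparison could treat as identical.
    """
    groups = defaultdict(set)
    for value in values:
        if isinstance(value, str):
            prefix = value.split("\0", 1)[0]  # text before the first NUL
            groups[prefix].add(value)
    return {p: sorted(vs) for p, vs in groups.items() if len(vs) > 1}


print(nul_truncation_collisions(["test\x001", "test\x002", "other"]))
# {'test': ['test\x001', 'test\x002']}
```

An empty result means no two distinct strings in the input share a prefix in this sense; it says nothing about other causes of unexpected deduplication.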
