BUG: set_index on more than 1 column changes boolean values

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Create a DataFrame whose 'group_b' column mixes booleans with the integers 0 and 1
df = pd.DataFrame({'group_a': [1, 2, 1, 2], 'group_b': [0, 1, True, False], 'value': range(4)})

print(df)

# After set_index on two columns, True and False in 'group_b' are changed
print(df.set_index(['group_a', 'group_b']))

# The problem does not occur if 0/1 are not present in the column
df = pd.DataFrame({'group_a': [1, 2, 1, 2], 'group_b': [2, 3, True, False], 'value': range(4)})

print(df)

print(df.set_index(['group_a', 'group_b']))

Issue Description

Calling set_index on more than one column may change boolean values to integers or floats.

This only happens when 0, 0.0, 1, or 1.0 appear in the same column as True/False.
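A possible explanation, offered as an assumption rather than a confirmed description of pandas internals: in Python, True == 1 and False == 0, and each pair has the same hash, so any hash-based deduplication treats them as the same key and keeps whichever was seen first. The sketch below uses a hypothetical pure-Python factorize helper (not the pandas function) to show how rebuilding values from deduplicated levels would turn True/False into 1/0 only when 0/1 are already present.

```python
def factorize(values):
    """Hypothetical helper, not pandas code: map each value to an integer code."""
    seen = {}      # value -> code; dict lookup relies on hash() and ==
    uniques = []   # first occurrence of each distinct value
    codes = []
    for v in values:
        if v not in seen:
            seen[v] = len(uniques)
            uniques.append(v)
        codes.append(seen[v])
    return codes, uniques


# Column that mixes 0/1 with booleans: True collides with 1, False with 0
codes, uniques = factorize([0, 1, True, False])
rebuilt = [uniques[c] for c in codes]
print(codes, uniques, rebuilt)  # [0, 1, 1, 0] [0, 1] [0, 1, 1, 0]

# Column without 0/1: the booleans survive as their own entries
codes2, uniques2 = factorize([2, 3, True, False])
print(uniques2)  # [2, 3, True, False]
```

Whether set_index follows exactly this path is not confirmed by the issue; the sketch only illustrates why the presence of 0/1 would matter.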

Expected Behavior

set_index should not change the values of the columns it turns into index levels.
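Until the behavior is settled, one way to notice the risky case up front is to check whether a column mixes booleans with numbers equal to 0 or 1. The helper below is a hypothetical sketch (the name mixes_bools_with_0_or_1 is not a pandas API) and works on any iterable of Python objects.

```python
import numbers


def mixes_bools_with_0_or_1(values):
    """Hypothetical check: True if booleans share a column with 0, 0.0, 1 or 1.0."""
    values = list(values)  # allow generators; we iterate twice
    has_bool = any(isinstance(v, bool) for v in values)
    has_equal_number = any(
        isinstance(v, numbers.Real)
        and not isinstance(v, bool)  # bool is a subclass of int, so exclude it
        and v in (0, 1)
        for v in values
    )
    return has_bool and has_equal_number


print(mixes_bools_with_0_or_1([0, 1, True, False]))  # True
print(mixes_bools_with_0_or_1([2, 3, True, False]))  # False
```

Assuming an object-dtype column, it could be called as mixes_bools_with_0_or_1(df['group_b']) before set_index to flag the columns from the example above.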

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d07b4858144c219b9346329027024102ab6
python : 3.8.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Fri Mar 5 01:14:14 PST 2021; root:xnu-7195.101.1~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : 0.29.28
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.3.0
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : 0.53.1
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.02.0
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

alonme commented, Nov 23, 2022 (1 reaction)

OK.

I would gladly help find other places that show this behavior, or implement a fix.

rhshadrach commented, Nov 23, 2022 (0 reactions)

I think this needs more discussion; my current thought is that while this behavior is undesirable, modifying it may be more undesirable.

Another example: pandas specifically handles an attempt to index with a boolean label, but not with a float.

df = pd.DataFrame({'a': [1, 0, 2], 'b': [3, 4, 5]}).set_index('a')

print(df)
#    b
# a   
# 1  3
# 0  4
# 2  5

print(df.loc[True])
# KeyError: 'True: boolean label can not be used without a boolean index'

print(df.loc[1.0])
# b    3
# Name: 1, dtype: int64