BUG: "cannot reindex from a duplicate axis" raised with a unique index, duplicated column names, and specific numpy array values
Code Sample
import pandas
import numpy as np

a = np.array([[1, 2], [3, 4]])

# This one does NOT work:
b = np.array([[0.5, 6], [7, 8]])
# b = np.array([[.5, 6], [7, 8]])  # Same problem
# This one works fine:
# b = np.array([[5, 6], [7, 8]])

dfA = pandas.DataFrame(a)
# This works fine EVEN with .5, because the column names are different:
# dfA = pandas.DataFrame(a, columns=['a', 'b'])
dfB = pandas.DataFrame(b)

# Concatenating side by side yields duplicated column names: 0, 1, 0, 1
df_new = pandas.concat([dfA, dfB], axis=1)
print(df_new[df_new > 5])
Problem description
There is a bug that appears when specific numpy values are combined with duplicated DataFrame column names and a selection operation such as df[df > 5] is used. An exception is raised saying “cannot reindex from a duplicate axis”, but it should not be, because:
- The DataFrame has no duplicated index values (df.index.is_unique is True)
- The DataFrame has duplicated column names, but that should not be a problem when we apply a selection operation such as df_new[df_new > 5]
- The DataFrame uses float or int numpy values, so that should not change the behavior of the code
However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
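The three points above can be checked directly. The sketch below rebuilds both the failing and the working inputs from the code sample and inspects the index, the columns, and the column dtypes. The variable names are chosen here for illustration only. It shows that the float input leaves the concatenated frame with mixed integer and float columns, while the all-integer input does not. That may be why the values matter, but it is an observation, not a diagnosis confirmed in this report.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b_float = np.array([[0.5, 6], [7, 8]])  # failing case from the report
b_int = np.array([[5, 6], [7, 8]])      # working case from the report

df_float = pd.concat([pd.DataFrame(a), pd.DataFrame(b_float)], axis=1)
df_int = pd.concat([pd.DataFrame(a), pd.DataFrame(b_int)], axis=1)

# Row labels are unique, column labels (0, 1, 0, 1) are not.
index_is_unique = df_float.index.is_unique
columns_are_unique = df_float.columns.is_unique

# Column dtypes as strings, one entry per column.
float_case_dtypes = [str(t) for t in df_float.dtypes]
int_case_dtypes = [str(t) for t in df_int.dtypes]

print(index_is_unique, columns_are_unique)
print(float_case_dtypes)
print(int_case_dtypes)
```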
Expected Output
     0    1    0  1
0  NaN  NaN  NaN  6
1  NaN  NaN  7.0  8
Current Output
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
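Until a fixed pandas version is available, one possible workaround is to apply the mask on the underlying numpy values, which never touches label-based reindexing. This is a sketch that was not proposed in the original thread. It assumes every column is numeric, and it converts all columns to float, so the result prints 6.0 and 8.0 rather than 6 and 8.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[0.5, 6], [7, 8]])
df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)

# Work on a plain float array so no column labels are involved.
values = df_new.to_numpy(dtype=float)

# Keep values greater than 5, replace the rest with NaN,
# then restore the original (duplicated) column labels.
masked = pd.DataFrame(
    np.where(values > 5, values, np.nan),
    index=df_new.index,
    columns=df_new.columns,
)
print(masked)
```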
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
Issue Analytics
- Created 4 years ago
- Comments: 32 (17 by maintainers)
Top GitHub Comments
Yup 😄 @GabrielSimonetto if you wanted to submit a test to make sure this doesn’t break again in the future, that would be welcome!
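A regression test along the lines suggested above might look like the following sketch. The test name and its layout are hypothetical and are not taken from the pandas test suite. It assumes a pandas version in which the bug is fixed, and that masked integer columns are upcast to float. On pandas 1.0.1 it is expected to fail with the ValueError from the report.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_boolean_mask_with_duplicate_columns():  # hypothetical test name
    # Same inputs as the code sample: int frame next to a float frame.
    a = np.array([[1, 2], [3, 4]])
    b = np.array([[0.5, 6], [7, 8]])
    df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)

    # Boolean-frame selection on a frame with duplicated column names.
    result = df_new[df_new > 5]

    expected = pd.DataFrame(
        [[np.nan, np.nan, np.nan, 6.0], [np.nan, np.nan, 7.0, 8.0]],
        columns=[0, 1, 0, 1],
    )
    assert_frame_equal(result, expected)
```

It can be run with pytest, or called directly as a plain function.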
@MarcoGorelli I found that building instead with the command
CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i
generally fixes things, although I’m not sure if it’ll work in this case. There’s a thread about these problems here: https://github.com/pandas-dev/pandas/issues/33315