Removing Unused Levels in MultiIndex with NaN values corrupts Index
See original GitHub issueMinimal Verifiable Complete Exemple
Below a MVCE of the behavior:
import pandas as pd
# Trial Data:
data = {
'key1': list(range(6))*2
,'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None]
,'data': ['a']*12
}
# Load Data:
df0 = pd.DataFrame(data)
# Index (Int64 upcasted to Float64, because of None converted into NaN)
df1 = df0.set_index('key2')
# MultiIndex containing NaN on second level:
df2 = df0.set_index(['key1', 'key2'])
# NaN values are replaced by last existing value:
idx = df2.index.remove_unused_levels()
# Then, Index are not equal:
idx.equals(df2.index) # False
Problem description
Using method remove_unused_levels
on MultiIndex containing NaN
create a new MultiIndex that is not equal to the original as documentation says:
The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
This is why I suspect it is a bug.
Float Index
Single Index
uses NaN
as modality:
>>> df1.index
Float64Index([100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 300.0, 300.0,
nan, nan, nan],
dtype='float64', name='key2')
But, MultiIndex
does not, it has negative modality index instead:
>>> df2.index
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
names=['key1', 'key2'])
MultiIndex corruption
When refreshed, NaN
values point to a copy of the last float
modality (here 300.0
) of the level, this lead to a kind of corrupted index because those auto-filled value do not have any meaning.
>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 3]],
names=['key1', 'key2'])
As a consequence Index are not equal (which contradicts documentation):
>>> df2.index.remove_unused_levels().equals(df2.index)
False
Even worse, original value (300.0
is not referenced anymore), and then it is a unused value/modality in the newly generated index.
To confirm it, lets apply the method twice, we get:
>>> df2.index.remove_unused_levels().remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]],
names=['key1', 'key2'])
Expected Output
I believe expected output of set_index
and remove_unused_levels
should be:
>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, nan]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]],
names=['key1', 'key2'])
The problem also occurs when rows are removed from the DataFrame, and then it makes sense to use the method remove_unused_levels
to clean up index. Anyway, when building the MCVE I found it was working on the whole Index whatever the level order.
Pandas Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
this is a duplicate of #18417 and fixed in https://github.com/pandas-dev/pandas/pull/18426 (in master) already. thanks for the report.
In principle, NaNs in indexes should behave just like normal values with respect to comparison (differently form NaNs in values). However this is currently affected by #18455 (which should be fixed soon) for flat
Index
es, and #18485 (which will probably need more time) forMultiIndex
es.I don’t think so… the case of the MVCE above (with no unused levels in the input) is already covered.