question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Removing Unused Levels in MultiIndex with NaN values corrupts Index

See original GitHub issue

Minimal Verifiable Complete Exemple

Below a MVCE of the behavior:

import pandas as pd
# Trial Data:
data = {
    'key1': list(range(6))*2
   ,'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None]
   ,'data': ['a']*12
}
# Load Data:
df0 = pd.DataFrame(data)
# Index (Int64 upcasted to Float64, because of None converted into NaN)
df1 = df0.set_index('key2')
# MultiIndex containing NaN on second level:
df2 = df0.set_index(['key1', 'key2'])
# NaN values are replaced by last existing value:
idx = df2.index.remove_unused_levels()
# Then, Index are not equal:
idx.equals(df2.index) # False

Problem description

Using method remove_unused_levels on MultiIndex containing NaN create a new MultiIndex that is not equal to the original as documentation says:

The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.

This is why I suspect it is a bug.

Float Index

Single Index uses NaN as modality:

>>> df1.index
Float64Index([100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 300.0, 300.0,
              nan, nan, nan],
             dtype='float64', name='key2')

But, MultiIndex does not, it has negative modality index instead:

>>> df2.index
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
           names=['key1', 'key2'])

MultiIndex corruption

When refreshed, NaN values point to a copy of the last float modality (here 300.0) of the level, this lead to a kind of corrupted index because those auto-filled value do not have any meaning.

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 3]],
           names=['key1', 'key2'])

As a consequence Index are not equal (which contradicts documentation):

>>> df2.index.remove_unused_levels().equals(df2.index)
False

Even worse, original value (300.0 is not referenced anymore), and then it is a unused value/modality in the newly generated index.

To confirm it, lets apply the method twice, we get:

>>> df2.index.remove_unused_levels().remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]],
           names=['key1', 'key2'])

Expected Output

I believe expected output of set_index and remove_unused_levels should be:

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, nan]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]],
           names=['key1', 'key2'])

The problem also occurs when rows are removed from the DataFrame, and then it makes sense to use the method remove_unused_levels to clean up index. Anyway, when building the MCVE I found it was working on the whole Index whatever the level order.

Pandas Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Comments:5 (4 by maintainers)

github_iconTop GitHub Comments

1reaction
jrebackcommented, Dec 3, 2017

this is a duplicate of #18417 and fixed in https://github.com/pandas-dev/pandas/pull/18426 (in master) already. thanks for the report.

0reactions
toobazcommented, Dec 4, 2017

Yes I guess dealing with NaN in Index may lead to a lot of troubles (this entity does not behave well, such as comparison).

In principle, NaNs in indexes should behave just like normal values with respect to comparison (differently form NaNs in values). However this is currently affected by #18455 (which should be fixed soon) for flat Indexes, and #18485 (which will probably need more time) for MultiIndexes.

cc @toobaz anything in here we should add to tests?

I don’t think so… the case of the MVCE above (with no unused levels in the input) is already covered.

Read more comments on GitHub >

github_iconTop Results From Across the Web

pandas.MultiIndex.remove_unused_levels
Create new MultiIndex from current that removes unused levels. Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex...
Read more >
How do you update the levels of a pandas MultiIndex after ...
This returns the current active values within that index level. ... To remove "CAN" from the enumeration, you have to alter the column...
Read more >
Pandas Drop Level From Multi-Level Column Index
By using DataFrame.droplevel() or DataFrame.columns.droplevel() you can drop a level from multi-level column index from pandas DataFrame.
Read more >
What's New — pandas 0.23.4 documentation
All-NaN levels in a MultiIndex are now assigned float rather than object dtype, promoting consistency with Index (GH17929). Levels names of a MultiIndex...
Read more >
Drop specific rows from multiindex Pandas Dataframe
level : parameter indicates the level number like 0,1 etc to help identify and manipulate a particular level of data in a multi-index...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found