BUG: slicing a MultiIndex does not preserve the sequence of the index since pandas 1.2.0rc0

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Between 1.1.5 and 1.2.0rc2 the behavior of slicing a MultiIndex changed. Up to 1.1.5, slicing a MultiIndex with df.loc[(slice(None), some_items), :] preserved the row sequence of the original DataFrame. From 1.2.0rc0 on, the sequence is changed.

Code Sample

import pandas as pd

print("pandas version %s" % pd.__version__)

index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 7), (1, 6),
                                   (2, 2), (2, 3), (2, 8), (2, 7)])

all_items = index.get_level_values(1) # all items from level 1

df = pd.DataFrame({'x': range(8)}, index=index)


df_sliced = df.loc[(slice(None), all_items), :]
# df_sliced should be identical to df, because all_items contains all items from level 1

print(df_sliced)

pd.testing.assert_frame_equal(df, df_sliced)
# works if and only if pd.__version__ < 1.2.0rc0

print("Success")

Problem description

Running the sample code in 1.1.5 gives the following output:

pandas version 1.1.5
     x
1 1  0
  2  1
  7  2
  6  3
2 2  4
  3  5
  8  6
  7  7
Success

whereas in 1.2.4 it gives

pandas version 1.2.4
     x
1 1  0
  6  3
  2  1
2 2  4
  3  5
  8  6
1 7  2
2 7  7
Traceback (most recent call last):
  File "tmp/pandas_demo.py", line 19, in <module>
    pd.testing.assert_frame_equal(df, df_sliced)
  File "/home/jmu3si/Devel/pylife/.venv/lib/python3.8/site-packages/pandas/_testing.py", line 1657, in assert_frame_equal
    assert_index_equal(
  File "/home/jmu3si/Devel/pylife/.venv/lib/python3.8/site-packages/pandas/_testing.py", line 805, in assert_index_equal
    assert_index_equal(
  File "/home/jmu3si/Devel/pylife/.venv/lib/python3.8/site-packages/pandas/_testing.py", line 825, in assert_index_equal
    _testing.assert_almost_equal(
  File "pandas/_libs/testing.pyx", line 46, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 161, in pandas._libs.testing.assert_almost_equal
  File "/home/jmu3si/Devel/pylife/.venv/lib/python3.8/site-packages/pandas/_testing.py", line 1073, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: MultiIndex level [0] are different

MultiIndex level [0] values are different (25.0 %)
[left]:  Int64Index([1, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
[right]: Int64Index([1, 1, 1, 2, 2, 2, 1, 2], dtype='int64')
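
The 25.0 % in the message is consistent with the two level-0 arrays differing in two of eight positions. A minimal sketch with NumPy that reproduces the figure from the values printed above (this only illustrates the arithmetic and is not meant to mirror pandas' internal implementation):

```python
import numpy as np

# Level-0 values copied from the assertion message above.
left = np.array([1, 1, 1, 1, 2, 2, 2, 2])
right = np.array([1, 1, 1, 2, 2, 2, 1, 2])

# Element-wise comparison marks the positions where the two sides disagree.
mismatch = left != right
mismatch_positions = np.flatnonzero(mismatch).tolist()
mismatch_percent = float(mismatch.mean() * 100)

print(mismatch_positions)
print(mismatch_percent)
```

The two mismatching positions correspond to the rows (1, 7) and (2, 7), which moved to the end of their groups in the 1.2.4 output.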

I am not sure whether this is necessarily a problem. I stumbled across it because a test suite that uses pd.testing.assert_frame_equal() failed because of it. So either the original sequence should be preserved when slicing a DataFrame, or pd.testing.assert_frame_equal() should not fail when the rows are reordered but the index contents are otherwise correct.
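
If a test only cares that the same rows are present, one possible workaround is to sort both frames by their index before comparing. The sketch below rebuilds the 1.2.4 row order by hand with iloc (an assumption made so that the example behaves the same on any pandas version) and shows that the strict comparison fails while the sorted comparison passes. The helper name assert_frame_equal_ignoring_row_order is hypothetical and not part of pandas.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 7), (1, 6),
                                   (2, 2), (2, 3), (2, 8), (2, 7)])
df = pd.DataFrame({'x': range(8)}, index=index)

# Reorder rows by position to mimic the 1.2.4 output shown above.
df_shuffled = df.iloc[[0, 3, 1, 4, 5, 6, 2, 7]]


def assert_frame_equal_ignoring_row_order(left, right):
    """Hypothetical helper: compare two frames after sorting both by index."""
    pd.testing.assert_frame_equal(left.sort_index(), right.sort_index())


# The strict comparison is order-sensitive and raises AssertionError.
try:
    pd.testing.assert_frame_equal(df, df_shuffled)
    strict_passed = True
except AssertionError:
    strict_passed = False

# The sorted comparison passes because both frames hold the same rows.
assert_frame_equal_ignoring_row_order(df, df_shuffled)
print("strict comparison passed:", strict_passed)
```

Sorting both sides means the check can no longer detect a reordering, which may or may not be what a given test wants.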

Expected Output

As discussed above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-71-lowlatency
Version : #79-Ubuntu SMP PREEMPT Wed Mar 24 12:38:51 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.2.4 (resp. 1.1.5)
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.17.0
xlrd : None
xlwt : None
numba : None

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
phofl commented, Apr 16, 2021

The ordering is wrong for duplicates like in your example. I will fix this shortly, but let's leave this open until then.

1 reaction
phofl commented, Apr 16, 2021

Intersection with sort=False is the first thing that comes to mind.

In theory, loc should select in the order of your indexer, so your incoming sequence has to be in the same order somehow.
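
For code that must not depend on the order .loc produces, one possible workaround is a boolean mask, which keeps rows in their original order no matter how the requested items are ordered. A minimal sketch, assuming the selection is on level 1 only; some_items is a made-up selection for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 7), (1, 6),
                                   (2, 2), (2, 3), (2, 8), (2, 7)])
df = pd.DataFrame({'x': range(8)}, index=index)

# Hypothetical selection, deliberately unordered and with a duplicate.
some_items = [7, 2, 7]

# Boolean mask over level 1; boolean masking does not reorder rows.
mask = df.index.get_level_values(1).isin(some_items)
df_masked = df[mask]

print(df_masked)
```

Note that this differs from the loc semantics described in the comment: the mask follows the order of the frame, not the order of the indexer.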
