Bug: groupby with sort=False creates buggy MultiIndex

import numpy as np
import pandas as pd

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
# sort=False keeps the groups in order of first appearance
t = df.groupby(['d', 'a'], sort=False).mean()

The index of t is certainly not sorted, but t.index.is_lexsorted() returns True.

Another more subtle example is

d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

This time the lexsort flag is correct. However, calling sortlevel will not sort the new MultiIndex correctly, that is, t.index.sortlevel(['d', 'a'])[0] returns

MultiIndex([( 3, 0),
            ( 3, 7),
            ( 4, 1),
            (10, 2),
            ( 0, 3),
            ( 1, 4),
            ( 2, 5),
            ( 5, 6)],
           names=['d', 'a'])
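
One way to check the real ordering without relying on the lexsort flag or on sortlevel is to compare the index tuples directly, and then rebuild the index from the sorted tuples. The following is a minimal sketch of that workaround, not a fix from the issue thread; the names `tuples`, `is_really_sorted`, `sorted_index`, and `t_sorted` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

d = [3, 4, 10, 0, 1, 2, 5, 3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

# Compare the index entries as plain Python tuples, which bypasses
# the MultiIndex codes entirely.
tuples = list(t.index)
is_really_sorted = tuples == sorted(tuples)
print(is_really_sorted)

# Rebuild a fresh MultiIndex from the sorted tuples (hypothetical
# workaround) and reorder the rows of t to match it.
sorted_index = pd.MultiIndex.from_tuples(sorted(tuples), names=t.index.names)
t_sorted = t.reindex(sorted_index)
print(t_sorted.index)
```

Because the new index is built from the values rather than from the existing codes, it does not inherit the ordering produced by the unsorted factorization.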

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: 0.8.1
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.45.1

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:8 (6 by maintainers)

Top GitHub Comments

1 reaction
MarcoGorelli commented, Feb 27, 2020

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn’t the current result correct though?

In [1]: import pandas as pd; import numpy as np

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14'])
   ...: a = np.arange(len(d))
   ...: b = np.random.rand(len(d))
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])

In [3]: df.unstack('d').stack().index
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first level ('a'), so it is correct to say that it is lexically sorted.
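
To make the lexical-ordering argument concrete, here is a small sketch that compares the index entries as plain tuples. It rebuilds the first four entries of the index shown above with MultiIndex.from_arrays; the variable names are chosen here for illustration. Tuples are compared element by element, so when the first level is strictly increasing, the second level never affects the result.

```python
import pandas as pd

# First level 'a' is increasing; second level 'd' is not.
a = [0, 1, 2, 3]
d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04'])
idx = pd.MultiIndex.from_arrays([a, d], names=['a', 'd'])

tuples = list(idx)
# Lexical order is decided by 'a' alone because its values are distinct.
lexically_sorted = tuples == sorted(tuples)
# Level 'd' taken on its own is not in increasing order.
d_sorted_alone = list(d) == sorted(d)
print(lexically_sorted, d_sorted_alone)
```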

1 reaction
MarcoGorelli commented, Feb 26, 2020

(just some notes, will come back to this)

is_lexsorted assumes the order of .codes reflects the order of the elements in the index. But .codes is actually determined by

                codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)

in pandas/core/groupby/grouper.py, where self.sort may be False
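
The effect of that flag can be seen with the public pd.factorize function. This is a sketch for illustration: the groupby code calls the internal algorithms.factorize, and it is assumed here that the public function behaves the same way for this input. The data is the d column from the second example in the issue.

```python
import numpy as np
import pandas as pd

d = np.array([3, 4, 10, 0, 1, 2, 5, 3])

# sort=False: uniques are kept in order of first appearance, so the
# integer codes do NOT reflect the ordering of the values.
codes_unsorted, uniques_unsorted = pd.factorize(d, sort=False)
print(codes_unsorted.tolist())    # [0, 1, 2, 3, 4, 5, 6, 0]
print(uniques_unsorted.tolist())  # [3, 4, 10, 0, 1, 2, 5]

# sort=True: uniques are sorted, so comparing codes is equivalent to
# comparing the underlying values.
codes_sorted, uniques_sorted = pd.factorize(d, sort=True)
print(codes_sorted.tolist())      # [3, 4, 6, 0, 1, 2, 5, 3]
print(uniques_sorted.tolist())    # [0, 1, 2, 3, 4, 5, 10]
```

With sort=False the codes 0, 1, 2, 3 correspond to the values 3, 4, 10, 0, so any check that looks only at the codes would see an increasing sequence even though the values are not increasing.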


