API: DataFrameGroupBy column subset selection with single list?

I wouldn’t be surprised if there is already an issue about this, but couldn’t directly find one.

When selecting a subset of columns on a DataFrameGroupBy object, both bare comma-separated keys (which arrive as a tuple inside the __getitem__ [] brackets) and double square brackets (a list inside the __getitem__ [] brackets) seem to work:

In [5]: import numpy as np; import pandas as pd

In [6]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

In [8]: df.groupby('a').sum()
Out[8]: 
    b   c   d
a            
0   0   5   7
3  18   6  12
4  16   6   9
6  10  11  11
9   3   3   0

In [9]: df.groupby('a')['b', 'c'].sum()
Out[9]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

In [10]: df.groupby('a')[['b', 'c']].sum()
Out[10]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

Personally I find this df.groupby('a')['b', 'c'].sum() a bit strange, and inconsistent with how DataFrame indexing works.

Of course, on a DataFrameGroupBy you don’t have the possible confusion with indexing multiple dimensions (rows, columns), but still.
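
To make the inconsistency concrete, here is a minimal sketch on a plain DataFrame. The frame and its values are made up for illustration. The tuple form on the grouped object is left out on purpose, because as far as I know newer pandas releases warn about it or reject it.

```python
import pandas as pd

# Small deterministic frame; the values are invented for this example.
df = pd.DataFrame({"a": [0, 0, 1], "b": [1, 2, 3], "c": [4, 5, 6], "d": [7, 8, 9]})

# A list inside the brackets selects several columns.
subset = df[["b", "c"]]
print(list(subset.columns))

# Bare keys arrive as the single tuple ('b', 'c'). With flat column labels
# there is no column with that name, so the lookup raises KeyError.
try:
    df["b", "c"]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True
print(tuple_lookup_failed)

# The list form behaves the same way on the grouped object.
grouped_sum = df.groupby("a")[["b", "c"]].sum()
print(grouped_sum)
```

So on a DataFrame only the list form selects columns, which is why accepting the tuple form on a DataFrameGroupBy looks out of place.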

cc @jreback @WillAyd

Issue Analytics

Created: 5 years ago
Reactions: 1
Comments: 18 (16 by maintainers)

Top GitHub Comments

jreback commented, Dec 29, 2019 (1 reaction)

@yehoshuadimarsky we have 3000+ issues and constant comments - to be honest we barely have time to triage on the PRs

even really important things are not necessarily discussed at length

just like everyone else has limited time - the best way to prompt a discussion is to push a change

yehoshuadimarsky commented, Dec 25, 2019 (1 reaction)

So this is my first time working on pandas code, and I’m a little confused here, so please bear with me. I’m also new to linking to code on GitHub.

As I understand it, when you index an object with brackets and pass in several comma-separated keys, Python implicitly packs them into a single tuple and hands that one key to __getitem__. So df['a','b'] is really df[('a','b')] under the hood.
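
A quick way to check this is a throwaway class whose __getitem__ simply returns the key it receives. `ShowKey` is a hypothetical helper for this sketch, not anything from pandas.

```python
class ShowKey:
    """Hypothetical helper: returns whatever key Python hands to __getitem__."""

    def __getitem__(self, key):
        return key


probe = ShowKey()

bare = probe["a", "b"]        # bare comma-separated keys
explicit = probe[("a", "b")]  # an explicit tuple
as_list = probe[["a", "b"]]   # a list inside the brackets
single = probe["a"]           # a single key is passed through unchanged

print(type(bare).__name__, bare)
print(bare == explicit)
print(type(as_list).__name__, as_list)
print(type(single).__name__, single)
```

The first two forms are indistinguishable inside __getitem__, so the receiving class can only tell a tuple from a list, not bare keys from an explicit tuple.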

I’m having trouble tracing the code path to figure out where exactly the __getitem__ on the GroupBy is actually implemented:

1. DataFrame.groupby is called on the superclass NDFrame here
2. This eventually creates the specific DataFrameGroupBy object here
3. Which is a subclass of GroupBy
4. Which is a subclass of _GroupBy
5. Which has the mixin named SelectionMixin, defined here
6. Which implements __getitem__ here
7. Which, if the key is a list or tuple, returns self._gotitem(list(key), ndim=2)
8. self._gotitem needs to be implemented by the respective subclasses, which in this case is the DataFrameGroupBy object, and is implemented here
9. But all this does is simply create an instance of itself (DataFrameGroupBy) with the key (a list/tuple) passed as a slice to the selection parameter
10. The selection parameter is implemented in the parent _GroupBy object, which sets the internal self._selection attribute to the key here

This is where I’m lost. How does this actually slice the object and only return a subset of it?
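
One plausible reading of the chain above is that nothing is sliced at selection time: the key is only remembered, and it is consulted later when a result is computed. The sketch below is a toy model of that pattern in plain Python, not pandas' implementation. `ToyGroupBy` and `_selected_names` are hypothetical names, and single keys are simplified to a one-element list. I believe pandas does the analogous step through a property on the selection mixin (something like `_selected_obj`), but treat that name as an assumption to verify against the source.

```python
from collections import defaultdict


class ToyGroupBy:
    """Toy stand-in for a grouped table; NOT pandas' implementation."""

    def __init__(self, columns, by, selection=None):
        self.columns = columns       # dict: column name -> list of values
        self.by = by                 # name of the grouping column
        self._selection = selection  # only remembered; nothing is sliced yet

    def __getitem__(self, key):
        # A tuple (bare keys) and a list both end up as a list of names.
        if isinstance(key, (list, tuple)):
            return self._gotitem(list(key))
        return self._gotitem([key])  # simplification: no 1-D variant here

    def _gotitem(self, key):
        # Same data, same grouping column; only the stored selection differs.
        return ToyGroupBy(self.columns, self.by, selection=key)

    def _selected_names(self):
        # The stored selection is applied lazily, when a result is needed.
        if self._selection is not None:
            return self._selection
        return [name for name in self.columns if name != self.by]

    def sum(self):
        totals = {}
        for name in self._selected_names():
            per_group = defaultdict(int)
            for group, value in zip(self.columns[self.by], self.columns[name]):
                per_group[group] += value
            totals[name] = dict(per_group)
        return totals


data = {"a": [0, 0, 1], "b": [1, 2, 3], "c": [4, 5, 6], "d": [7, 8, 9]}
grouped = ToyGroupBy(data, by="a")

selected_tuple = grouped["b", "c"].sum()    # tuple key
selected_list = grouped[["b", "c"]].sum()   # list key
everything = grouped.sum()                  # no selection stored

print(selected_tuple)
print(selected_list == selected_tuple)
print(sorted(everything))
```

In this model the object returned by [] still holds the full data; the subset only shows up because sum() asks `_selected_names()` which columns to aggregate.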