Grouped dataframe "name" attribute overrides column access / not well documented

Code Sample

>>> import pandas as pd
>>> df = pd.DataFrame({'val': [9, 10, 3, 6, 2, 3], 'name': list('xxyxyy'), 'group': list('aaabbb')})
>>> df

	val	name	group
0	9	x	a
1	10	x	a
2	3	y	a
3	6	x	b
4	2	y	b
5	3	y	b

Works correctly:

>>> df.groupby('group').apply(lambda g: g[g['name'] == 'x'])

		val	name	group
group				
a	0	9	x	a
	1	10	x	a
b	3	6	x	b

Errors out:

>>> df.groupby('group').apply(lambda g: g[g.name == 'x'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

Rest of the traceback is here:

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2655             try:
-> 2656                 return self._engine.get_loc(key)
   2657             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

~/venv/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

<ipython-input-282-522c70a9fa21> in <lambda>(g)
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657             except KeyError:
-> 2658                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2659         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2655             try:
-> 2656                 return self._engine.get_loc(key)
   2657             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-282-522c70a9fa21> in <module>
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    699 
    700                 with _group_selection_context(self):
--> 701                     return self._python_apply_general(f)
    702 
    703         return result

~/venv/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    705     def _python_apply_general(self, f):
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 
    709         return self._wrap_applied_output(

~/venv/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    188             # group might be modified
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):
    192                 mutated = True

<ipython-input-282-522c70a9fa21> in <lambda>(g)
----> 1 df.groupby('group').apply(lambda g: g[g.name == 'x'])

~/venv/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656                 return self._engine.get_loc(key)
   2657             except KeyError:
-> 2658                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2659         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2660         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

Problem description

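When you call a groupby method that takes a function, such as apply(), the documentation states that the function will receive a DataFrame. Inspecting the type confirms that a DataFrame is passed. However, this DataFrame has an extra attribute, name, which holds the group name and shadows dot access to a column called "name" if one exists. In the example above, g.name == 'x' therefore compares the group name with 'x' and yields the single value False rather than a boolean mask, so g[False] is treated as a column lookup and raises KeyError: False. This is troubling because dotted access works outside the groupby but not inside the function passed to it.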
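The shadowing can be reproduced without pandas. The sketch below is a toy model, not pandas' actual implementation: `ToyFrame` is a hypothetical class whose dot access falls back to column lookup, and the `object.__setattr__` call stands in for the step that pins the group name onto the group frame.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame: columns are reachable by [] or by dot access."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        return self._columns[key]

    def __getattr__(self, attr):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so a real instance attribute always wins over a column.
        try:
            return self.__dict__["_columns"][attr]
        except KeyError:
            raise AttributeError(attr) from None


frame = ToyFrame({"name": ["x", "x", "y"], "val": [9, 10, 3]})
before = frame.name  # no attribute called "name" yet: resolves to the column

# Assumed step, mimicking what the issue describes: the group name is
# attached to the object as a plain attribute.
object.__setattr__(frame, "name", "a")

after_dot = frame.name         # now the pinned group name, not the column
after_brackets = frame["name"]  # bracket access still reaches the column

mask = frame.name == "x"  # a single bool, not a per-row mask
try:
    frame[mask]  # looks for a column literally named False
except KeyError as exc:
    error = repr(exc)
```

In this toy model the failure has the same shape as the traceback above: the comparison collapses to `False`, and the lookup then fails with a `KeyError` for the key `False`.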

First, this needs to be documented more explicitly. I once found a mention that the .name attribute gets added in the documentation for one of the related functions, but I cannot find it again, so I’m not sure which one it was. Edit: It was the transform() method’s docstring, as shown in a comment below.

Related

https://github.com/pandas-dev/pandas/issues/9545

Expected Output

The expected output is what happens when you use the [] indexer, instead of dot access.
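As a minimal sketch of the bracket-based form, the snippet below rebuilds the sample frame and filters each group with `g["name"]`. The function name `keep_x` is illustrative only, and selecting `["val", "name"]` before `apply()` is an assumption made here so the result does not depend on whether the installed pandas version passes the grouping column to the function.

```python
import pandas as pd

df = pd.DataFrame({
    "val": [9, 10, 3, 6, 2, 3],
    "name": list("xxyxyy"),
    "group": list("aaabbb"),
})


def keep_x(g):
    # Bracket access always resolves to the column, even when an
    # attribute called "name" has been attached to the group frame.
    return g[g["name"] == "x"]


# Explicit column selection keeps the example independent of how the
# grouping column is handled inside apply().
result = df.groupby("group")[["val", "name"]].apply(keep_x)
print(result)
```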

Suggestions

Not necessarily mutually exclusive:

Give a better error message so that it’s known .name is an attribute of the group DataFrame
Improve the documentation of all related methods so that users know the .name attribute exists and will override dot access to a column of the same name
Only add the .name attribute if a column of that name doesn’t exist
Append an underscore to the attribute so there’s a much lower chance of conflicts, e.g. groupdf.name_
Change the attribute name entirely for similar lower chance of conflicts, e.g. groupdf.grouped_value
Move into a method call instead, e.g. groupdf.get_group_name()
Add a kwarg to apply() / transform() which toggles whether to send a second argument into the function, that second argument being the group name
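The last suggestion can be approximated today without any new keyword argument, because iterating over a groupby object yields the group name alongside each group. The sketch below assumes a hypothetical helper, `apply_with_key`, which is not part of pandas; it passes the name as an explicit second argument so the function never needs the `.name` attribute.

```python
import pandas as pd

df = pd.DataFrame({
    "val": [9, 10, 3, 6, 2, 3],
    "name": list("xxyxyy"),
    "group": list("aaabbb"),
})


def apply_with_key(frame, by, func):
    """Hypothetical helper: call func(group, key) per group and stack the results."""
    pieces = {key: func(group, key) for key, group in frame.groupby(by)}
    # The dict keys become the outer level of the resulting index.
    return pd.concat(pieces)


seen_keys = []


def keep_x(group, key):
    # The group name arrives as an argument, so there is no attribute
    # that could shadow the "name" column.
    seen_keys.append(key)
    return group[group["name"] == "x"]


result = apply_with_key(df, "group", keep_x)
print(result)
```

The trade-off of this approach is that it bypasses `apply()` entirely, so any post-processing `apply()` performs on the combined result is not reproduced.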

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.0.beta.4
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.18
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

State:
Created 5 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction

alkasm commented, Feb 27, 2019

> It is a documented limitation of attribute access that things won’t work if it conflicts with an existing attribute name

The problem IMO is that it isn’t an existing attribute name until it gets passed into the function. It’s an undocumented attribute of a dataframe that gets “magically” added. It’s not part of the dataframe interface and isn’t something that exists with dir() or help() or whatever on pd.DataFrame.

I understand that dotted access is syntactic sugar for quicker interactive work, but I’m more concerned with something working outside of a groupby function and then not working inside the function that acts on each group. The API just suddenly changes in one part of a pipeline.

If this is something that happens often in Pandas (adding new attributes to dataframes when passing them around), then I guess this is something that just needs to be documented better. But if it’s a one-off case, I believe it’s worth re-thinking the necessity of adding/overwriting an attribute.

0 reactions

alkasm commented, Feb 27, 2019

> Interestingly enough that docstring you mention isn’t part of the API.

What do you mean? The df.groupby('column') object is of type pandas.core.groupby.groupby.DataFrameGroupBy, so I pulled the docstring from that class’s transform method.

There’s a similar thing also at Line 481 in the NDFrameGroupBy (which is subclassed by DataFrameGroupBy):

https://github.com/pandas-dev/pandas/blob/fe1654faa86836a0007bb513504e57c5c9935b8b/pandas/core/groupby/generic.py#L480-L481
