GroupBy(axis=1) Does Not Offer Implicit Selection By Column Name(s)
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=[0, 1, 0], columns=[10, 20, 10, 20])
df.index.name = "y"
df.columns.name = "x"
print df
print
print "Grouped along index:"
print df.groupby(by="y").sum()
print
print "Grouped along columns:"
# The following raises a KeyError even though "x" is a column name
# (like "y" above, which is an index name):
df.groupby(by="x", axis=1).sum()
Problem description
The exception at the end is surprising: the intent is clearly to group by columns, on the “x” column label.
Furthermore, the documentation for groupby() seems to confirm this, as it states for the “by” argument that “A str or list of strs may be passed to group by the columns in self”.
Expected Output
A dataframe with index [0, 1, 0] but grouped (and summed) columns [10, 20].
I wasn’t able to test with the latest Pandas version, sorry!
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-642.15.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.1
pytest: 3.2.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.4
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.18
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 4 years ago
- Comments: 5 (5 by maintainers)
I’ve never understood “by” to only be usable for index names, and because groupby allows axis=1, it would make sense to me that “by” follows the supplied axis, and not necessarily only the index.

So I see this as a bug, given that axis=1 should give the same functionality as axis=0, only working over the other axis.

The example that @lebigot shows should return the same as df.T.groupby(by="x").sum().T, that is, a frame with index [0, 1, 0] and summed columns [10, 20].

Seems like a bug in groupby to me.
@lebigot if you’d want to investigate, that’d be great.
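
For anyone hitting the same KeyError, the transpose equivalence mentioned in the comments can be sketched as below. This is only a workaround sketch, not a fix for the reported behaviour: it rebuilds the frame from the report, and it assumes a pandas version that supports grouping on a named index level. It passes level="x" rather than by="x" to make explicit that the grouping key is the axis name, which is a choice made here and not something the thread prescribes.

```python
import numpy as np
import pandas as pd

# Same frame as in the report: duplicate labels on both axes.
df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=[0, 1, 0],
    columns=[10, 20, 10, 20],
)
df.index.name = "y"
df.columns.name = "x"

# Transpose so the "x" labels become the index, group on that named
# index level, sum each group, then transpose back to the original
# orientation (rows labelled by "y", columns labelled by "x").
result = df.T.groupby(level="x").sum().T

print(result)
```

The result keeps the original index [0, 1, 0] and has one column per distinct "x" label, which matches the expected output described in the report.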