
ValueError when grouping with max/min as aggregation functions (pandas-1.0.1)

See original GitHub issue

Code Sample

import pandas as pd # 1.0.1
import numpy as np # 1.18.1

# Simple test case that fails
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat',np.nan,np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Traceback (most recent call last):
  line 17, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1378, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1004, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only, min_count=min_count

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1099, in _cython_agg_blocks
    agg_block: Block = block.make_block(result)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 273, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3041, in make_block
    return klass(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2589, in __init__
    super().__init__(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 125, in __init__
    f"Wrong number of items passed {len(self.values)}, "

ValueError: Wrong number of items passed 1, placement implies 2
-------------

# Add one more legitimate string value to the 'bad_string' column and it works
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat','dog',np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Problem description

Unless I’ve misunderstood something fundamental about the max and min aggregation functions, I don’t think they should error out when a Series in the DataFrame is of object dtype and contains all but one NaN value. Note from the example above that adding just one more non-NaN value to the offending Series avoids the ValueError.
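In the meantime, one way to sidestep the error (a workaround sketch, not an official fix) is to leave the mostly-NaN object column out of the groupby entirely and aggregate only the columns known to behave:

```python
import numpy as np
import pandas as pd

df_simple = pd.DataFrame({'key': [1, 1, 2, 2, 3, 3],
                          'data': [10, 20, 30, 40, 50, 60],
                          'good_string': ['cat', 'dog', 'cat', 'dog', 'fish', 'pig'],
                          'bad_string': ['cat', np.nan, np.nan, np.nan, np.nan, np.nan]})

# Drop the offending 'bad_string' column before grouping; the remaining
# columns aggregate without hitting the block-placement ValueError.
safe_cols = ['key', 'data', 'good_string']
df_simple_max = df_simple[safe_cols].groupby('key', as_index=False, sort=False).max()
print(df_simple_max)
```

The cost is that `bad_string` is simply absent from the result, which is acceptable here since its per-group max would be NaN for every group but one anyway.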

Expected Output

No ValueError; groupby object returned.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.2.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.6.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.4.1
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.43.1

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
dz0 commented, Jun 2, 2021

I am still getting this in 1.2.4

$ pip show pandas
Name: pandas
Version: 1.2.4

Traceback for the same example:

  File "/home/jurgis/.config/JetBrains/PyCharm2020.3/scratches/scratch_76.py", line 9, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1676, in max
    return self._agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1024, in _agg_general
    result = self._cython_agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1015, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1118, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 380, in apply
    return self._split_op_result(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 416, in _split_op_result
    result = self.make_block(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 286, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2742, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 142, in __init__
    raise ValueError(
ValueError: Wrong number of items passed 1, placement implies 2
0 reactions
mroeschke commented, Dec 28, 2021

Actually, given the new deprecation (non-numeric columns need to be de-selected before calling max), this behavior will be removed in 2.0, and aggregating on only numeric columns has good test coverage. Closing as this behavior is deprecated.
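For readers landing here on a recent pandas, the pattern that the deprecation asks for looks roughly like this (a sketch; the `numeric_only` keyword and `select_dtypes` selection shown are standard pandas API, but check the docs for your exact version):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2, 2, 3, 3],
                   'data': [10, 20, 30, 40, 50, 60],
                   'bad_string': ['cat', np.nan, np.nan, np.nan, np.nan, np.nan]})

# Option 1: ask the aggregation itself to skip non-numeric columns.
out1 = df.groupby('key', as_index=False, sort=False).max(numeric_only=True)

# Option 2: de-select non-numeric columns up front ('key' is numeric,
# so it survives the dtype filter and can still be grouped on).
out2 = df.select_dtypes(include='number').groupby('key', as_index=False, sort=False).max()
```

Both options produce the same numeric-only result; the object column never reaches the cython aggregation path, so the ValueError cannot occur.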


Top Results From Across the Web

Group By: split-apply-combine — pandas 1.0.3 documentation
Aggregation : compute a summary statistic (or statistics) for each group. ... both a column name and an index level name, a ValueError...
Pandas Dataframe groupby aggregate functions and ...
This is because by adding in another aggregator, you're asking pandas to find the min and max twice for each group. Once for...
Source code for pyspark.pandas.groupby - Apache Spark
A wrapper for GroupedData to behave similar to pandas GroupBy. ... It can also be used when applying multiple aggregation functions to specific...
GroupBy One Column and Get Mean, Min, and Max values
We can use Groupby function to split dataframe into groups and apply different operations on it. One of them is Aggregation.
Data Analysis with Python
In NumPy, you could choose to operate on the entire array, or a particular axis with the keyword argument axis=n . NumPy's Aggregate...
