
memory regression in grouping by categorical variables

See original GitHub issue

There seems to be a regression when grouping by categorical columns. The year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError.

memory_usage(deep=True) reports the data frame's size as 524 MB, while my machine has 125 GB of RAM, so memory should not be an issue.
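The size measurement above can be sketched on a small frame (a minimal sketch with hypothetical data, not the 1e7-row repro below). It also shows why the reported frame stays compact: a categorical column stores each label once and keeps only small integer codes per row.

```python
import pandas as pd

# Hypothetical small column mirroring the id1 pattern: 100 distinct labels
# repeated over 10,000 rows.
s_obj = pd.Series([f"id{i % 100:03d}" for i in range(10_000)])
s_cat = s_obj.astype("category")

# deep=True counts the actual string payloads, not just pointer sizes.
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: the categorical stores each label once
```

So the 524 MB figure is plausible even for 10 million rows, which makes the MemoryError during the groupby all the more surprising.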

Input

import pandas as pd
import numpy as np

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
x = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)
  'v1' :  np.random.choice(5, N),         # int in range [0,4]
  'v2' :  np.random.choice(5, N),         # int in range [0,4]
  'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
})
x['id1'] = x['id1'].astype('category')
x['id2'] = x['id2'].astype('category')
x['id3'] = x['id3'].astype('category')

import os
import gc
import timeit

print(pd.__version__)

gc.collect()
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
print(t)

Output

1.0.3

print(pd.__version__)
#1.0.3 
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 928, in aggregate
#    result, how = self._aggregate(func, *args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 419, in _aggregate
#    result = _agg(arg, _agg_1dim)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 386, in _agg
#    result[fname] = func(fname, agg_how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 370, in _agg_1dim
#    return colg.aggregate(how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 247, in aggregate
#    return getattr(self, func)(*args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1371, in f
#    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
#    return self._wrap_aggregated_output(output)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
#    return self._reindex_output(result)._convert(datetime=True)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2483, in _reindex_output
#    levels_list, names=self.grouper.names
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 552, in from_product
#    codes = cartesian_product(codes)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in cartesian_product
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in <listcomp>
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 445, in repeat
#    return _wrapfunc(a, 'repeat', repeats, axis=axis)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 51, in _wrapfunc
#    return getattr(obj, method)(*args, **kwds)
#MemoryError
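The traceback ends in cartesian_product inside _reindex_output, which hints at the cause: with categorical group keys and the default observed=False, pandas tries to materialize every combination of grouping levels, not just the ones present in the data. A rough back-of-the-envelope (assuming, as the traceback suggests, that the reindex expands every grouping level) shows why this cannot fit in memory:

```python
# Level counts matching the repro: K=100 labels for id1/id2/id4/id5,
# N//K = 100,000 labels for id3/id6.
K, N = 100, int(1e7)

# Full cartesian product of all six grouping levels.
combos = K * K * (N // K) * K * K * (N // K)
print(f"{combos:.2e}")  # ~1e18 rows pandas would try to materialize
```

Even at a few bytes per row code, 1e18 combinations is far beyond 125 GB, which matches the MemoryError despite the modest 524 MB input.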

0.24.2

print(pd.__version__)
#0.24.2
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
#(10000000, 8)
t = timeit.default_timer() - t_start
print(t)
#9.52401043381542

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:9 (5 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, May 19, 2020

@jangorecki your usage of observed in https://github.com/pandas-dev/pandas/issues/32918#issuecomment-603748383 wasn’t quite right. It should be passed to groupby, not agg. Can you try this?

ans = x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'})
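The effect of observed=True can be seen directly on a small frame (a minimal sketch, not the original 1e7-row repro): with the default observed=False, the result carries a row for every category, used or not.

```python
import pandas as pd

# Minimal sketch: a categorical key with an unused category "c".
df = pd.DataFrame({
    "k": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "v": [1, 2, 3],
})

# Default behaviour emits a row for every category, including empty "c".
full = df.groupby("k", observed=False)["v"].sum()
# observed=True keeps only the combinations that actually occur.
seen = df.groupby("k", observed=True)["v"].sum()

print(len(full), len(seen))  # 3 2
```

With six grouping keys the unused combinations multiply, which is why observed=True (passed to groupby, not agg) avoids the blowup.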
1 reaction
jreback commented, May 13, 2020

@jreback Assigning an issue to a milestone brings it to volunteers' attention. I have browsed issues assigned to a milestone multiple times just to see if there was anything I could help with. Otherwise the issue is buried among hundreds of others and is likely to be missed. Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

It's tagged contributions welcome. PRs are welcome.
