
memory regression in grouping by categorical variables

See original GitHub issue

There seems to be a regression when grouping by categorical columns. The year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError.

memory_usage(deep=True) reports the data frame's size as 524 MB, while my machine has 125 GB of RAM, so memory should not be an issue.
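The size measurement above can be sketched on a small frame (a minimal sketch with hypothetical data, not the 1e7-row repro below). It also shows why the reported frame stays compact: a categorical column stores each label once and keeps only small integer codes per row.

```python
import pandas as pd

# Hypothetical small column mirroring the id1 pattern: 100 distinct labels
# repeated over 10,000 rows.
s_obj = pd.Series([f"id{i % 100:03d}" for i in range(10_000)])
s_cat = s_obj.astype("category")

# deep=True counts the actual string payloads, not just pointer sizes.
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: the categorical stores each label once
```

So the 524 MB figure is plausible even for 10 million rows, which makes the MemoryError during the groupby all the more surprising.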

Input

import pandas as pd
import numpy as np

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
x = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)
  'v1' :  np.random.choice(5, N),         # int in range [0,4]
  'v2' :  np.random.choice(5, N),         # int in range [0,4]
  'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
})
x['id1'] = x['id1'].astype('category')
x['id2'] = x['id2'].astype('category')
x['id3'] = x['id3'].astype('category')

import os
import gc
import timeit

print(pd.__version__)

gc.collect()
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
print(t)

Output

1.0.3

print(pd.__version__)
#1.0.3 
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 928, in aggregate
#    result, how = self._aggregate(func, *args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 419, in _aggregate
#    result = _agg(arg, _agg_1dim)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 386, in _agg
#    result[fname] = func(fname, agg_how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 370, in _agg_1dim
#    return colg.aggregate(how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 247, in aggregate
#    return getattr(self, func)(*args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1371, in f
#    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
#    return self._wrap_aggregated_output(output)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
#    return self._reindex_output(result)._convert(datetime=True)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2483, in _reindex_output
#    levels_list, names=self.grouper.names
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 552, in from_product
#    codes = cartesian_product(codes)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in cartesian_product
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in <listcomp>
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 445, in repeat
#    return _wrapfunc(a, 'repeat', repeats, axis=axis)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 51, in _wrapfunc
#    return getattr(obj, method)(*args, **kwds)
#MemoryError
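The traceback ends in cartesian_product inside _reindex_output, which hints at the cause: with categorical group keys and the default observed=False, pandas tries to materialize every combination of grouping levels, not just the ones present in the data. A rough back-of-the-envelope (assuming, as the traceback suggests, that the reindex expands every grouping level) shows why this cannot fit in memory:

```python
# Level counts matching the repro: K=100 labels for id1/id2/id4/id5,
# N//K = 100,000 labels for id3/id6.
K, N = 100, int(1e7)

# Full cartesian product of all six grouping levels.
combos = K * K * (N // K) * K * K * (N // K)
print(f"{combos:.2e}")  # ~1e18 rows pandas would try to materialize
```

Even at a few bytes per row code, 1e18 combinations is far beyond 125 GB, which matches the MemoryError despite the modest 524 MB input.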

0.24.2

print(pd.__version__)
#0.24.2
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
#(10000000, 8)
t = timeit.default_timer() - t_start
print(t)
#9.52401043381542

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:9 (5 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, May 19, 2020

@jangorecki your usage of observed in https://github.com/pandas-dev/pandas/issues/32918#issuecomment-603748383 wasn’t quite right. It should be passed to groupby, not agg. Can you try this?

ans = x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'})
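The effect of observed=True can be seen directly on a small frame (a minimal sketch, not the original 1e7-row repro): with the default observed=False, the result carries a row for every category, used or not.

```python
import pandas as pd

# Minimal sketch: a categorical key with an unused category "c".
df = pd.DataFrame({
    "k": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "v": [1, 2, 3],
})

# Default behaviour emits a row for every category, including empty "c".
full = df.groupby("k", observed=False)["v"].sum()
# observed=True keeps only the combinations that actually occur.
seen = df.groupby("k", observed=True)["v"].sum()

print(len(full), len(seen))  # 3 2
```

With six grouping keys the unused combinations multiply, which is why observed=True (passed to groupby, not agg) avoids the blowup.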
1 reaction
jreback commented, May 13, 2020

@jreback Assigning an issue to a milestone brings it to volunteers' attention. I have browsed issues assigned to a milestone multiple times just to see if there was anything I could help with. Otherwise the issue is buried among hundreds of others and is likely to be missed. Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

It's tagged contributions welcome. PRs are welcome.
