memory regression in grouping by categorical variables
There seems to be a regression when grouping by categorical columns. The one-year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError.
memory_usage(deep=True) reports the size of the data frame as 524 MB, while my machine has 125 GB, so memory should not be an issue.
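For reference, a minimal sketch of how a figure like the one above can be obtained. The frame below is a small hypothetical stand-in, not the 1e7-row frame from the report, so the number it prints will differ.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical; the report uses N = 1e7 rows).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "key": pd.Categorical(rng.choice(["id001", "id002", "id003"], 1000)),
    "val": rng.random(1000),
})

# deep=True also counts the Python objects behind object/categorical columns.
per_column_bytes = demo.memory_usage(deep=True)
total_mb = per_column_bytes.sum() / 1024**2
print(per_column_bytes)
print(f"total: {total_mb:.3f} MB")
```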
Input
import pandas as pd
import numpy as np
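def randChar(f, numGrp, N):
    # Build numGrp formatted labels, then sample N of them with replacement.
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
def randFloat(numGrp, N):
    # Build numGrp random floats, then sample N of them with replacement.
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]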
N = int(1e7)
K = 100
x = pd.DataFrame({
    'id1' : randChar("id%03d", K, N),       # large groups (char)
    'id2' : randChar("id%03d", K, N),       # large groups (char)
    'id3' : randChar("id%010d", N//K, N),   # small groups (char)
    'id4' : np.random.choice(K, N),         # large groups (int)
    'id5' : np.random.choice(K, N),         # large groups (int)
    'id6' : np.random.choice(N//K, N),      # small groups (int)
    'v1' : np.random.choice(5, N),          # int in range [0,4]
    'v2' : np.random.choice(5, N),          # int in range [0,4]
    'v3' : randFloat(100,N)                 # numeric e.g. 23.5749
})
x['id1'] = x['id1'].astype('category')
x['id2'] = x['id2'].astype('category')
x['id3'] = x['id3'].astype('category')
import os
import gc
import timeit
print(pd.__version__)
gc.collect()
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
print(t)
Output
1.0.3
print(pd.__version__)
#1.0.3
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 928, in aggregate
#    result, how = self._aggregate(func, *args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 419, in _aggregate
#    result = _agg(arg, _agg_1dim)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 386, in _agg
#    result[fname] = func(fname, agg_how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/base.py", line 370, in _agg_1dim
#    return colg.aggregate(how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 247, in aggregate
#    return getattr(self, func)(*args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1371, in f
#    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
#    return self._wrap_aggregated_output(output)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
#    return self._reindex_output(result)._convert(datetime=True)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2483, in _reindex_output
#    levels_list, names=self.grouper.names
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 552, in from_product
#    codes = cartesian_product(codes)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in cartesian_product
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/pandas/core/reshape/util.py", line 58, in <listcomp>
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 445, in repeat
#    return _wrapfunc(a, 'repeat', repeats, axis=axis)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 51, in _wrapfunc
#    return getattr(obj, method)(*args, **kwds)
#MemoryError
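The traceback ends in _reindex_output calling MultiIndex.from_product, then cartesian_product and numpy's repeat, which suggests the result is being expanded to every combination of group levels rather than only the combinations present in the data. A rough back-of-the-envelope sketch of how large that product could be, using the level counts from the reproduction script. Whether all six keys or only the three categorical ones enter the product is an assumption here, so both are computed, and the level counts are upper bounds because the values are sampled at random.

```python
import math

# Upper bound on distinct values per grouping key, from the reproduction
# script (K = 100, N = 1e7, so N // K = 100_000).
N, K = int(1e7), 100
levels = {"id1": K, "id2": K, "id3": N // K, "id4": K, "id5": K, "id6": N // K}

# Assumption: the cartesian product spans every grouping key.
full_product = math.prod(levels.values())
# Assumption: the product spans only the three categorical keys.
categorical_product = math.prod(levels[k] for k in ("id1", "id2", "id3"))

print(f"all keys: {full_product:.1e} combinations")
print(f"categorical keys only: {categorical_product:.1e} combinations")
print(f"rows actually present: at most {N:.1e}")
```

Under either assumption the product is orders of magnitude larger than the 1e7 rows in the frame, which would be consistent with a MemoryError on a 125 GB machine even though the input is only 524 MB.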
0.24.2
print(pd.__version__)
#0.24.2
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
#(10000000, 8)
t = timeit.default_timer() - t_start
print(t)
#9.52401043381542
Top GitHub Comments
@jangorecki your usage of observed in https://github.com/pandas-dev/pandas/issues/32918#issuecomment-603748383 wasn't quite right. It should be passed to groupby, not agg. Can you try this?
Contributions are welcome. PRs are welcome.
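A minimal sketch of what passing observed to groupby rather than to agg looks like. The frame and its column names are hypothetical and kept tiny on purpose; this is not the commenter's original snippet.

```python
import pandas as pd

# Tiny hypothetical frame: two categorical keys with 3 categories each,
# but only 2 key combinations actually occur in the data.
df = pd.DataFrame({
    "a": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "b": pd.Categorical(["p", "q"], categories=["p", "q", "r"]),
    "v": [1.0, 2.0],
})

# observed=False materialises all 3 * 3 category combinations.
all_combos = df.groupby(["a", "b"], observed=False).agg({"v": "sum"})
# observed=True keeps only the combinations present in the data.
seen_only = df.groupby(["a", "b"], observed=True).agg({"v": "sum"})

print(len(all_combos), len(seen_only))
```

Applied to the reproduction above, the equivalent call would be x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'}).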