ENH: groupby.max() should not cast int to int64 but keep original data type
Is your feature request related to a problem?
In pandas version 1.2.5, calling groupby.max() on a large matrix of 0/1 values stored as int8 casts the DataFrame to int64, resulting in:
MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64
Traceback:
/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
143 try:
144 # error: Unexpected keyword argument "casting" for "astype"
--> 145 return arr.astype("int64", copy=copy, casting="safe") # type: ignore[call-arg]
146 except TypeError:
147 pass
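The size in the error message follows directly from the array shape and the element width. As a quick sanity check (plain arithmetic, nothing pandas-specific; the helper name `array_gib` is made up for this illustration), the same matrix needs roughly eight times less memory as int8:

```python
# Shape taken from the MemoryError message above.
rows, cols = 1915674, 5356
GIB = 2**30  # bytes per GiB


def array_gib(n_rows, n_cols, itemsize):
    """Size in GiB of a dense n_rows x n_cols array with `itemsize` bytes per element."""
    return n_rows * n_cols * itemsize / GIB


int8_gib = array_gib(rows, cols, 1)   # int8: 1 byte per element
int64_gib = array_gib(rows, cols, 8)  # int64: 8 bytes per element

print(f"int8:  {int8_gib:.1f} GiB")   # int8:  9.6 GiB
print(f"int64: {int64_gib:.1f} GiB")  # int64: 76.4 GiB
```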
Describe the solution you’d like
Keep the original datatype, in this case int8.
Issue Analytics
- Created 2 years ago
- Comments: 8 (3 by maintainers)

Yes, I know that it will take time, but I have strong knowledge of C and Cython, so I think that with time I can do it.
Thank you for the info. I’m going to review it and get an overall idea of how everything is connected.
@arubiales In my understanding, the result is recast to the original data type later on; the conversion to float is only intermediate. It can still cause memory allocation errors, though: in my example, memory grows from 10GB to 70GB.
I’d have to run an example through the debugger to see where the recasting to int8 happens.
If you check memory consumption while running the example on a larger DataFrame, you should see memory increase during processing, and the final result will be smaller again due to the recasting to int8. Basically, an inverted V shape in memory usage.
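If the intermediate upcast is what exhausts memory, one possible workaround is to aggregate a few columns at a time, so that any temporary 8-byte copy only covers one chunk rather than the whole matrix. This is a rough sketch and not something proposed in this thread: `chunked_groupby_max`, `chunk_size`, and the toy data are made up for illustration, and whether it actually lowers peak memory depends on the pandas version.

```python
import numpy as np
import pandas as pd


def chunked_groupby_max(df, keys, chunk_size=256):
    """Hypothetical helper: groupby(keys).max() computed one column chunk at a time."""
    parts = []
    for start in range(0, df.shape[1], chunk_size):
        # Only this slice of columns is handed to groupby in each iteration.
        block = df.iloc[:, start:start + chunk_size]
        parts.append(block.groupby(keys).max())
    # Every part shares the same group index, so the parts align column-wise.
    return pd.concat(parts, axis=1)


# Small stand-in for the large 0/1 int8 matrix from the issue.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1000, 20), dtype=np.int8))
keys = pd.Series(rng.integers(0, 5, size=1000), index=df.index)

result = chunked_groupby_max(df, keys, chunk_size=8)
print(result.dtypes.unique())
```

A smaller `chunk_size` bounds the temporary allocation more tightly at the cost of more groupby calls.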