ENH: groupby.max() should not cast int to int64 but keep original data type


Is your feature request related to a problem?

In pandas 1.2.5, calling groupby.max() on a large int8 DataFrame of 0/1 values casts the data to int64, which fails with:

MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64

Traceback:

/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
    143     try:
    144         # error: Unexpected keyword argument "casting" for "astype"
--> 145         return arr.astype("int64", copy=copy, casting="safe")  # type: ignore[call-arg]
    146     except TypeError:
    147         pass
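
The size in the error message follows directly from the array shape and the item size of the dtype. A quick check of that arithmetic (the helper name `size_in_gib` is made up for illustration; the shape is taken from the error above):

```python
import numpy as np

# Shape reported in the MemoryError above.
rows, cols = 1_915_674, 5_356


def size_in_gib(dtype):
    """Size of a dense (rows, cols) array of the given dtype, in GiB."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30


print(f"int8:  {size_in_gib('int8'):.1f} GiB")   # the original data
print(f"int64: {size_in_gib('int64'):.1f} GiB")  # the intermediate copy
```

The int64 copy is eight times the size of the int8 input, which is why the cast alone is enough to exhaust memory.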

Describe the solution you’d like

Keep the original datatype, in this case int8.
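
Until the cast is avoided inside pandas, one possible workaround (not proposed in the issue itself) is to aggregate a few columns at a time, so that any intermediate copy only covers a slice of the frame. A minimal sketch, assuming the group keys are available as an array aligned with the rows; the function name `groupby_max_in_column_chunks` and the `chunk_size` parameter are hypothetical:

```python
import numpy as np
import pandas as pd


def groupby_max_in_column_chunks(df, keys, chunk_size=256):
    """Compute df.groupby(keys).max() one block of columns at a time.

    Any temporary wider-dtype copy made during aggregation is then limited
    to `chunk_size` columns instead of the whole frame.
    """
    pieces = []
    for start in range(0, df.shape[1], chunk_size):
        block = df.iloc[:, start:start + chunk_size]
        pieces.append(block.groupby(keys).max())
    # Every piece is indexed by the same sorted group labels,
    # so concatenating along columns lines the rows up.
    return pd.concat(pieces, axis=1)


# Small int8 frame of 0/1 values, standing in for the large matrix.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1000, 10), dtype=np.int8))
keys = rng.integers(0, 20, size=len(df))

result = groupby_max_in_column_chunks(df, keys, chunk_size=3)
print(result.dtypes.unique())
```

This trades some speed for a lower peak, since the grouping is recomputed for every chunk. A smaller `chunk_size` lowers the peak further at the cost of more passes.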

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

3 reactions
arubiales commented, Jul 1, 2021

Yes, I know that it will take time, but I have strong knowledge of C and Cython, so I think that with time I will manage it.

Thank you for the info. I’m going to review it and get an overall idea of how everything is connected.

1 reaction
rd-andreas-lay commented, Jul 16, 2021

@arubiales In my understanding, the result is recast to the original data type later on; the conversion to float is only intermediate (though it can still cause memory allocation errors: in my example, an increase from 10GB to 70GB).

I’d have to run an example through the debugger though to see where the re-casting to int8 happens.

If you check your memory consumption while running the example on a larger dataframe, you should see memory increase during processing; the final result will be smaller again due to the recasting to int8. Basically, memory usage follows an inverted V shape.
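
One way to check this without a debugger is to compare the peak traced memory during the call with what remains allocated afterwards. A rough sketch using the standard-library `tracemalloc` module on a small made-up frame; the sizes and variable names are illustrative, and whether a spike shows up at all depends on the pandas version installed:

```python
import tracemalloc

import numpy as np
import pandas as pd

# Small int8 frame of 0/1 values; scale the shape up to see a clearer effect.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(20_000, 50), dtype=np.int8))
keys = rng.integers(0, 100, size=len(df))

tracemalloc.start()
result = df.groupby(keys).max()
# `current` is what is still allocated after the call,
# `peak` is the high-water mark reached while it ran.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"input size:     {df.to_numpy().nbytes / 1e6:.2f} MB")
print(f"peak during:    {peak / 1e6:.2f} MB")
print(f"retained after: {current / 1e6:.2f} MB")
print(f"result dtypes:  {result.dtypes.unique()}")
```

A peak far above both the input size and the retained size would match the inverted V described above.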


