MinMaxScaler output datatype
Describe the workflow you want to enable
Currently, applying a MinMaxScaler to data that includes features with small datatypes like int8 results in float64 output. I would like to have a way to output to a datatype that is also some low precision type such as float16, and I don’t believe this is supported in the MinMaxScaler today without applying another transformation after applying the minmax scaler. This likely applies to other scaling functions as well.
I would like this capability in order to avoid running out of memory on large-ish datasets that could be operated on in one VM, but can’t after many of my columns turn into higher-than-necessary-precision datatypes.
Describe your proposed solution
I’d modify MinMaxScaler to accept an optional output data type argument and then cast values to that type while performing the necessary arithmetic for scaling.
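As a rough illustration of what such an argument could look like, here is a minimal sketch written as a subclass. The class name `CastingMinMaxScaler` and its `dtype` parameter are hypothetical and not part of scikit-learn. The sketch also assumes a scikit-learn version in which `MinMaxScaler` accepts the `clip` argument. It casts only after the parent class has done its arithmetic, so the intermediate float64 array still exists briefly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


class CastingMinMaxScaler(MinMaxScaler):
    """Hypothetical MinMaxScaler variant with an output dtype option."""

    def __init__(self, feature_range=(0, 1), *, copy=True, clip=False, dtype=None):
        super().__init__(feature_range=feature_range, copy=copy, clip=clip)
        # dtype=None keeps the current behavior (float64 output for int input).
        self.dtype = dtype

    def transform(self, X):
        # Scale with the stock implementation, then cast the result.
        scaled = super().transform(X)
        if self.dtype is None:
            return scaled
        return scaled.astype(self.dtype, copy=False)


data = np.array([[-1, 2], [-1, 6], [0, 10], [1, 18]], dtype=np.int8)
scaled = CastingMinMaxScaler(dtype=np.float32).fit_transform(data)
print(scaled.dtype)  # float32
```

Because the cast happens after scaling, this reduces the size of the array that is kept, not the peak memory used during the transform.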
Describe alternatives you’ve considered, if relevant
The casting operation could happen at many points, including after performing arithmetic in a higher precision type like float64, which could be important in some use cases to avoid loss of precision.
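One way to cast after the float64 arithmetic while still bounding peak memory is to transform the data in row chunks and write each chunk into a preallocated low-precision array. The helper below is a sketch of that idea using only existing NumPy and scikit-learn calls. The function name `transform_in_chunks` and its parameters are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def transform_in_chunks(scaler, X, out_dtype=np.float32, chunk_rows=10_000):
    """Apply a fitted scaler chunk by chunk, storing results as out_dtype."""
    out = np.empty(X.shape, dtype=out_dtype)
    for start in range(0, X.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Each chunk is scaled in float64, then cast on assignment into `out`.
        out[start:stop] = scaler.transform(X[start:stop])
    return out


rng = np.random.default_rng(0)
X = rng.integers(-100, 100, size=(25, 3), dtype=np.int8)
scaler = MinMaxScaler().fit(X)
out = transform_in_chunks(scaler, X, out_dtype=np.float32, chunk_rows=10)
print(out.dtype, out.shape)
```

With this approach only one chunk at a time exists in float64, so the extra memory is governed by `chunk_rows` rather than by the full dataset size.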
Additional context
Here’s an example, based on the MinMaxScaler docs, where applying the MinMaxScaler takes int8 data and turns the result into float64:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([[-1, 2], [-1, 6], [0, 10], [1, 18]], dtype=np.int8)
scaler = MinMaxScaler()
scaler.fit(data)  # the scaler must be fitted before transform() can be called
assert data.dtype == np.int8
assert scaler.transform(data).dtype == np.float64
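For comparison, the workaround available today (applying another transformation after the scaler) might look like the sketch below, which chains `MinMaxScaler` with a `FunctionTransformer` in a `Pipeline`. The helper name `to_float16` is an arbitrary choice. float16 keeps only about three significant decimal digits, so whether that precision is acceptable depends on the use case.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def to_float16(X):
    # Cast the already-scaled float64 array down to half precision.
    return X.astype(np.float16)


pipeline = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("cast", FunctionTransformer(to_float16)),
    ]
)

data = np.array([[-1, 2], [-1, 6], [0, 10], [1, 18]], dtype=np.int8)
result = pipeline.fit_transform(data)
print(result.dtype)  # float16
```

The float64 array is still created inside the pipeline before the cast, which is the extra step and extra memory the request aims to avoid.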
Issue Analytics
- State:
- Created 3 years ago
- Reactions:4
- Comments:6 (3 by maintainers)
Hey @nkelly13, if you comment with the single word “take” the bot will assign the task to you at which point you can develop and make a pull request when ready. Make sure to follow the contribution guidelines 😄.
I think we could add a dtype=None init parameter (with the current behavior by default) that would need to be passed here since there is indeed little control on how MinMaxScaler makes the dtype conversion.
Would you be interested in making a pull request?