High cardinality categorical outputs
Howdy,
I have a dataset where the output is one of 5k categories. I also have millions of samples. The naive representation of y_indices_naive
(the outputs) is:
[1,5,4300,...]
But it seems that Keras/Theano require one-hot encodings of the output.
Problem is, np_utils.to_categorical(y_indices_naive)
causes an out-of-memory error, because I would then need a 3M x 5k matrix.
Is there any way to get Keras to accept y_indices_naive
without converting it to one-hot? I would be happy to add some code if someone would point out how to best do it.
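The memory blow-up is easy to see with back-of-the-envelope arithmetic (sample and class counts taken from the issue; float32 storage is an assumption about what `to_categorical` produces):

```python
# Rough memory cost of a dense one-hot target matrix versus
# the integer-index representation described in the issue.
n_samples = 3_000_000
n_classes = 5_000
bytes_per_float32 = 4

# Dense one-hot: one float per (sample, class) pair.
dense_bytes = n_samples * n_classes * bytes_per_float32
print(f"dense one-hot: {dense_bytes / 1e9:.0f} GB")

# Integer indices: one int64 per sample.
sparse_bytes = n_samples * 8
print(f"integer labels: {sparse_bytes / 1e6:.0f} MB")
```

That is roughly 60 GB dense versus 24 MB as plain indices, which is why materializing the full one-hot matrix fails.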
Issue Analytics
- Created 8 years ago
- Comments:14 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The trick to fix the error about expecting a 3D input when using sparse_categorical_crossentropy is to format the outputs sparsely, with one index per row. So instead of formatting the output like this:
y_indices_naive = [1, 5, 4300, ...]
it should be formatted this way:
y_indices_naive = [[1], [5], [4300], ...]
That will make Keras happy and it'll train the model as expected.
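A minimal sketch of that reshape, using NumPy (the Keras lines at the end are illustrative assumptions and are left as comments):

```python
import numpy as np

# Integer class indices as in the issue.
y_indices_naive = np.array([1, 5, 4300])

# Reshape to one index per row, i.e. [[1], [5], [4300]] -- the layout
# the comment above says sparse_categorical_crossentropy expects.
y_sparse = y_indices_naive.reshape(-1, 1)
print(y_sparse.shape)  # (3, 1)

# Illustrative (assumed) Keras usage -- no one-hot encoding needed:
# model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
# model.fit(X, y_sparse, batch_size=128)
```

The point is that the targets stay as integers of shape (n, 1) rather than a dense (n, n_classes) matrix.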
Theano has no support for sparse operations as far as I know (and Keras certainly doesn't either), so all data will have to be converted to dense arrays at some point. However, a 5k-dimensional output space doesn't seem very large to me.
You can solve your OOM error by one-hot encoding and training batch-by-batch instead of on all 3M samples at once: break the dataset into small batches, and for each batch one-hot encode only that batch's labels before feeding it to the model.
As long as 1) your model fits in memory and 2) your batches are small enough, this will not cause any memory issues.
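The batch-by-batch scheme above can be sketched with a plain NumPy generator (function names and the batch size are illustrative; the one-hot step is written inline so the sketch stays self-contained):

```python
import numpy as np

def one_hot(indices, n_classes):
    """Dense one-hot encoding for a single small batch only."""
    out = np.zeros((len(indices), n_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

def batches(X, y_indices, n_classes, batch_size=128):
    """Yield (X_batch, one-hot y_batch); only one batch is dense at a time."""
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        yield X[start:stop], one_hot(y_indices[start:stop], n_classes)

# Hypothetical usage with Keras' train_on_batch:
# for X_b, y_b in batches(X_train, y_train, n_classes=5000):
#     model.train_on_batch(X_b, y_b)
```

Peak memory for the targets is then batch_size x n_classes floats rather than n_samples x n_classes.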