High cardinality categorical outputs

See original GitHub issue

Howdy,

I have a dataset where the output is one of 5k categories. I also have millions of samples. The naive representation of y_indices_naive (the outputs) is:

[1,5,4300,...]

But it seems that Keras/Theano require one-hot encodings of the output.

Problem is, np_utils.to_categorical(y_indices_naive) causes an out-of-memory error, because it materializes a dense 3M x 5k matrix.

Is there any way to get Keras to accept y_indices_naive without converting it to one-hot? I would be happy to add some code if someone would point out how to best do it.
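
For scale, here's a quick back-of-the-envelope check of the dense encoding's footprint, using the numbers from the question and assuming 4-byte float32 entries (older to_categorical implementations defaulted to float64, which doubles this):

n_samples = 3_000_000   # ~3M samples
n_classes = 5_000       # 5k categories
bytes_per_entry = 4     # float32
print(n_samples * n_classes * bytes_per_entry / 1e9)  # ~60.0 GB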

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
paulomalvar commented, Apr 28, 2017

The trick to fix the error about expecting a 3D input when using sparse_categorical_crossentropy is to give the labels an extra trailing dimension. So instead of formatting the output like this:

y_indices_naive = [1,5,4300,...]

it should be formatted this way:

y_indices_naive = [[1,], [5,], [4300,], ...]

That will make Keras happy, and it'll train the model as expected.
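
Here's a minimal sketch of that fix, assuming the modern tf.keras API; the model architecture, input width, and data below are illustrative placeholders, not from the thread:

import numpy as np
from tensorflow import keras

num_classes = 5000

# Toy classifier over 5k classes; layer sizes are arbitrary.
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(100,)),
    keras.layers.Dense(num_classes, activation="softmax"),
])

# sparse_categorical_crossentropy consumes integer class indices directly,
# so the n x 5000 one-hot matrix is never materialized.
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

X = np.random.rand(8, 100).astype("float32")                  # toy inputs
y = np.array([[1], [5], [4300], [7], [0], [42], [999], [3]])  # shape (8, 1)

model.fit(X, y, batch_size=4, epochs=1)

Note that in current Keras a flat label vector of shape (n,) also works with this loss; the extra trailing dimension mainly matters for sequence models, which is where the 3D-input error above comes from.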

2 reactions
fchollet commented, Aug 6, 2015

Theano has no support for sparse operations as far as I know (and Keras certainly doesn't either), so all data will have to be converted to dense arrays at some point. However, a 5k-dimensional output space doesn't seem very large to me.

You can solve your OOM error by one-hot encoding and training batch by batch instead of on all 3M samples at once. Break your dataset into small batches, and for each batch:

y_batch = np_utils.to_categorical(y_indices_batch, nb_classes=5000)
model.train_on_batch(X_batch, y_batch)

As long as 1) your model fits in memory and 2) your batches are small enough, this will not cause any memory issues.
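
A self-contained sketch of that loop, using toy data sizes, a placeholder model, and the modern keras.utils.to_categorical spelling (the nb_classes argument from the snippet above is now called num_classes):

import numpy as np
from tensorflow import keras

nb_classes = 5000
X_train = np.random.rand(1024, 100).astype("float32")    # toy stand-in for 3M samples
y_indices = np.random.randint(0, nb_classes, size=1024)  # integer labels

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(100,)),
    keras.layers.Dense(nb_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

def one_hot_batches(X, y, batch_size=128):
    # Densify one batch at a time: each y_batch is batch_size x 5000,
    # so the full dataset is never one-hot encoded in memory.
    for start in range(0, len(X), batch_size):
        sl = slice(start, start + batch_size)
        yield X[sl], keras.utils.to_categorical(y[sl], nb_classes)

for epoch in range(2):
    for X_batch, y_batch in one_hot_batches(X_train, y_indices):
        model.train_on_batch(X_batch, y_batch)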

Read more comments on GitHub.

Top Results From Across the Web

  • Encoding of categorical variables with high cardinality — In target encoding, categorical features are replaced with the mean target value of each respective category (a toy sketch follows this list).
  • How to deal with Features having high cardinality (Kaggle) — Steps for dealing with high cardinality in your data: 1) check for unique values in your feature; 2) …
  • Dealing with categorical features with high cardinality (Medium) — One very common step in any feature engineering task is converting categorical features into numerical ones.
  • Combating High Cardinality Features in Supervised Machine … — For categorical variables, the number of possible splits grows non-linearly with cardinality. If we are splitting the categorical values into 2 …
  • Categorical features with high cardinality: Dealing with … — Categorical data can pose a serious problem if it has high cardinality, i.e. too many unique values.
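
As a rough illustration of the target-encoding idea from the first result above (toy pandas data, and a deliberately naive in-sample version; real use needs out-of-fold encoding to avoid target leakage):

import pandas as pd

# Toy frame: "city" is a high-cardinality categorical, "target" is binary.
df = pd.DataFrame({
    "city":   ["NYC", "LA", "NYC", "SF", "LA", "NYC"],
    "target": [1, 0, 1, 0, 1, 1],
})

# Replace each category with the mean target value of that category.
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
# NYC -> 1.0, LA -> 0.5, SF -> 0.0: one float column instead of
# thousands of one-hot columns.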
