High cardinality categorical outputs

See original GitHub issue

Howdy,

I have a dataset where the output is one of 5k categories. I also have millions of samples. The naive representation of y_indices_naive (the outputs) is:

[1,5,4300,...]

But it seems that Keras/Theano require one-hot encodings of the output.

Problem is, np_utils.to_categorical(y_indices_naive) causes an out-of-memory error, because it materializes a dense 3M x 5k matrix.

Is there any way to get Keras to accept y_indices_naive without converting it to one-hot? I would be happy to add some code if someone would point out how to best do it.
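
For scale, here's a quick back-of-the-envelope check of the dense encoding's footprint, using the numbers from the question and assuming 4-byte float32 entries (older to_categorical implementations defaulted to float64, which doubles this):

n_samples = 3_000_000   # ~3M samples
n_classes = 5_000       # 5k categories
bytes_per_entry = 4     # float32
print(n_samples * n_classes * bytes_per_entry / 1e9)  # ~60.0 GB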

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
paulomalvar commented, Apr 28, 2017

The trick to fix the error about expecting a 3D input when using sparse_categorical_crossentropy is to give the labels an extra trailing dimension. So instead of formatting the output like this:

y_indices_naive = [1,5,4300,...]

it should be formatted this way:

y_indices_naive = [[1,], [5,], [4300,], ...]

That will make Keras happy, and it'll train the model as expected.
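
Here's a minimal sketch of that fix, assuming the modern tf.keras API; the model architecture, input width, and data below are illustrative placeholders, not from the thread:

import numpy as np
from tensorflow import keras

num_classes = 5000

# Toy classifier over 5k classes; layer sizes are arbitrary.
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(100,)),
    keras.layers.Dense(num_classes, activation="softmax"),
])

# sparse_categorical_crossentropy consumes integer class indices directly,
# so the n x 5000 one-hot matrix is never materialized.
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

X = np.random.rand(8, 100).astype("float32")                  # toy inputs
y = np.array([[1], [5], [4300], [7], [0], [42], [999], [3]])  # shape (8, 1)

model.fit(X, y, batch_size=4, epochs=1)

Note that in current Keras a flat label vector of shape (n,) also works with this loss; the extra trailing dimension mainly matters for sequence models, which is where the 3D-input error above comes from.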

2 reactions
fchollet commented, Aug 6, 2015

Theano has no support for sparse operations as far as I know (and Keras certainly doesn't either), so all data will have to be converted to dense arrays at some point. However, a 5k-dimensional output space doesn't seem very large to me.

You can solve your OOM error by one-hot encoding and training batch by batch instead of on all 3M samples at once. Break your dataset into small batches, and for each batch:

y_batch = np_utils.to_categorical(y_indices_batch, nb_classes=5000)
model.train_on_batch(X_batch, y_batch)

As long as 1) your model fits in memory and 2) your batches are small enough, this will not cause any memory issues.
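
A self-contained sketch of that loop, using toy data sizes, a placeholder model, and the modern keras.utils.to_categorical spelling (the nb_classes argument from the snippet above is now called num_classes):

import numpy as np
from tensorflow import keras

nb_classes = 5000
X_train = np.random.rand(1024, 100).astype("float32")    # toy stand-in for 3M samples
y_indices = np.random.randint(0, nb_classes, size=1024)  # integer labels

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(100,)),
    keras.layers.Dense(nb_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

def one_hot_batches(X, y, batch_size=128):
    # Densify one batch at a time: each y_batch is batch_size x 5000,
    # so the full dataset is never one-hot encoded in memory.
    for start in range(0, len(X), batch_size):
        sl = slice(start, start + batch_size)
        yield X[sl], keras.utils.to_categorical(y[sl], nb_classes)

for epoch in range(2):
    for X_batch, y_batch in one_hot_batches(X_train, y_indices):
        model.train_on_batch(X_batch, y_batch)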

Read more comments on GitHub.

Top Results From Across the Web

  • Encoding of categorical variables with high cardinality — In target encoding, categorical features are replaced with the mean target value of each respective category (a toy sketch follows this list).
  • How to deal with Features having high cardinality (Kaggle) — Steps for dealing with high cardinality in your data: 1) check for unique values in your feature; 2) …
  • Dealing with categorical features with high cardinality (Medium) — One very common step in any feature engineering task is converting categorical features into numerical ones.
  • Combating High Cardinality Features in Supervised Machine … — For categorical variables, the number of possible splits grows non-linearly with cardinality. If we are splitting the categorical values into 2 …
  • Categorical features with high cardinality: Dealing with … — Categorical data can pose a serious problem if it has high cardinality, i.e. too many unique values.
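
As a rough illustration of the target-encoding idea from the first result above (toy pandas data, and a deliberately naive in-sample version; real use needs out-of-fold encoding to avoid target leakage):

import pandas as pd

# Toy frame: "city" is a high-cardinality categorical, "target" is binary.
df = pd.DataFrame({
    "city":   ["NYC", "LA", "NYC", "SF", "LA", "NYC"],
    "target": [1, 0, 1, 0, 1, 1],
})

# Replace each category with the mean target value of that category.
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
# NYC -> 1.0, LA -> 0.5, SF -> 0.0: one float column instead of
# thousands of one-hot columns.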
