
[FEA] Multi-hot categorical support


Issue by alecgunny, Wednesday May 20, 2020 at 17:25 GMT. Originally opened as https://github.com/rapidsai/recsys/issues/155.


Multi-hot categorical features are a ubiquitous and important element of all production deep recommendation workflows. The ability to robustly learn dense representations of these features in an end-to-end manner represents one of the main advantages of deep learning approaches. Supporting representations of and transformations on these features is critical to broader adoption, and complex enough to possibly warrant a dedicated milestone. I’m creating this issue to start the discussion of how/where we want to start mapping out these solutions.

The most obvious first step to me is to decide on a representation pattern, as this will determine how we build op support on those representations. Some options include:

  1. Dense dataframes padded with zeroes to some max number of elements
  2. Sparse dataframe version of the above
  3. Ragged-array series elements in dataframes
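As a rough illustration (plain Python with made-up data, not an actual cudf layout), the same multi-hot column can be held in each of the three candidate layouts:

```python
# One multi-hot categorical column for three samples (hypothetical data).
rows = [[3, 7], [5], [7, 2, 9]]

# Option 1: dense, zero-padded to a fixed max number of elements (here 3).
max_len = 3
dense = [row + [0] * (max_len - len(row)) for row in rows]
# dense == [[3, 7, 0], [5, 0, 0], [7, 2, 9]]

# Option 2: sparse COO-style (row_index, category, value) tuples.
sparse = [(r, cat, 1) for r, row in enumerate(rows) for cat in row]

# Option 3: ragged layout -- one flat values array plus per-row offsets.
values = [cat for row in rows for cat in row]
offsets = [0]
for row in rows:
    offsets.append(offsets[-1] + len(row))
# values == [3, 7, 5, 7, 2, 9]; offsets == [0, 2, 3, 6]
```

Row `i` of the ragged layout is recovered as `values[offsets[i]:offsets[i + 1]]`, which is the same values/offsets scheme the cudf list column discussion below relies on.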

Option 1 would require the least overhead to extend support to, but obviously wastes memory and could be prohibitive for features that have category counts ranging over orders of magnitude (as is common). It also requires users to specify the max number of elements beforehand, which may not be known (unless we give them an op to compute it) and could change over time, potentially wasting memory or throwing out good data.

Options 2 and 3 would probably end up being pretty similar (I would imagine that specifying a max number of elements would end up being necessary for option 3), but 3 feels cleaner as it keeps features localized to individual columns (instead of spread out over many sparse columns) and keeps us from having to mix sparse and dense dataframes. It’s also technically more memory efficient, since instead of each row requiring N (row_index, column_index, value) tuples, where N is the number of categories taken on by a given sample, you just need the array of N values and a single offset int.

One thing worth considering, though, is that if repeated categories are common, the ragged representation can become more memory intensive, since the value int in the sparse tuple would represent the k number of times that category occurs, while you would need k ints in the ragged representation for each time the category occurred.
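Counting stored integers makes the trade-off concrete (a hypothetical single row where one category repeats k times):

```python
# Hypothetical row: category 7 occurs k = 5 times in one sample.
k = 5

# Sparse layout stores one (row_index, column_index, value) tuple,
# with value = k -- three ints regardless of k.
sparse_ints = 3

# Ragged layout stores k copies of the category id plus one offset int.
ragged_ints = k + 1

# For k > 2 the sparse tuple is smaller; for k == 1 the ragged layout wins.
assert ragged_ints > sparse_ints
```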

One deciding factor in all of this is how we expect the APIs for multi-hot embeddings to be implemented. One method is to implement essentially a sparse matrix multiplication against the multi-hot encoded vector for each sample (with possibly some normalization to implement things like mean aggregation instead of just sum), which will be more efficient in the case of commonly repeated categories and, obviously, lends itself to the sparse representation. The other is to just perform a regular lookup on all the values and aggregate in segments using the offsets, which will lend itself to the ragged representation.
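The two embedding strategies produce the same sum aggregation; a minimal pure-Python sketch (toy embedding table and data, no framework assumed) shows both paths agreeing:

```python
# Tiny embedding table: 4 categories x 2 dims (made-up numbers).
table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]

# One sample's multi-hot categories in ragged form (category 3 repeated).
values = [1, 3, 3]
offsets = [0, 3]  # single sample

# Method 1: sparse matrix multiply -- collapse repeats into counts, then
# weight each category's embedding row by its count.
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1
sparse_sum = [sum(c * table[cat][d] for cat, c in counts.items())
              for d in range(2)]

# Method 2: plain lookup of every value, then segmented sum over offsets.
start, end = offsets[0], offsets[1]
lookup_sum = [sum(table[v][d] for v in values[start:end]) for d in range(2)]

assert sparse_sum == lookup_sum  # both aggregations agree
```

Mean aggregation falls out of either method by dividing by `end - start` (or by the total count in the sparse case), which is the normalization mentioned above.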

Long term, offering options for both representation and embedding choices will probably be most valuable to users. In the short term, it’s worth picking one and starting to push for cudf support for it so we can begin to build op support. My personal vote is the ragged array option, since it will already be consistent with the PyTorch EmbeddingBag API, which we can port to TensorFlow, and seems like it would require the least overhead to support (the sparse option seems like an extension of its functionality). Either way, even if it’s not speed-of-light (SOL) performance in all cases, having one version accelerated is better than the existing options.

This requires some support in cudf to add list types:

  • Add nested list types to cudf
  • Read parquet files with nested lists
  • Write parquet files with nested lists
  • Python API support for list dtypes
  • Read/Write access to list values in a cudf dataframe (for instance, to hash / categorify the elements of a list)

NVTabular changes include:

  • Support for list types in the Categorify op
  • Support for list types in the Hashing op
  • TensorFlow dataloader support
  • PyTorch dataloader support
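What Categorify over a list column needs to do can be sketched in plain Python (hypothetical data and encoding; the real op would build its vocabulary on the GPU and typically reserve an id for nulls/out-of-vocabulary values):

```python
# Hypothetical list-valued column of raw categories.
col = [["cat", "dog"], [], ["dog", "bird", "cat"]]

# Build the category -> id mapping from every list element.
vocab = {}
for row in col:
    for item in row:
        if item not in vocab:
            vocab[item] = len(vocab)

# Categorify: encode each element, preserving the ragged structure.
encoded = [[vocab[item] for item in row] for row in col]
# encoded == [[0, 1], [], [1, 2, 0]]
```

The Hashing op is the same traversal with `vocab[item]` replaced by a hash of the element modulo the number of buckets, which avoids materializing a vocabulary.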

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

benfred commented, Jul 23, 2020 (1 reaction)

We’re planning on using option ‘3’ above: using cudf list columns to represent multihot categoricals for each sample. Cudf list columns are still in progress, but there is basic python API access now - giving us the values for all rows and the offsets into the values array for each row:

```python
df = cudf.DataFrame({"a": [[1], [], []], "b": [[1, 2, 3], [4, 5], [6]]})

# cudf series with the values for all rows: 1, 2, 3, 4, 5, 6
df.b.list.leaves

# Returns the offsets (as a cupy array); there might be a better way of
# doing this eventually.
df.b._column.offsets.values
```
EvenOldridge commented, Jun 8, 2020 (1 reaction)

Apologies; I didn’t intend to link to users in that comment. Thanks for the humorous response. 😃
