[FEA] Multi-hot categorical support
Issue by alecgunny, Wednesday May 20, 2020 at 17:25 GMT. _Originally opened as https://github.com/rapidsai/recsys/issues/155_
Multi-hot categorical features are a ubiquitous and important element of all production deep recommendation workflows. The ability to robustly learn dense representations of these features in an end-to-end manner represents one of the main advantages of deep learning approaches. Supporting representations of and transformations on these features is critical to broader adoption, and complex enough to possibly warrant a dedicated milestone. I’m creating this issue to start the discussion of how/where we want to start mapping out these solutions.
The most obvious first step to me is to decide on a representation pattern, as this will determine how we build op support on those representations. Some options include:
1. Dense dataframes padded with zeroes to some max number of elements
2. A sparse dataframe version of the above
3. Ragged-array series elements in dataframes
Option 1 would require the least overhead to extend support to, but obviously wastes memory and could be prohibitive for features that have category counts ranging over orders of magnitude (as is common). It also requires users to specify the max number of elements beforehand, which may not be known (unless we give them an op to compute it) and could change over time, potentially wasting memory or throwing out good data.
Options 2 and 3 would probably end up being pretty similar (I would imagine that specifying a max number of elements would end up being necessary for option 3), but 3 feels cleaner as it keeps features localized to individual columns (instead of spread out over many sparse columns) and keeps us from having to mix sparse and dense dataframes. It's also technically more memory efficient, since instead of each row requiring N `(row_index, column_index, value)` tuples, where N is the number of categories taken on by a given sample, you just need the array of N values and a single offset int.
One thing worth considering, though, is that if repeated categories are common, the ragged representation can become more memory intensive: the `value` int in a sparse tuple can represent the number of times k that a category occurs, while the ragged representation needs k ints, one for each occurrence.
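To make the trade-offs concrete, here is a small illustration of the same three rows under each representation (plain NumPy as a stand-in, not a proposal for the actual cudf layout); the repeated category in the last row shows where the sparse tuples win:

```python
import numpy as np

# Three samples of a multi-hot feature (category ids):
#   row 0: [3, 7]
#   row 1: [1]
#   row 2: [7, 7, 7]   <- the same category repeated three times

# Option 1: dense, zero-padded to the max multiplicity (here 3).
# Wastes memory on short rows and needs the max known up front.
dense = np.array([
    [3, 7, 0],
    [1, 0, 0],
    [7, 7, 7],
])

# Option 2: sparse (row_index, category_index, value) tuples, where
# value is the number of occurrences. The repeated category in row 2
# collapses into a single tuple.
sparse = [
    (0, 3, 1), (0, 7, 1),
    (1, 1, 1),
    (2, 7, 3),
]

# Option 3: ragged - one flat values array plus one offsets array,
# where row i is values[offsets[i]:offsets[i+1]]. One int per
# occurrence plus one offset per row, but the repeat in row 2 costs
# three ints instead of one tuple.
values = np.array([3, 7, 1, 7, 7, 7])
offsets = np.array([0, 2, 3, 6])
```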
One deciding factor in all of this is how we expect the APIs for multi-hot embeddings to be implemented. One method is essentially a sparse matrix multiplication against the multi-hot encoded vector for each sample (possibly with some normalization to implement things like mean aggregation instead of just sum), which will be more efficient in the case of commonly repeated categories and obviously lends itself to the sparse representation. The other is to perform a regular lookup on all the values and aggregate in segments using the offsets, which lends itself to the ragged representation.
Long term, offering options for both representation and embedding choices will probably be most valuable to users. In the short term, it's worth picking one and starting to push for cudf support for it so we can begin to build op support. My personal vote is the ragged array option, since it will already be consistent with the PyTorch EmbeddingBag API (which we can port to TensorFlow) and seems like it would require the least overhead to support (the sparse option seems like an extension of its functionality). Either way, even if it's not speed-of-light (SOL) in all cases, having one accelerated version is better than the existing options.
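For reference, a minimal sketch of the offsets-based lookup using the existing PyTorch EmbeddingBag API (the table size, embedding width, and input values are made up for illustration):

```python
import torch
import torch.nn as nn

# Ragged multi-hot input: flat category ids plus per-row start offsets,
# i.e. row i is values[offsets[i]:offsets[i+1]].
values = torch.tensor([3, 7, 1, 7, 7, 7])
offsets = torch.tensor([0, 2, 3])          # starts of rows 0, 1, 2

# EmbeddingBag looks up every id and aggregates within each segment,
# here with mean aggregation (sum and max are also supported).
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
row_embeddings = bag(values, offsets)      # shape: (3, 4)
```

Note that EmbeddingBag takes only the per-row start offsets (no terminal entry); the last row runs to the end of the values tensor.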
This requires some support in cudf to add list types:
- Add nested list types to cudf
- Read parquet files with nested lists
- Write parquet files with nested lists
- Python API support for list dtypes
- Read/write access to list values in a cudf dataframe (for instance to hash / categorify the elements of a list; a rough sketch of this follows the list)
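On that last point, element-wise ops like hashing only need to touch the flat values array; the offsets stay untouched, so the list structure is preserved. A rough sketch (plain NumPy stand-in; the bucket count and hash are placeholders, not the real NVTabular op):

```python
import numpy as np

# Flat representation of a list column (see the earlier sketch).
values = np.array([3, 7, 1, 7, 7, 7])
offsets = np.array([0, 2, 3, 6])

# Hashing / categorifying is element-wise on the flat values array.
# NUM_BUCKETS and the multiplicative hash are illustrative placeholders.
NUM_BUCKETS = 1000
hashed_values = (values * 2654435761 % 2**32) % NUM_BUCKETS

# The offsets are unchanged, so (hashed_values, offsets) is still the
# same ragged column, just with transformed leaf values.
```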
NVTabular changes include:
- Support for list types in Categorify op
- Support list types in Hashing op
- TensorFlow dataloader support
- PyTorch dataloader support (a rough hand-off sketch for both frameworks follows this list)
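A rough idea of what the dataloader hand-off could look like for each framework, assuming the loader already has the flat values and offsets available (the conversion path is illustrative, not the final NVTabular API):

```python
import numpy as np
import tensorflow as tf
import torch

values = np.array([3, 7, 1, 7, 7, 7], dtype=np.int64)
offsets = np.array([0, 2, 3, 6], dtype=np.int64)

# PyTorch: EmbeddingBag wants per-row start offsets (no terminal entry).
torch_values = torch.as_tensor(values)
torch_offsets = torch.as_tensor(offsets[:-1])

# TensorFlow: RaggedTensor wants the full row_splits (with terminal entry).
tf_ragged = tf.RaggedTensor.from_row_splits(values=values, row_splits=offsets)
```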
Top GitHub Comments
We’re planning on using option 3 above: using cudf list columns to represent multi-hot categoricals for each sample. cudf list columns are still in progress, but there is basic Python API access now, giving us the values for all rows and the offsets into the values array for each row:
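A hedged sketch of what that access looks like (the exact attribute names below are my guess at the in-progress list accessor, not taken from the original comment):

```python
import cudf

# A list column representing a multi-hot feature for three samples.
s = cudf.Series([[3, 7], [1], [7, 7, 7]])

# Flattened leaf values for all rows (attribute name assumed).
flat_values = s.list.leaves

# Row offsets into the flattened values (attribute path assumed);
# row i is flat_values[offsets[i]:offsets[i+1]].
offsets = s._column.offsets
```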
Apologies; I didn’t intend to link to users in that comment. Thanks for the humorous response. 😃