[FEA] Multi-hot categorical support
Issue by alecgunny, Wednesday May 20, 2020 at 17:25 GMT. _Originally opened as https://github.com/rapidsai/recsys/issues/155_
Multi-hot categorical features are a ubiquitous and important element of all production deep recommendation workflows. The ability to robustly learn dense representations of these features in an end-to-end manner represents one of the main advantages of deep learning approaches. Supporting representations of and transformations on these features is critical to broader adoption, and complex enough to possibly warrant a dedicated milestone. I’m creating this issue to start the discussion of how/where we want to start mapping out these solutions.
The most obvious first step to me is to decide on a representation pattern, as this will determine how we build op support on those representations. Some options include:
1. Dense dataframes padded with zeroes to some max number of elements
2. A sparse dataframe version of the above
3. Ragged-array series elements in dataframes
Option 1 would require the least overhead to extend support to, but obviously wastes memory and could be prohibitive for features that have category counts ranging over orders of magnitude (as is common). It also requires users to specify the max number of elements beforehand, which may not be known (unless we give them an op to compute it) and could change over time, potentially wasting memory or throwing out good data.
Options 2 and 3 would probably end up being pretty similar (I would imagine that specifying a max number of elements would end up being necessary for option 3), but 3 feels cleaner as it keeps features localized to individual columns (instead of spread out over many sparse columns) and keeps us from having to mix sparse and dense dataframes. It's also technically more memory efficient, since instead of each row requiring N `(row_index, column_index, value)` tuples, where N is the number of categories taken on by a given sample, you just need the array of N values and a single offset int.
One thing worth considering, though, is that if repeated categories are common, the ragged representation can become more memory intensive: the `value` int in a sparse tuple can represent the number of times k that a category occurs, while the ragged representation needs k ints, one for each occurrence.
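To make the trade-offs concrete, here is a small illustration of the same three rows under each representation (plain NumPy as a stand-in, not a proposal for the actual cudf layout); the repeated category in the last row shows where the sparse tuples win:

```python
import numpy as np

# Three samples of a multi-hot feature (category ids):
#   row 0: [3, 7]
#   row 1: [1]
#   row 2: [7, 7, 7]   <- the same category repeated three times

# Option 1: dense, zero-padded to the max multiplicity (here 3).
# Wastes memory on short rows and needs the max known up front.
dense = np.array([
    [3, 7, 0],
    [1, 0, 0],
    [7, 7, 7],
])

# Option 2: sparse (row_index, category_index, value) tuples, where
# value is the number of occurrences. The repeated category in row 2
# collapses into a single tuple.
sparse = [
    (0, 3, 1), (0, 7, 1),
    (1, 1, 1),
    (2, 7, 3),
]

# Option 3: ragged - one flat values array plus one offsets array,
# where row i is values[offsets[i]:offsets[i+1]]. One int per
# occurrence plus one offset per row, but the repeat in row 2 costs
# three ints instead of one tuple.
values = np.array([3, 7, 1, 7, 7, 7])
offsets = np.array([0, 2, 3, 6])
```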
One deciding factor in all of this is how we expect the APIs for multi-hot embeddings to be implemented. One method is essentially a sparse matrix multiplication against the multi-hot encoded vector for each sample (possibly with some normalization to implement things like mean aggregation instead of just sum), which will be more efficient in the case of commonly repeated categories and obviously lends itself to the sparse representation. The other is to perform a regular lookup on all the values and aggregate in segments using the offsets, which lends itself to the ragged representation.
Long term, offering options for both representation and embedding choices will probably be most valuable to users. In the short term, it's worth picking one and starting to push for cudf support for it so we can begin to build op support. My personal vote is the ragged array option, since it will already be consistent with the PyTorch EmbeddingBag API (which we can port to TensorFlow) and seems like it would require the least overhead to support (the sparse option seems like an extension of its functionality). Either way, even if it's not speed-of-light (SOL) in all cases, having one accelerated version is better than the existing options.
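For reference, a minimal sketch of the offsets-based lookup using the existing PyTorch EmbeddingBag API (the table size, embedding width, and input values are made up for illustration):

```python
import torch
import torch.nn as nn

# Ragged multi-hot input: flat category ids plus per-row start offsets,
# i.e. row i is values[offsets[i]:offsets[i+1]].
values = torch.tensor([3, 7, 1, 7, 7, 7])
offsets = torch.tensor([0, 2, 3])          # starts of rows 0, 1, 2

# EmbeddingBag looks up every id and aggregates within each segment,
# here with mean aggregation (sum and max are also supported).
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
row_embeddings = bag(values, offsets)      # shape: (3, 4)
```

Note that EmbeddingBag takes only the per-row start offsets (no terminal entry); the last row runs to the end of the values tensor.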
This requires some support in cudf to add list types:
- Add nested list types to cudf
- Read parquet files with nested lists
- Write parquet files with nested lists
- Python API support for list dtypes
- Read/write access to list values in a cudf dataframe (for instance to hash / categorify the elements of a list; a rough sketch of this follows the list)
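On that last point, element-wise ops like hashing only need to touch the flat values array; the offsets stay untouched, so the list structure is preserved. A rough sketch (plain NumPy stand-in; the bucket count and hash are placeholders, not the real NVTabular op):

```python
import numpy as np

# Flat representation of a list column (see the earlier sketch).
values = np.array([3, 7, 1, 7, 7, 7])
offsets = np.array([0, 2, 3, 6])

# Hashing / categorifying is element-wise on the flat values array.
# NUM_BUCKETS and the multiplicative hash are illustrative placeholders.
NUM_BUCKETS = 1000
hashed_values = (values * 2654435761 % 2**32) % NUM_BUCKETS

# The offsets are unchanged, so (hashed_values, offsets) is still the
# same ragged column, just with transformed leaf values.
```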
NVTabular changes include:
- Support for list types in Categorify op
- Support list types in Hashing op
- TensorFlow dataloader support
- PyTorch dataloader support (a rough hand-off sketch for both frameworks follows this list)
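A rough idea of what the dataloader hand-off could look like for each framework, assuming the loader already has the flat values and offsets available (the conversion path is illustrative, not the final NVTabular API):

```python
import numpy as np
import tensorflow as tf
import torch

values = np.array([3, 7, 1, 7, 7, 7], dtype=np.int64)
offsets = np.array([0, 2, 3, 6], dtype=np.int64)

# PyTorch: EmbeddingBag wants per-row start offsets (no terminal entry).
torch_values = torch.as_tensor(values)
torch_offsets = torch.as_tensor(offsets[:-1])

# TensorFlow: RaggedTensor wants the full row_splits (with terminal entry).
tf_ragged = tf.RaggedTensor.from_row_splits(values=values, row_splits=offsets)
```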
Top GitHub Comments
We’re planning on using option 3 above: using cudf list columns to represent multi-hot categoricals for each sample. cudf list columns are still in progress, but there is basic Python API access now, giving us the values for all rows and the offsets into the values array for each row:
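A hedged sketch of what that access looks like (the exact attribute names below are my guess at the in-progress list accessor, not taken from the original comment):

```python
import cudf

# A list column representing a multi-hot feature for three samples.
s = cudf.Series([[3, 7], [1], [7, 7, 7]])

# Flattened leaf values for all rows (attribute name assumed).
flat_values = s.list.leaves

# Row offsets into the flattened values (attribute path assumed);
# row i is flat_values[offsets[i]:offsets[i+1]].
offsets = s._column.offsets
```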
Apologies; I didn’t intend to link to users in that comment. Thanks for the humorous response. 😃