[FEA] Add options to reserve categorical indices in the Categorify() op
Is your feature request related to a problem? Please describe.
Some pipelines rely on categorical features having certain reserved values after preprocessing. For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) on the model side, as not all sequences have the same length. It is also important to reserve a value (e.g. 1) for Out-Of-Vocabulary (OOV) categorical values that might appear during transform() (e.g. during evaluation or inference).
Describe the solution you’d like
Create one argument for the Categorify() op to set the desired oov_index (to which all OOV values will be mapped during preprocessing) and another argument, start_index, to set the first value that should be used to encode known values (e.g. 2). We can also have a similar option for null values (null_index), so that they are mapped to a known index.
In this case, if we call Categorify(oov_index=1, null_index=2, start_index=3), the index 0 could be safely used for padding, for example.
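To make the proposed mapping concrete, here is a minimal pure-Python sketch of the intended behavior. It does not use NVTabular: the functions fit_vocab and encode_with_reserved are hypothetical, and only the argument names oov_index, null_index, and start_index come from the proposal. Assigning indices in first-seen order is an arbitrary choice for this illustration.

```python
def fit_vocab(values, start_index=0):
    """Assign consecutive indices, beginning at start_index, to the
    unique non-null values in first-seen order (hypothetical helper)."""
    vocab = {}
    for v in values:
        if v is not None and v not in vocab:
            vocab[v] = start_index + len(vocab)
    return vocab


def encode_with_reserved(values, vocab, oov_index, null_index):
    """Encode values: nulls go to null_index, unseen values to oov_index,
    known values to their index in vocab (hypothetical helper)."""
    encoded = []
    for v in values:
        if v is None:
            encoded.append(null_index)
        else:
            encoded.append(vocab.get(v, oov_index))
    return encoded


# "Fit" on training data, reserving 0 (padding), 1 (OOV), and 2 (null).
train = ["a", "b", None, "a", "c"]
vocab = fit_vocab(train, start_index=3)

# "Transform" data that contains a null and an unseen category "z".
encoded = encode_with_reserved(["a", "c", None, "z"], vocab,
                               oov_index=1, null_index=2)
print(vocab)    # known categories start at 3
print(encoded)  # null -> 2, unseen "z" -> 1
```

Index 0 never appears in the encoded output, so it stays free for padding.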
I think we don’t need a padding_index, because it is specific to sequential features. It is better to give users the flexibility to reserve a number of categorical values by using start_index.
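As an illustration of why a reserved index 0 is useful for sequential features, the sketch below pads variable-length sessions of already-encoded ids with 0. The function pad_sequences is hypothetical and not part of NVTabular, and the ids are assumed to start at 3, as in the Categorify(oov_index=1, null_index=2, start_index=3) example.

```python
def pad_sequences(seqs, max_len, padding_index=0):
    """Truncate or right-pad each sequence to max_len using padding_index
    (hypothetical helper; the padding happens on the model side)."""
    padded = []
    for seq in seqs:
        trimmed = seq[:max_len]
        padded.append(trimmed + [padding_index] * (max_len - len(trimmed)))
    return padded


# Encoded sessions of different lengths; no real category uses index 0.
sessions = [[3, 4, 5], [4], [5, 3]]
padded = pad_sequences(sessions, max_len=3)
print(padded)
```

Because no known, null, or OOV value is encoded as 0, the padded positions cannot be confused with real categories.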
Issue Analytics
- Created 2 years ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
I think we should simplify, and just specify a start_index with a default value of 0.
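A rough sketch of what this simplified, single-argument variant could look like is shown below. It is plain Python rather than the actual Categorify code; the function encode and the 0-based rank vocabulary are hypothetical. It assumes that unknown values take start_index itself and known values follow it, which is one reading of the _encode pointer in the next comment.

```python
def encode(values, vocab, start_index=0):
    """vocab maps each known category to a 0-based rank (hypothetical).
    Unknown values -> start_index; known values -> start_index + 1 + rank."""
    return [
        start_index + 1 + vocab[v] if v in vocab else start_index
        for v in values
    ]


vocab = {"a": 0, "b": 1, "c": 2}

# With the default start_index=0, unknown values are encoded as 0.
default = encode(["a", "z", "c"], vocab)

# With start_index=16, indices 0..15 are left free for the user to reserve.
shifted = encode(["a", "z", "c"], vocab, start_index=16)
print(default, shifted)
```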
The implementation of Categorify has gotten pretty complex, so I thought it might help to put some pointers on where to get started with this: update the _encode function to set the unknown values to the start_index specified by the user. The simple case where we aren’t doing any OOV hashing will involve updating these lines: https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L1110-L1116 , but this will get more complicated with the hash bucket code. It might be worth just trying to get this working without hashing first, and then tackling hashing next.
@gabrielspmoreira I did not see how to add you as a reviewer for my draft PR, but when you have a chance, would you be able to check whether the implemented tests cover your intended functionality? This draft PR consists of the commit 642e5f503e33c8440d6a186c7a9a8c29cc33618b.