[FEA] Add options to reserve categorical indices in the Categorify() op

Is your feature request related to a problem? Please describe.

Some pipelines rely on categorical features having certain reserved values after preprocessing. For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) on the model side, since not all sequences have the same length. It is also important to reserve a value (e.g. 1) for Out-Of-Vocabulary (OOV) categorical values that might appear during transform() (e.g. during evaluation or inference).
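To make the padding concern concrete, here is a minimal sketch in plain Python. The helper name pad_sequences and the constant PAD are assumptions made for illustration and are not part of NVTabular. It shows why index 0 has to stay unused by real categories: once sequences are padded with 0, a model cannot tell a padded position from a category that was also encoded as 0.

```python
# Hypothetical illustration only; not NVTabular code.
PAD = 0  # assumed padding value, reserved so that no real category uses it


def pad_sequences(seqs, max_len, pad_value=PAD):
    """Truncate or right-pad each sequence of category ids to max_len."""
    return [seq[:max_len] + [pad_value] * (max_len - len(seq)) for seq in seqs]


# Category ids here start at 3, so 0 unambiguously means "padding".
sessions = [[3, 4], [5]]
padded = pad_sequences(sessions, max_len=3)
```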

Describe the solution you’d like

Add an argument to the Categorify() op to set the desired oov_index (to which all OOV values will be mapped during preprocessing) and another argument, start_index, to set the first value used to encode known values (e.g. 2). We could also have a similar option for null values (null_index), so that they are mapped to a known index. In that case, if we call Categorify(oov_index=1, null_index=2, start_index=3), index 0 could safely be used for padding, for example. I think we don’t need a padding_index, because it is specific to sequential features. It is better to give users the flexibility to reserve any number of categorical values by using start_index.
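The requested semantics can be sketched in plain Python as follows. This is only an illustration of the proposal under stated assumptions: fit_vocab and encode are hypothetical helpers rather than NVTabular functions, and the parameter names simply mirror those suggested in the request.

```python
# Hypothetical sketch of the proposed semantics; NOT NVTabular code.
def fit_vocab(values, start_index=3):
    """Give each distinct non-null value a consecutive id, beginning at start_index."""
    vocab = {}
    for v in values:
        if v is not None and v not in vocab:
            vocab[v] = start_index + len(vocab)
    return vocab


def encode(values, vocab, oov_index=1, null_index=2):
    """Map nulls to null_index, unseen values to oov_index, known values to their ids."""
    return [null_index if v is None else vocab.get(v, oov_index) for v in values]


vocab = fit_vocab(["a", "b", "a", None, "c"])       # ids 3, 4, 5 for a, b, c
encoded = encode(["a", "zzz", None, "c"], vocab)    # "zzz" was never seen in fit
```

With these settings no value is ever encoded as 0, so that index stays free for padding.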

Issue Analytics

Created 2 years ago
Comments: 8 (8 by maintainers)

Top GitHub Comments

benfred commented, Aug 31, 2021

I think we should simplify, and just specify a start_index with a default value of 0.

The implementation of Categorify has gotten pretty complex - so I thought it might help to put some pointers on where to get started with this:

There are two phases in Categorify: the ‘fit’, which calculates the mapping of input label to categorical id, and the ‘transform’, which applies the mapping. We’ll need to change both of these to handle this option.
For the ‘fit’ stage, we will need to store this option in the ‘FitOptions’ struct and then use it in the _write_uniques function https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L889-L904 .
The ‘transform’ code needs to be updated in the _encode function to set the unknown values to the start_index specified by the user. The simple case where we aren’t doing any OOV hashing will involve updating these lines https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L1110-L1116 , but this will get more complicated with the hash bucket code. It might be worth just trying to get this working without hashing first, and then tackling hashing next.
There is currently a ‘na_sentinel’ parameter on the Categorify op https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L146-L147 . Unfortunately I don’t think this works appropriately right now, and it should probably be removed as an option from Categorify.
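As a rough illustration of the two phases described above, here is a toy version in plain Python. It is a sketch of one possible reading of the simplified start_index idea, not the NVTabular implementation: ToyCategorify is a hypothetical class, it ignores hashing entirely, and it assumes that unknown and null values map to start_index while known values are numbered from start_index + 1.

```python
# Toy model of the fit/transform split; a hypothetical class, NOT NVTabular's Categorify.
class ToyCategorify:
    def __init__(self, start_index=0):
        # Assumption: start_index itself is the id used for unknown (and null) values.
        self.start_index = start_index
        self.mapping = {}

    def fit(self, values):
        """'fit' phase: compute the label -> id mapping from the unique non-null values."""
        uniques = sorted({v for v in values if v is not None})
        self.mapping = {v: self.start_index + 1 + i for i, v in enumerate(uniques)}
        return self

    def transform(self, values):
        """'transform' phase: apply the mapping; anything unmapped gets start_index."""
        return [self.mapping.get(v, self.start_index) for v in values]


op = ToyCategorify(start_index=2).fit(["b", "a", "b"])
ids = op.transform(["a", "b", "x", None])
```

With start_index=2, ids 0 and 1 are never produced, so a pipeline can reserve them for its own purposes such as padding.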

lesnikow commented, Sep 2, 2021

@gabrielspmoreira I did not see how to add you as a reviewer for my draft PR, but when you have a chance, would you be able to see whether the implemented tests test your intended functionality? This draft PR consists of the commit 642e5f503e33c8440d6a186c7a9a8c29cc33618b.
