[FEA] Add options to reserve categorical indices in the Categorify() op


Is your feature request related to a problem? Please describe.

Some pipelines rely on categorical features having certain reserved values after preprocessing. For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) on the model side, as not all sequences have the same length. It is also important to reserve a value (e.g. 1) to which Out-Of-Vocabulary (OOV) categorical values are mapped when they appear in transform() (e.g. during evaluation or inference).
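To illustrate the problem, here is a minimal sketch in plain Python (not NVTabular code). The `pad_sequences` helper and the `vocab_from_zero` mapping are hypothetical and exist only for this example. It shows what can go wrong when encoded ids start at 0 and 0 is also used for padding.

```python
def pad_sequences(sequences, max_len, pad_value=0):
    """Right-pad (or truncate) each sequence to max_len using pad_value."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded


# Hypothetical encoding that starts at 0, so a real item shares the padding id.
vocab_from_zero = {"item_a": 0, "item_b": 1, "item_c": 2}

sessions = [["item_a", "item_b"], ["item_c"]]
encoded = [[vocab_from_zero[item] for item in session] for session in sessions]

padded = pad_sequences(encoded, max_len=3)
print(padded)  # [[0, 1, 0], [2, 0, 0]]
```

In the first padded session the leading 0 is a real item (`item_a`) while the trailing 0 is padding, so the model cannot tell them apart.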

Describe the solution you’d like

Add one argument to the Categorify() op to set the desired oov_index (to which all OOV values will be mapped during preprocessing) and another argument, start_index, to set the first value that should be used to encode known values (e.g. 2). We could also have a similar option for null values (null_index), so that they are mapped to a known index. In that case, if we call Categorify(oov_index=1, null_index=2, start_index=3), index 0 could safely be used for padding, for example. I think we don’t need a padding_index, because it is specific to sequential features. It is better to give users the flexibility to reserve a number of categorical values by using start_index.
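The intended mapping can be sketched in plain Python as follows. This is only an illustration of the proposed semantics, not NVTabular's implementation: `fit_vocab` and `transform` are hypothetical helpers, and assigning ids in first-seen order is an assumption made for simplicity.

```python
def fit_vocab(values, start_index=0):
    """Assign ids to unique non-null values, in first-seen order, from start_index."""
    vocab = {}
    for value in values:
        if value is not None and value not in vocab:
            vocab[value] = start_index + len(vocab)
    return vocab


def transform(values, vocab, oov_index, null_index):
    """Encode values: nulls go to null_index, unseen values go to oov_index."""
    encoded = []
    for value in values:
        if value is None:
            encoded.append(null_index)
        else:
            encoded.append(vocab.get(value, oov_index))
    return encoded


# Mirrors the proposed Categorify(oov_index=1, null_index=2, start_index=3).
train_values = ["a", "b", None, "a", "c"]
vocab = fit_vocab(train_values, start_index=3)
print(vocab)  # {'a': 3, 'b': 4, 'c': 5}

# "z" was not seen during fit, so it is treated as out-of-vocabulary.
eval_values = ["a", "z", None, "c"]
print(transform(eval_values, vocab, oov_index=1, null_index=2))  # [3, 1, 2, 5]
```

With this layout no known, OOV, or null value is ever encoded as 0, so 0 stays free for padding.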

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
benfred commented, Aug 31, 2021

I think we should simplify, and just specify a start_index with a default value of 0.

The implementation of Categorify has gotten pretty complex - so I thought it might help to put some pointers on where to get started with this:

0 reactions
lesnikow commented, Sep 2, 2021

@gabrielspmoreira I did not see how to add you as a reviewer for my draft PR, but when you have a chance, could you check whether the tests I implemented cover your intended functionality? The draft PR consists of commit 642e5f503e33c8440d6a186c7a9a8c29cc33618b.

