[BUG] get_embedding_sizes errors with TargetEncoding
Describe the bug
`nvt.ops.get_embedding_sizes` raises a `KeyError` when an `nvt.ops.TargetEncoding` op is included in the pipeline. I believe this error occurs because `TargetEncoding.output_tags` returns `[Tags.CATEGORICAL]` instead of `[Tags.CONTINUOUS]`.
The error is:

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 get_embedding_sizes(processor)

File /usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py:609, in get_embedding_sizes(source, output_dtypes)
    606     multihot_columns.add(col_name)
    608 embeddings_sizes = col_schema.properties.get("embedding_sizes", {})
--> 609 cardinality = embeddings_sizes["cardinality"]
    610 dimensions = embeddings_sizes["dimension"]
    611 output[col_name] = (cardinality, dimensions)

KeyError: 'cardinality'
```
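The failure mode can be illustrated without NVTabular: the loop in `categorify.py` fetches each column's `embedding_sizes` property with a default of `{}` and then indexes into it unconditionally, so any column tagged categorical that never had sizes recorded (as with `TargetEncoding` outputs) hits a `KeyError`. A minimal stand-in sketch (the schema dicts and function names here are hypothetical, not NVTabular's API):

```python
# Hypothetical stand-in for column schemas: each column may or may not
# carry an "embedding_sizes" property, as in the traceback above.
schemas = {
    "cat1": {"embedding_sizes": {"cardinality": 101, "dimension": 16}},
    "TE_cat2_label1": {},  # TargetEncoding output: tagged categorical, no sizes
}

def get_embedding_sizes_buggy(schemas):
    output = {}
    for col, props in schemas.items():
        sizes = props.get("embedding_sizes", {})
        # Indexing the (possibly empty) dict directly raises KeyError,
        # mirroring line 609 of categorify.py.
        output[col] = (sizes["cardinality"], sizes["dimension"])
    return output

def get_embedding_sizes_fixed(schemas):
    # Skip columns that never recorded embedding sizes instead of crashing.
    output = {}
    for col, props in schemas.items():
        sizes = props.get("embedding_sizes")
        if sizes:
            output[col] = (sizes["cardinality"], sizes["dimension"])
    return output
```

The "fixed" variant simply ignores categorical columns without recorded sizes, which matches the maintainer's expectation below that such pipelines should return an empty dict rather than error.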
Steps/Code to reproduce bug

```python
import nvtabular as nvt
from nvtabular.ops import TargetEncoding, get_embedding_sizes, LambdaOp
from merlin.core import dispatch
import dask_cudf

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col > 0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2', 'cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32")
processor = nvt.Workflow(target_encode)

df = dispatch.make_df({
    'label1': list(range(100)),
    'label2': list(range(100)),
    'cat1': list(range(100)),
    'cat2': list(range(100)),
    'cat3': list(range(100)),
})
gdf = nvt.Dataset(dask_cudf.from_cudf(df, npartitions=1))
processor.fit(gdf)
get_embedding_sizes(processor)
```
Expected behavior
Calling `get_embedding_sizes` on a pipeline that includes a `TargetEncoding` op should succeed without raising an error.
Environment details (please complete the following information):
- Environment location: bare metal
- Method of NVTabular install: Docker (nvcr.io/nvidia/merlin/merlin-pytorch:22.07)
Additional context
Same issue as #1359.
Issue Analytics
- State:
- Created: a year ago
- Comments: 9 (5 by maintainers)
Top GitHub Comments
@maxmealy hello, thanks for reporting the issue. We will look into that. As a workaround, if you add a `Normalize()` op at the end of the TargetEncoding line, `get_embedding_sizes(processor)` returns an empty dictionary, which is expected because you do not have any `Categorify` op in your pipeline. You can do that, OR, if you want `get_embedding_sizes` to return something, you can do something like that:
Hope this helps.
@TSienki you need `Categorify` to be able to generate embedding sizes. If you do the following, you will get the embedding dim sizes. I am going to open a new bug ticket to get a fix for `Bucketize` to return an empty `{}` instead of an error.
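For context on what `get_embedding_sizes` returns for `Categorify` columns: the `(cardinality, dimension)` pair is derived from the column's cardinality via a fastai-style rule of thumb. A sketch of that rule, assuming the constant `1.6`, exponent `0.56`, and a `[16, 512]` clamp (treat these exact values as an assumption about `categorify.py`, not a guaranteed match):

```python
def embedding_size(cardinality, minimum=16, maximum=512):
    # Embedding dimension grows sublinearly with cardinality,
    # clamped to the [minimum, maximum] range.
    dimension = min(max(minimum, round(1.6 * cardinality ** 0.56)), maximum)
    return cardinality, dimension

# e.g. a 100-category column gets a modest embedding width
print(embedding_size(100))  # -> (100, 21)
```

This is why the result dict is keyed by column name with `(cardinality, dimension)` tuples as values, as seen in the traceback's `output[col_name] = (cardinality, dimensions)` line.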