
[BUG] get_embedding_sizes errors with TargetEncoding


Describe the bug

nvt.ops.get_embedding_sizes raises a KeyError when an nvt.ops.TargetEncoding op is included in the pipeline. I believe this happens because TargetEncoding.output_tags returns [Tags.CATEGORICAL] instead of [Tags.CONTINUOUS].

The error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 get_embedding_sizes(processor)

File /usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py:609, in get_embedding_sizes(source, output_dtypes)
    606     multihot_columns.add(col_name)
    608 embeddings_sizes = col_schema.properties.get("embedding_sizes", {})
--> 609 cardinality = embeddings_sizes["cardinality"]
    610 dimensions = embeddings_sizes["dimension"]
    611 output[col_name] = (cardinality, dimensions)

KeyError: 'cardinality'
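The failing lookup can be reduced to a few lines. Below is a minimal stand-in (plain dicts, not NVTabular's real schema objects) for the code path in categorify.py shown in the traceback: a target-encoded column carries no "embedding_sizes" entry in its schema properties, so `.get(...)` falls back to an empty dict and the subsequent "cardinality" index raises KeyError.

```python
# Stand-in for the lookup at categorify.py:609. A TargetEncoding output
# column has no "embedding_sizes" entry in its schema properties, so the
# fallback {} is returned and indexing "cardinality" fails.
col_properties = {}  # what a target-encoded column's properties amount to here

embedding_sizes = col_properties.get("embedding_sizes", {})
try:
    cardinality = embedding_sizes["cardinality"]
except KeyError as err:
    print(f"KeyError: {err}")  # prints: KeyError: 'cardinality'
```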

Steps/Code to reproduce bug

import nvtabular as nvt
from nvtabular.ops import TargetEncoding, get_embedding_sizes, LambdaOp
from merlin.core import dispatch 
import dask_cudf

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col>0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2','cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32")
processor = nvt.Workflow(target_encode)

df = dispatch.make_df({
    'label1': list(range(100)),
    'label2': list(range(100)),
    'cat1': list(range(100)),
    'cat2': list(range(100)),
    'cat3': list(range(100))
})
gdf = nvt.Dataset(dask_cudf.from_cudf(df, npartitions=1))
processor.fit(gdf)

get_embedding_sizes(processor)

Expected behavior

Calling get_embedding_sizes on a pipeline that includes a TargetEncoding op should succeed without an error.

Environment details (please complete the following information):

  • Environment location: bare metal
  • Method of NVTabular install: Docker (nvcr.io/nvidia/merlin/merlin-pytorch:22.07)

Additional context

Same issue as #1359

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
rnyak commented, Aug 31, 2022

@maxmealy hello, thanks for reporting the issue. We will look into it. As a workaround, if you add a Normalize() op at the end of the TargetEncoding line, get_embedding_sizes(processor) returns an empty dictionary, which is expected because you do not have any Categorify op in your pipeline. You can do that like this:


from nvtabular.ops import Normalize  # needed in addition to the imports above

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col > 0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2', 'cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32") >> Normalize()
processor = nvt.Workflow(target_encode)

dataset = nvt.Dataset(df)
processed = processor.fit(dataset)
get_embedding_sizes(processor)

OR, if you want get_embedding_sizes to return something, you can do something like this:

from nvtabular.ops import Categorify, Normalize  # needed in addition to the imports above

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col > 0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2', 'cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32") >> Normalize()

cat_cols = ['cat1', 'cat2', 'cat3'] >> Categorify()

processor = nvt.Workflow(cat_cols + target_encode)

dataset = nvt.Dataset(df)
processed = processor.fit(dataset)
get_embedding_sizes(processor)

Hope this helps.
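Why the Normalize() workaround yields an empty dict can be sketched with plain dicts (names here are illustrative, not NVTabular's API): the embedding-size lookup only visits columns tagged categorical, and appending Normalize() retags the target-encoded outputs as continuous, so no column ever reaches the "cardinality" lookup that raised the KeyError in the traceback.

```python
# Hedged sketch of the tag filter in get_embedding_sizes: only columns
# tagged "categorical" are considered; continuous columns are skipped.
def embedding_sizes_sketch(schema):
    out = {}
    for name, col in schema.items():
        if "categorical" not in col["tags"]:
            continue  # continuous columns never hit the embedding lookup
        sizes = col["properties"]["embedding_sizes"]
        out[name] = (sizes["cardinality"], sizes["dimension"])
    return out

# After TargetEncoding >> Normalize(), every output column is continuous:
print(embedding_sizes_sketch({
    "TE_cat1_label1": {"tags": {"continuous"}, "properties": {}},
}))  # prints: {}
```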

0 reactions
rnyak commented, Nov 30, 2022

@TSienki you need Categorify to be able to generate embedding sizes. If you do the following, you will get the embedding dimension sizes. I am going to open a new bug ticket so that Bucketize returns an empty {} instead of raising an error.

gdf = nvt.Dataset(df)

buckets = [1, 5, 20, 50]
after_bucketize = ["to_bucket"] >> nvt.ops.Bucketize({"to_bucket": buckets})
after_bucketize_cat = after_bucketize >> nvt.ops.Categorify()
processor = nvt.Workflow(after_bucketize_cat)

gdf_transformed = processor.fit_transform(gdf)
nvt.ops.get_embedding_sizes(processor)
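For context on what get_embedding_sizes returns once Categorify is in the pipeline: it maps each categorical column to a (cardinality, dimension) pair, where the dimension is derived from the cardinality recorded at fit time. The formula below is a fastai-style rule of thumb, included only as an assumption for illustration, not NVTabular's guaranteed implementation.

```python
# Illustrative heuristic (assumption, not NVTabular's exact formula):
# map a column's cardinality to a (cardinality, embedding_dim) pair,
# growing the dimension sub-linearly and capping it.
def emb_sz_rule(n_cat):
    return n_cat, min(600, round(1.6 * n_cat ** 0.56))

for n_cat in (4, 100, 100_000):
    print(n_cat, emb_sz_rule(n_cat))
```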
