
[BUG] get_embedding_sizes errors with TargetEncoding


Describe the bug

nvt.ops.get_embedding_sizes raises a KeyError when an nvt.ops.TargetEncoding op is included in the pipeline. I believe this happens because TargetEncoding.output_tags returns [Tags.CATEGORICAL] instead of [Tags.CONTINUOUS].

The error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 get_embedding_sizes(processor)

File /usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py:609, in get_embedding_sizes(source, output_dtypes)
    606     multihot_columns.add(col_name)
    608 embeddings_sizes = col_schema.properties.get("embedding_sizes", {})
--> 609 cardinality = embeddings_sizes["cardinality"]
    610 dimensions = embeddings_sizes["dimension"]
    611 output[col_name] = (cardinality, dimensions)

KeyError: 'cardinality'
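The failing lookup can be reduced to a few lines. Below is a minimal stand-in (plain dicts, not NVTabular's real schema objects) for the code path in categorify.py shown in the traceback: a target-encoded column carries no "embedding_sizes" entry in its schema properties, so `.get(...)` falls back to an empty dict and the subsequent "cardinality" index raises KeyError.

```python
# Stand-in for the lookup at categorify.py:609. A TargetEncoding output
# column has no "embedding_sizes" entry in its schema properties, so the
# fallback {} is returned and indexing "cardinality" fails.
col_properties = {}  # what a target-encoded column's properties amount to here

embedding_sizes = col_properties.get("embedding_sizes", {})
try:
    cardinality = embedding_sizes["cardinality"]
except KeyError as err:
    print(f"KeyError: {err}")  # prints: KeyError: 'cardinality'
```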

Steps/Code to reproduce bug

import nvtabular as nvt
from nvtabular.ops import TargetEncoding, get_embedding_sizes, LambdaOp
from merlin.core import dispatch 
import dask_cudf

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col>0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2','cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32")
processor = nvt.Workflow(target_encode)

df = dispatch.make_df({
    'label1': list(range(100)),
    'label2': list(range(100)),
    'cat1': list(range(100)),
    'cat2': list(range(100)),
    'cat3': list(range(100))
})
gdf = nvt.Dataset(dask_cudf.from_cudf(df, npartitions=1))
processor.fit(gdf)

get_embedding_sizes(processor)

Expected behavior

Calling get_embedding_sizes on a pipeline that includes a TargetEncoding op should succeed without an error.

Environment details (please complete the following information):

  • Environment location: bare metal
  • Method of NVTabular install: Docker (nvcr.io/nvidia/merlin/merlin-pytorch:22.07)

Additional context

Same issue as #1359

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
rnyak commented, Aug 31, 2022

@maxmealy hello, thanks for reporting the issue. We will look into it. As a workaround, if you add a Normalize() op at the end of the TargetEncoding line, get_embedding_sizes(processor) returns an empty dictionary, which is expected because you do not have any Categorify op in your pipeline. You can do that like this:


from nvtabular.ops import Normalize  # needed in addition to the imports above

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col > 0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2', 'cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32") >> Normalize()
processor = nvt.Workflow(target_encode)

dataset = nvt.Dataset(df)
processed = processor.fit(dataset)
get_embedding_sizes(processor)

OR, if you want get_embedding_sizes to return something, you can do something like this:

from nvtabular.ops import Categorify, Normalize  # needed in addition to the imports above

LABEL_COLUMNS = ['label1', 'label2']
labels = ['label1', 'label2'] >> LambdaOp(lambda col: (col > 0).astype('int8'))
target_encode = ['cat1', 'cat2', ['cat2', 'cat3']] >> TargetEncoding(labels, kfold=5, out_dtype="float32") >> Normalize()

cat_cols = ['cat1', 'cat2', 'cat3'] >> Categorify()

processor = nvt.Workflow(cat_cols + target_encode)

dataset = nvt.Dataset(df)
processed = processor.fit(dataset)
get_embedding_sizes(processor)

Hope this helps.
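Why the Normalize() workaround yields an empty dict can be sketched with plain dicts (names here are illustrative, not NVTabular's API): the embedding-size lookup only visits columns tagged categorical, and appending Normalize() retags the target-encoded outputs as continuous, so no column ever reaches the "cardinality" lookup that raised the KeyError in the traceback.

```python
# Hedged sketch of the tag filter in get_embedding_sizes: only columns
# tagged "categorical" are considered; continuous columns are skipped.
def embedding_sizes_sketch(schema):
    out = {}
    for name, col in schema.items():
        if "categorical" not in col["tags"]:
            continue  # continuous columns never hit the embedding lookup
        sizes = col["properties"]["embedding_sizes"]
        out[name] = (sizes["cardinality"], sizes["dimension"])
    return out

# After TargetEncoding >> Normalize(), every output column is continuous:
print(embedding_sizes_sketch({
    "TE_cat1_label1": {"tags": {"continuous"}, "properties": {}},
}))  # prints: {}
```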

0 reactions
rnyak commented, Nov 30, 2022

@TSienki you need Categorify to be able to generate embedding sizes. If you do the following, you will get the embedding dimension sizes. I am going to open a new bug ticket so that Bucketize returns an empty {} instead of raising an error.

gdf = nvt.Dataset(df)

buckets = [1, 5, 20, 50]
after_bucketize = ["to_bucket"] >> nvt.ops.Bucketize({"to_bucket": buckets})
after_bucketize_cat = after_bucketize >> nvt.ops.Categorify()
processor = nvt.Workflow(after_bucketize_cat)

gdf_transformed = processor.fit_transform(gdf)
nvt.ops.get_embedding_sizes(processor)
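For context on what get_embedding_sizes returns once Categorify is in the pipeline: it maps each categorical column to a (cardinality, dimension) pair, where the dimension is derived from the cardinality recorded at fit time. The formula below is a fastai-style rule of thumb, included only as an assumption for illustration, not NVTabular's guaranteed implementation.

```python
# Illustrative heuristic (assumption, not NVTabular's exact formula):
# map a column's cardinality to a (cardinality, embedding_dim) pair,
# growing the dimension sub-linearly and capping it.
def emb_sz_rule(n_cat):
    return n_cat, min(600, round(1.6 * n_cat ** 0.56))

for n_cat in (4, 100, 100_000):
    print(n_cat, emb_sz_rule(n_cat))
```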
