[BUG] Categorify problems when using num_buckets
I am updating the Criteo-HugeCTR notebook example to use the new API: https://github.com/NVIDIA/NVTabular/blob/764b8960f8bb66ec98d785497f62c50c1457efc8/examples/hugectr/criteo-hugectr.ipynb
Previously, for the categorical columns I was using:

```python
proc.add_cat_preprocess(
    [ops.Categorify(out_path=stats_path),
     ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)]
)
```
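For context, roughly how `proc` was built with the old workflow API (a sketch; the column lists, `stats_path`, and `num_buckets` are placeholders from my setup):

```python
import nvtabular as nvt
from nvtabular import ops

# Old-style workflow: column roles are declared up front, ops are added afterwards.
proc = nvt.Workflow(
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_name=["label"],
)
proc.add_cat_preprocess(
    [ops.Categorify(out_path=stats_path),
     ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)]
)
```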
With the new API I run into two problems:

1. If I try to use `LambdaOp` with the new API:

```python
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path) >> LambdaOp(f=lambda col: col % num_buckets)
```

I get an empty dict when I try to get the embedding sizes with `get_embedding_sizes(workflow)` (full repro sketch below).
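A minimal sketch of the full repro (dataset paths and column lists are placeholders, not from the notebook):

```python
import nvtabular as nvt
from nvtabular.ops import Categorify, LambdaOp, get_embedding_sizes

cat_features = (
    CATEGORICAL_COLUMNS
    >> Categorify(out_path=stats_path)
    >> LambdaOp(f=lambda col: col % num_buckets)
)
workflow = nvt.Workflow(cat_features + CONTINUOUS_COLUMNS + ["label"])
workflow.fit(nvt.Dataset(train_paths, engine="parquet"))

# Expected per-column (cardinality, embedding_dim) pairs; instead this prints {}.
print(get_embedding_sizes(workflow))
```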
2. If I try to use `num_buckets` within `Categorify`:

```python
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=num_buckets)
```

I get an error because `num_buckets` is an int (it seems to be a bug: the int case is not handled):
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-671dd4884770> in <module>
----> 1 embeddings = get_embedding_sizes(workflow)
      2 print(embeddings)

/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(workflow)
    387         current = queue.pop()
    388         if current.op and hasattr(current.op, "get_embedding_sizes"):
--> 389             output.update(current.op.get_embedding_sizes(current.columns))
    390
    391         if hasattr(current.op, "get_multihot_columns"):

/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(self, columns)
    357
    358     def get_embedding_sizes(self, columns):
--> 359         return _get_embeddings_dask(self.categories, columns, self.num_buckets, self.freq_threshold)
    360
    361     def get_multihot_columns(self):

/NVTA/nvtabular/ops/categorify.py in _get_embeddings_dask(paths, cat_names, buckets, freq_limit)
    414         path = paths.get(col)
    415         num_rows = cudf.io.read_parquet_metadata(path)[0] if path else 0
--> 416         if buckets and col in buckets and freq_limit[col] > 1:
    417             num_rows += buckets.get(col, 0)
    418         if buckets and col in buckets and not freq_limit[col]:

TypeError: argument of type 'int' is not iterable
```
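For what it's worth, the int case could presumably be handled by normalizing `buckets` to a per-column dict before the lookup; a sketch of the idea (not the actual NVTabular code):

```python
def _normalize_buckets(buckets, cat_names):
    # An int means "use the same number of hash buckets for every column";
    # normalize it to the per-column dict the lookup code expects.
    if isinstance(buckets, int):
        return {name: buckets for name in cat_names}
    return buckets or {}
```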
As a workaround for 2, I can build a dictionary specifying `num_buckets` for every column. This works, but the embedding sizes are not what I was expecting (output after the sketch below):
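A sketch of that workaround (the column list is a placeholder):

```python
# One bucket count per categorical column, passed instead of a plain int.
buckets = {col: num_buckets for col in CATEGORICAL_COLUMNS}
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=buckets)
```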
```python
{'C1': (10000000, 512),
 'C10': (10000000, 512),
 'C11': (10000000, 512),
 'C12': (10000000, 512),
 'C13': (10000000, 512),
 'C14': (10000000, 512),
 'C15': (10000000, 512),
 'C16': (10000000, 512),
 'C17': (10000000, 512),
 'C18': (10000000, 512),
 'C19': (10000000, 512),
 'C2': (10000000, 512),
 'C20': (10000000, 512),
 'C21': (10000000, 512),
 'C22': (10000000, 512),
 'C23': (10000000, 512),
 'C24': (10000000, 512),
 'C25': (10000000, 512),
 'C26': (10000000, 512),
 'C3': (10000000, 512),
 'C4': (10000000, 512),
 'C5': (10000000, 512),
 'C6': (10000000, 512),
 'C7': (10000000, 512),
 'C8': (10000000, 512),
 'C9': (10000000, 512)}
```
It seems that `num_buckets` being 10M replaces the cardinality for every column. However, I was expecting that only the columns whose cardinality is greater than `num_buckets` would be capped, while the rest would keep their original cardinality. That is the behavior I get with the old API and Categorify + LambdaOp.
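To make the expectation concrete, a hypothetical helper (not NVTabular code) describing the per-column cardinality I expected:

```python
def expected_cardinality(cardinality: int, num_buckets: int) -> int:
    # Cap only the columns whose cardinality exceeds num_buckets;
    # smaller columns keep their original cardinality.
    return min(cardinality, num_buckets)
```

With that rule, a column with 300 categories would keep 300, while one with 40M categories would be capped at 10M.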
Top GitHub Comments
This is how I was using it before
And I was getting the embedding sizes like this
And the result looked as expected
I have tried using HashBucket, and I am getting the same result as with Categorify (sketch of what I tried below).
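Roughly what I tried (a sketch; the column list and bucket count are placeholders):

```python
from nvtabular.ops import HashBucket

# Hash every categorical column into num_buckets buckets instead of using Categorify.
cat_features = CATEGORICAL_COLUMNS >> HashBucket(num_buckets)
```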