[BUG] Categorify problems when using num_buckets
I am updating the Criteo-HugeCTR notebook example to use the new API: https://github.com/NVIDIA/NVTabular/blob/764b8960f8bb66ec98d785497f62c50c1457efc8/examples/hugectr/criteo-hugectr.ipynb
Previously, for the categorical columns I was using:

```python
proc.add_cat_preprocess(
    [ops.Categorify(out_path=stats_path),
     ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)]
)
```
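For context, roughly how `proc` was built with the old workflow API (a sketch; the column lists, `stats_path`, and `num_buckets` are placeholders from my setup):

```python
import nvtabular as nvt
from nvtabular import ops

# Old-style workflow: column roles are declared up front, ops are added afterwards.
proc = nvt.Workflow(
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_name=["label"],
)
proc.add_cat_preprocess(
    [ops.Categorify(out_path=stats_path),
     ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)]
)
```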
With the new API I run into two problems:

1. If I try to use `LambdaOp` with the new API:

```python
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path) >> LambdaOp(f=lambda col: col % num_buckets)
```

I get an empty dict when I try to get the embedding sizes with `get_embedding_sizes(workflow)` (full repro sketch below).
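A minimal sketch of the full repro (dataset paths and column lists are placeholders, not from the notebook):

```python
import nvtabular as nvt
from nvtabular.ops import Categorify, LambdaOp, get_embedding_sizes

cat_features = (
    CATEGORICAL_COLUMNS
    >> Categorify(out_path=stats_path)
    >> LambdaOp(f=lambda col: col % num_buckets)
)
workflow = nvt.Workflow(cat_features + CONTINUOUS_COLUMNS + ["label"])
workflow.fit(nvt.Dataset(train_paths, engine="parquet"))

# Expected per-column (cardinality, embedding_dim) pairs; instead this prints {}.
print(get_embedding_sizes(workflow))
```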
2. If I try to use `num_buckets` within `Categorify`:

```python
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=num_buckets)
```

I get an error because `num_buckets` is an int (it seems to be a bug: the int case is not handled):
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-671dd4884770> in <module>
----> 1 embeddings = get_embedding_sizes(workflow)
      2 print(embeddings)

/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(workflow)
    387         current = queue.pop()
    388         if current.op and hasattr(current.op, "get_embedding_sizes"):
--> 389             output.update(current.op.get_embedding_sizes(current.columns))
    390
    391         if hasattr(current.op, "get_multihot_columns"):

/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(self, columns)
    357
    358     def get_embedding_sizes(self, columns):
--> 359         return _get_embeddings_dask(self.categories, columns, self.num_buckets, self.freq_threshold)
    360
    361     def get_multihot_columns(self):

/NVTA/nvtabular/ops/categorify.py in _get_embeddings_dask(paths, cat_names, buckets, freq_limit)
    414         path = paths.get(col)
    415         num_rows = cudf.io.read_parquet_metadata(path)[0] if path else 0
--> 416         if buckets and col in buckets and freq_limit[col] > 1:
    417             num_rows += buckets.get(col, 0)
    418         if buckets and col in buckets and not freq_limit[col]:

TypeError: argument of type 'int' is not iterable
```
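For what it's worth, the int case could presumably be handled by normalizing `buckets` to a per-column dict before the lookup; a sketch of the idea (not the actual NVTabular code):

```python
def _normalize_buckets(buckets, cat_names):
    # An int means "use the same number of hash buckets for every column";
    # normalize it to the per-column dict the lookup code expects.
    if isinstance(buckets, int):
        return {name: buckets for name in cat_names}
    return buckets or {}
```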
As a workaround for 2, I can build a dictionary specifying `num_buckets` for every column. This works, but the embedding sizes are not what I was expecting (output after the sketch below):
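A sketch of that workaround (the column list is a placeholder):

```python
# One bucket count per categorical column, passed instead of a plain int.
buckets = {col: num_buckets for col in CATEGORICAL_COLUMNS}
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=buckets)
```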
```python
{'C1': (10000000, 512),
 'C10': (10000000, 512),
 'C11': (10000000, 512),
 'C12': (10000000, 512),
 'C13': (10000000, 512),
 'C14': (10000000, 512),
 'C15': (10000000, 512),
 'C16': (10000000, 512),
 'C17': (10000000, 512),
 'C18': (10000000, 512),
 'C19': (10000000, 512),
 'C2': (10000000, 512),
 'C20': (10000000, 512),
 'C21': (10000000, 512),
 'C22': (10000000, 512),
 'C23': (10000000, 512),
 'C24': (10000000, 512),
 'C25': (10000000, 512),
 'C26': (10000000, 512),
 'C3': (10000000, 512),
 'C4': (10000000, 512),
 'C5': (10000000, 512),
 'C6': (10000000, 512),
 'C7': (10000000, 512),
 'C8': (10000000, 512),
 'C9': (10000000, 512)}
```
It seems that `num_buckets` being 10M replaces the cardinality for every column. However, I was expecting that only the columns whose cardinality is greater than `num_buckets` would be capped, while the rest would keep their original cardinality. That is the behavior I get with the old API and Categorify + LambdaOp.
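To make the expectation concrete, a hypothetical helper (not NVTabular code) describing the per-column cardinality I expected:

```python
def expected_cardinality(cardinality: int, num_buckets: int) -> int:
    # Cap only the columns whose cardinality exceeds num_buckets;
    # smaller columns keep their original cardinality.
    return min(cardinality, num_buckets)
```

With that rule, a column with 300 categories would keep 300, while one with 40M categories would be capped at 10M.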
Top GitHub Comments
This is how I was using it before
And I was getting the embedding sizes like this
And the result looked as expected
I have tried using HashBucket, and I am getting the same result as with Categorify (sketch of what I tried below).
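Roughly what I tried (a sketch; the column list and bucket count are placeholders):

```python
from nvtabular.ops import HashBucket

# Hash every categorical column into num_buckets buckets instead of using Categorify.
cat_features = CATEGORICAL_COLUMNS >> HashBucket(num_buckets)
```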