
[BUG] Categorify problems when using num_buckets


@rnyak @benfred

I am updating the Criteo-HugeCTR Notebook example to use the new API: https://github.com/NVIDIA/NVTabular/blob/764b8960f8bb66ec98d785497f62c50c1457efc8/examples/hugectr/criteo-hugectr.ipynb

Previously, I was applying proc.add_cat_preprocess([ops.Categorify(out_path=stats_path), ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)]) to the categorical columns.

  1. If I try to use LambdaOp with the new API, cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path) >> LambdaOp(f=lambda col: col % num_buckets), I get an empty dict when I query the embedding sizes with get_embedding_sizes(workflow).

  2. If I try to use num_buckets within Categorify, cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=num_buckets), I get an error because num_buckets is an int (it looks like a bug: the int case is not handled; see the minimal repro after the traceback below).

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-671dd4884770> in <module>
----> 1 embeddings = get_embedding_sizes(workflow)
      2 print(embeddings)
/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(workflow)
    387         current = queue.pop()
    388         if current.op and hasattr(current.op, "get_embedding_sizes"):
--> 389             output.update(current.op.get_embedding_sizes(current.columns))
    390 
    391             if hasattr(current.op, "get_multihot_columns"):
/NVTA/nvtabular/ops/categorify.py in get_embedding_sizes(self, columns)
    357 
    358     def get_embedding_sizes(self, columns):
--> 359         return _get_embeddings_dask(self.categories, columns, self.num_buckets, self.freq_threshold)
    360 
    361     def get_multihot_columns(self):
/NVTA/nvtabular/ops/categorify.py in _get_embeddings_dask(paths, cat_names, buckets, freq_limit)
    414         path = paths.get(col)
    415         num_rows = cudf.io.read_parquet_metadata(path)[0] if path else 0
--> 416         if buckets and col in buckets and freq_limit[col] > 1:
    417             num_rows += buckets.get(col, 0)
    418         if buckets and col in buckets and not freq_limit[col]:
TypeError: argument of type 'int' is not iterable
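
The failure boils down to a membership test against an int. Here is a minimal standalone repro of the failing check (simplified from _get_embeddings_dask; not the actual NVTabular code):

# Minimal repro of the failing membership test: `buckets` arrives as a
# plain int instead of a {column_name: n_buckets} dict.
buckets = 10000000
col = "C1"
col in buckets  # raises TypeError: argument of type 'int' is not iterable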
  3. As a workaround for 2, I can build a dictionary specifying the num_buckets for every column (see the sketch after the output below). This works, but the embedding sizes are not what I was expecting:
{'C1': (10000000, 512),
 'C10': (10000000, 512),
 'C11': (10000000, 512),
 'C12': (10000000, 512),
 'C13': (10000000, 512),
 'C14': (10000000, 512),
 'C15': (10000000, 512),
 'C16': (10000000, 512),
 'C17': (10000000, 512),
 'C18': (10000000, 512),
 'C19': (10000000, 512),
 'C2': (10000000, 512),
 'C20': (10000000, 512),
 'C21': (10000000, 512),
 'C22': (10000000, 512),
 'C23': (10000000, 512),
 'C24': (10000000, 512),
 'C25': (10000000, 512),
 'C26': (10000000, 512),
 'C3': (10000000, 512),
 'C4': (10000000, 512),
 'C5': (10000000, 512),
 'C6': (10000000, 512),
 'C7': (10000000, 512),
 'C8': (10000000, 512),
 'C9': (10000000, 512)}
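
For reference, a sketch of that dict workaround, assuming CATEGORICAL_COLUMNS is the ["C1", ..., "C26"] list from the Criteo notebook:

# Workaround for 2: pass num_buckets as an explicit per-column dict so
# the broken int code path is never hit (CATEGORICAL_COLUMNS assumed
# from the Criteo notebook).
num_buckets = 10000000
buckets = {col: num_buckets for col in CATEGORICAL_COLUMNS}
cat_features = CATEGORICAL_COLUMNS >> Categorify(out_path=stats_path, num_buckets=buckets)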

It seems that num_buckets being 10M replaces the cardinality of every column. However, I was expecting that only the columns whose cardinality is greater than num_buckets would be limited, and the rest would keep their original cardinality. This is the behavior I get with the old API and Categorify + LambdaOp.
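
In other words, the sizing I expected is a per-column clip, roughly like this sketch (where cardinality is a hypothetical {column: count} mapping of true category counts):

# Expected behavior (sketch): num_buckets should only cap columns whose
# cardinality exceeds it; smaller columns keep their true cardinality.
expected_rows = {col: min(cardinality[col], num_buckets) for col in CATEGORICAL_COLUMNS}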

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
albert17 commented on Feb 1, 2021

This is how I was using it before:

# Old API: Categorify encodes, then the MOD10M LambdaOp applies the modulo
num_buckets = 10000000
proc.add_cat_preprocess([ops.Categorify(out_path=stats_path), ops.LambdaOp(op_name="MOD10M", f=lambda col: col % num_buckets)])

And I was getting the embedding sizes like this:

import numpy as np

# Take the row count from each (cardinality, embedding_dim) pair and
# clip it at num_buckets to mirror the MOD10M LambdaOp
embeddings = [c[0] for c in ops.get_embedding_sizes(workflow).values()]
embeddings = np.clip(a=embeddings, a_min=None, a_max=num_buckets).tolist()
print(embeddings)

And the result looked as expected:


[10000000, 10000000, 3014529, 400781, 11, 2209, 11869, 148, 4, 977, 15, 38713, 10000000, 10000000, 10000000, 584616, 12883, 109, 37, 17177, 7425, 20266, 4, 7085, 1535, 64]
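
Note that only the high-cardinality columns are clipped to 10000000 here; the lower-cardinality columns keep their true sizes, which is exactly the per-column behavior missing from the new API.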
1 reaction
albert17 commented on Feb 1, 2021

I have tried using HashBucket, and I am getting the same result as with Categorify (see the sketch after the output below).

{'C1': (10000000, 512),
 'C10': (10000000, 512),
 'C11': (10000000, 512),
 'C12': (10000000, 512),
 'C13': (10000000, 512),
 'C14': (10000000, 512),
 'C15': (10000000, 512),
 'C16': (10000000, 512),
 'C17': (10000000, 512),
 'C18': (10000000, 512),
 'C19': (10000000, 512),
 'C2': (10000000, 512),
 'C20': (10000000, 512),
 'C21': (10000000, 512),
 'C22': (10000000, 512),
 'C23': (10000000, 512),
 'C24': (10000000, 512),
 'C25': (10000000, 512),
 'C26': (10000000, 512),
 'C3': (10000000, 512),
 'C4': (10000000, 512),
 'C5': (10000000, 512),
 'C6': (10000000, 512),
 'C7': (10000000, 512),
 'C8': (10000000, 512),
 'C9': (10000000, 512)}
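
For completeness, a sketch of the HashBucket variant being described; this is hypothetical, assuming the op is dropped in where the Categorify line was, with the same num_buckets:

# Same modulo-style capping via hashing instead of Categorify's buckets
# (sketch; usage assumed to mirror the Categorify line above)
from nvtabular.ops import HashBucket

cat_features = CATEGORICAL_COLUMNS >> HashBucket(num_buckets=num_buckets)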