[BUG] record_stats=False not working as expected

Describe the bug
Setting record_stats=False corrupts the validation dataset. However, it works fine when I set record_stats=True on the validation dataset.

Steps/Code to reproduce bug

Workflow Processor: [screenshot: workflow_processor]

Train dataset record_stats=True and Validation dataset record_stats=False -> as you can see CumCount Max is 896 instead of 299.

[screenshot: 1_train_true_valid_false]

Train dataset record_stats=True and Validation dataset record_stats=True -> as you can see CumCount Max is now 299.

Expected behavior
I expect the output below (CumCount Max of 299) when record_stats is set to False on the validation dataset.

[screenshot: 2_train_true_valid_true]

Environment details (please complete the following information):

Installed version 0.2 from PyPI (https://pypi.org/project/nvtabular/) and am using it with RAPIDS 0.16 because I need pivot().

Issue Analytics

Created 3 years ago
Comments: 7 (1 by maintainers)

Top GitHub Comments

rjzamora commented, Nov 3, 2020 (1 reaction)

Perhaps it would be reasonable to add a Workflow parameter to specify a list of columns that should pass through NVTabular unchanged. Thoughts on this, @benfred? I’m honestly unsure how often a feature like this would be used.

rjzamora commented, Nov 3, 2020 (1 reaction)

"Is there a better way to pass down id column untouched via NVTabular?"

Rather than allowing Categorify to act on all categorical columns (the default), you can specify a subset with Categorify(columns=<your-list>).
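
To make the suggestion concrete, here is a small pure-Python sketch of the idea. It is not NVTabular code and does not call the library. The helpers fit_categories and apply_categories are hypothetical stand-ins for "record stats on the training set" and "apply the recorded stats to the validation set". It also assumes that unseen values are mapped to code 0, which is one plausible reading of why an id column could collapse when the validation set is encoded with training statistics. The thread itself does not confirm the root cause.

```python
from typing import Dict, List

Row = Dict[str, object]


def fit_categories(rows: List[Row], columns: List[str]) -> Dict[str, Dict[object, int]]:
    """Hypothetical 'record stats' step: build a value -> integer code
    mapping for each selected column. Code 0 is reserved for unseen values
    (an assumption made for this illustration)."""
    mappings = {}
    for col in columns:
        uniques = sorted({row[col] for row in rows})
        mappings[col] = {value: code for code, value in enumerate(uniques, start=1)}
    return mappings


def apply_categories(rows: List[Row], mappings: Dict[str, Dict[object, int]]) -> List[Row]:
    """Hypothetical 'apply without recording stats' step: encode only the
    columns that have a mapping; every other column passes through unchanged."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col, mapping in mappings.items():
            new_row[col] = mapping.get(row[col], 0)  # unseen value -> 0
        encoded.append(new_row)
    return encoded


train = [
    {"id": "a", "color": "red"},
    {"id": "b", "color": "blue"},
]
valid = [
    {"id": "x", "color": "red"},
    {"id": "y", "color": "blue"},
    {"id": "y", "color": "green"},
]

# Encoding every column with training stats: validation ids that never
# appeared in training all collapse into the same code.
all_columns = fit_categories(train, ["id", "color"])
collapsed = apply_categories(valid, all_columns)

# Encoding only a subset of columns: the id column is left untouched.
subset_only = fit_categories(train, ["color"])
preserved = apply_categories(valid, subset_only)

print([row["id"] for row in collapsed])   # [0, 0, 0]
print([row["id"] for row in preserved])   # ['x', 'y', 'y']
```

In this toy setup, restricting the encoded columns plays the same role as passing an explicit list to Categorify(columns=<your-list>): grouping on the id column afterwards still sees the original ids rather than a single merged code.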