[BUG] Rossmann notebook still broken from recent sweeping changes to NVTabular
After the recent dataset changes / dask API compatibility changes, and after the recent fix to the Rossmann notebook (https://github.com/NVIDIA/NVTabular/pull/140), which made the notebook run without runtime errors, convergence and final results are still much worse than they used to be.
This is true for both the TensorFlow and fast.ai implementations, although the fast.ai implementation is affected more severely.
It’s difficult to be more specific because I do not know the root cause, but it seems worth reporting so that someone familiar with the dataset / dask changes can investigate.
Steps to reproduce:
First, roll back to an old commit from before the dataset / dask changes, but bring in the later changes to the Rossmann notebook that fixed bugs, improved organization, etc.:
```
# d44defa Refactor get_emb_sz (#110)
git checkout d44defa

# c179905 Merge pull request #123 from NVIDIA/vinhn-demo-notebook
git checkout c179905 -- examples/rossmann-store-sales-example.ipynb
```
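To run the notebook non-interactively for the repeated runs below, one option is nbclient. This is only a sketch, assuming nbclient and nbformat are installed and that the notebook path matches the repo layout at this commit:

```python
import nbformat
from nbclient import NotebookClient

# Path taken from the checkout above; adjust if the repo layout differs.
nb = nbformat.read("examples/rossmann-store-sales-example.ipynb", as_version=4)

# timeout=None disables the per-cell timeout so long training cells can finish.
NotebookClient(nb, timeout=None).execute()

# Save the executed copy so the printed RMSPE values can be inspected afterwards.
nbformat.write(nb, "rossmann-executed.ipynb")
```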
In this setting, if we run the notebook 3 times, training is always stable, and we get final RMSPEs of
- 0.19, 0.19, 0.21 for TensorFlow
- 0.19, 0.22, 0.21 for fast.ai
Next, check out the more recent commit, after the dataset / dask changes, and run the notebook:
```
# 294b480 Rossmann notebook fixes (#140)
git checkout 294b480
```
In this setting, if we run the notebook 3 times, training is no longer stable, and we get final RMSPEs of
- 0.27, 0.29, 0.29 for TensorFlow
- 0.69, 0.42, 0.48 for fast.ai
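For reference, RMSPE here is the root mean squared percentage error used for the Rossmann Kaggle competition. A minimal NumPy sketch of how such numbers can be computed (variable names are illustrative, not taken from the notebook):

```python
import numpy as np

def rmspe(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared percentage error: sqrt(mean(((y - yhat) / y) ** 2)).

    Rows with y_true == 0 are ignored, following the usual Kaggle convention.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# Illustrative usage with made-up numbers:
print(rmspe(np.array([100.0, 200.0, 300.0]), np.array([110.0, 180.0, 330.0])))
```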
Top GitHub Comments
Thanks for investigating, @rdipietro. It seems reasonable to me that there could be a bug in one of the dask-based operations. There is not much "validation" in the unit tests; they are mostly high-level sanity checks.
I will try to investigate the pre-processing phase of the notebook later tonight or tomorrow and see if there are any obvious problems with the processed dataset. My understanding is that the dataset is small enough to do preprocessing with cudf/pandas alone. Is that right?
Thanks – I’ll try that
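A minimal sketch of the kind of pandas-only sanity check suggested above, assuming the dataset fits in memory. The file paths, the column name, and the choice of transform (a standard-score normalization of "Sales") are assumptions for illustration, not the notebook's actual code:

```python
import pandas as pd

# Paths are placeholders; point them at the raw CSV and the NVTabular output.
raw = pd.read_csv("rossmann/train.csv")
processed = pd.read_parquet("rossmann/output/train.parquet")

# If "Sales" was standard-score normalized, the processed column should have
# roughly zero mean and unit variance, and the same row count as the input.
print("rows:", len(raw), len(processed))
print("processed mean/std:", processed["Sales"].mean(), processed["Sales"].std())

# An independently computed reference, for eyeballing the scale:
mean, std = raw["Sales"].mean(), raw["Sales"].std()
print("raw-derived mean/std used for normalization:", mean, std)
```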