
[BUG] NVTabular data loader for TensorFlow validation is slow

See original GitHub issue

Describe the bug
Using the NVTabular data loader for TensorFlow for validation with the Criteo dataset is slow:

    validation_callback = KerasSequenceValidater(valid_dataset_tf)
    history = model.fit(train_dataset_tf, 
                        epochs=EPOCHS, 
                        steps_per_epoch=20, callbacks=[validation_callback])

Training: 2 min for 2288 steps
Validation: estimated 55 min for 3003 steps
Same batch size, dataset, etc. …

The validation dataset is 1.3x bigger, but iterating through it takes 27x more time than iterating through the training dataset.
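For reference, the two data loaders used above (train_dataset_tf and valid_dataset_tf) would typically be built with the NVTabular TensorFlow loader. A minimal sketch, assuming the standard Criteo column names and placeholder paths and batch size (the linked notebook has the exact values):

    from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater

    BATCH_SIZE = 65536  # assumed; the same value is used for training and validation

    CONTINUOUS_COLUMNS = ["I" + str(i) for i in range(1, 14)]   # Criteo integer features
    CATEGORICAL_COLUMNS = ["C" + str(i) for i in range(1, 27)]  # Criteo categorical features

    train_dataset_tf = KerasSequenceLoader(
        "./train/*.parquet",            # assumed path to the preprocessed training data
        batch_size=BATCH_SIZE,
        label_names=["label"],
        cat_names=CATEGORICAL_COLUMNS,
        cont_names=CONTINUOUS_COLUMNS,
        engine="parquet",
        shuffle=True,
    )

    valid_dataset_tf = KerasSequenceLoader(
        "./valid/*.parquet",            # assumed path to the preprocessed validation data
        batch_size=BATCH_SIZE,
        label_names=["label"],
        cat_names=CATEGORICAL_COLUMNS,
        cont_names=CONTINUOUS_COLUMNS,
        engine="parquet",
        shuffle=False,
    )

The shuffle setting is not the cause of the difference: as described in the hypotheses below, swapping the two loaders reverses which one is slow.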

Steps/Code to reproduce bug
An example is provided here: https://github.com/bschifferer/NVTabular/blob/criteo_tf_slow/examples/criteo_tensorflow_slow.ipynb

You will notice that iterating over the training dataset takes on average ~1 s per 10 batches, while the validation dataset takes 4-6 s per 10 batches.
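For illustration, a minimal version of such a timing loop might look like the sketch below. The helper name time_batches is hypothetical, and it assumes each batch is a (features, labels) pair and that model accepts the features directly; the linked notebook contains the actual code.

    import time

    def time_batches(dataloader, model, report_every=10):
        """Iterate over a data loader, run the model's forward pass, and
        print the elapsed time for every block of `report_every` batches."""
        start = time.time()
        for i, (features, labels) in enumerate(dataloader):
            _ = model(features, training=False)  # forward pass only, no weight updates
            if (i + 1) % report_every == 0:
                print(f"batches {i + 2 - report_every}-{i + 1}: {time.time() - start:.2f}s")
                start = time.time()

    time_batches(train_dataset_tf, model)  # ~1 s per 10 batches observed
    time_batches(valid_dataset_tf, model)  # 4-6 s per 10 batches observed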

Expected behavior
The validation loop should be about as fast as the training loop.

Additional context
There are multiple hypotheses and tests:

  • This behavior can be observed by just iterating over the dataset and executing the forward pass of the model
  • If we swap the training and validation data loaders, the validation data loader becomes fast and the training data loader becomes slow. In other words, whichever data loader is iterated over second is slow. The hypothesis is that GPU memory is not released by the first data loader and blocks the pipeline (see the memory-check sketch after this list)
  • If we remove the forward pass of the model from the loop, both iterations are fast. It probably has something to do with moving the data to the TensorFlow model
  • I tried tf.keras.backend.clear_session() between the iterations, but it did not help
  • I tried running the following between the iterations:

from numba import cuda
cuda.select_device(0)  # select the GPU the data loaders run on
cuda.close()           # tear down the CUDA context to release its memory

but it resulted in an error

  • I tried using separate subprocesses, but it did not improve performance
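
As a check for the GPU-memory hypothesis above, one way to see whether the first data loader keeps device memory allocated is to query free GPU memory between the two loops, for example with pynvml (an assumption on my part; nvidia-smi reports the same numbers). This reuses the hypothetical time_batches helper from the earlier sketch:

    import pynvml

    def free_gpu_mb(device_index=0):
        """Return the free memory on the given GPU in MiB."""
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        pynvml.nvmlShutdown()
        return info.free / 1024**2

    print("free before the training loop:  ", free_gpu_mb())
    time_batches(train_dataset_tf, model)
    print("free before the validation loop:", free_gpu_mb())
    time_batches(valid_dataset_tf, model)
    print("free after the validation loop: ", free_gpu_mb())

If the first loop leaves little free memory behind, that would support the idea that the second iteration is starved for GPU memory rather than being inherently slower. Note that TensorFlow may reserve most of the GPU memory up front unless memory growth is enabled, so the numbers are only indicative.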

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
EvenOldridge commented, Dec 2, 2020

@jperez999 we think this is happening in the TF dataloader only. @bschifferer will confirm by testing PyTorch. Can you take a look, please?

0 reactions
benfred commented, Jan 19, 2021

This seems to be related to insufficient host/GPU memory - closing.

Read more comments on GitHub

Top Results From Across the Web

Source code for nvtabular.loader.tensorflow
Sequence, DataLoader): """ Infinite generator used to asynchronously iterate through CSV or Parquet dataframes on GPU by leveraging an NVTabular `Dataset`.

Announcing the NVIDIA NVTabular Open Beta with Multi-GPU ...
The NVTabular data loader for TensorFlow is designed to feed tabular data ... such as the training and validation data, data schema, ...

slow training despite using tf data pipeline - Stack Overflow
I have created data pipeline with the help of tf.data API of tensorflow. My issue is that training is too slow despite using...

Distributed Data Science using NVTabular on Spark & Dask
GPU doesn't get fully utilized as the data loader is comparatively much slower in preparing the next batch. With NVTabular highly customized tabular...

NVTabular-PyTorch-DeepFM - Kaggle
Explore and run machine learning code with Kaggle Notebooks | Using data from MovieLens 25M Dataset.
