
[BUG] OOM error while trying to train on Outbrain dataset with Tensorflow

Describe the bug
A V100 16GB GPU now runs out of memory when trying to run the Outbrain TF training notebook. This is a new problem that has cropped up in the past few days, so it may be related to recent TF dataloader changes.

Steps/Code to reproduce bug
Run the Outbrain example notebooks with a single V100 16GB GPU.

Expected behavior
The example should complete successfully without running out of memory.

Environment details (please complete the following information):
TBD

Additional context

Stack trace:

--------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-18-b1725d7df4b2> in <module>
      5     experimental_run_tf_function=False
      6 )
----> 7 history = wide_and_deep_model.fit(train_dataset_tf, epochs=1)
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1048          training_utils.RespectCompiledTrainableState(self):
   1049       # Creates a `tf.data.Dataset` and handles batch and epoch iteration.
-> 1050       data_handler = data_adapter.DataHandler(
   1051           x=x,
   1052           y=y,
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
   1098 
   1099     adapter_cls = select_data_adapter(x, y)
-> 1100     self._adapter = adapter_cls(
   1101         x,
   1102         y,
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, shuffle, workers, use_multiprocessing, max_queue_size, model, **kwargs)
    896                        "`keras.utils.Sequence` as input.")
    897 
--> 898     self._size = len(x)
    899     self._shuffle_sequence = shuffle
    900     self._keras_sequence = x
/nvtabular/nvtabular/loader/tensorflow.py in __len__(self)
    237         # TODO: what's a better way to do this inheritance
    238         # of the appropriate methods? A Metaclass?
--> 239         return DataLoader.__len__(self)
    240 
    241     def __getitem__(self, idx):
/nvtabular/nvtabular/loader/backend.py in __len__(self)
    203 
    204     def __len__(self):
--> 205         return _num_steps(len(self._buff), self.batch_size)
    206 
    207     @property
/nvtabular/nvtabular/loader/backend.py in __len__(self)
     61 
     62     def __len__(self):
---> 63         return len(self.itr)
     64 
     65     @property
/nvtabular/nvtabular/io/dataset.py in __len__(self)
    766 
    767     def __len__(self):
--> 768         return len(self._ddf.partitions[self.indices])
    769 
    770     def __iter__(self):
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/core.py in __len__(self)
   3654             return super().__len__()
   3655         else:
-> 3656             return len(s)
   3657 
   3658     def __contains__(self, key):
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/core.py in __len__(self)
    555 
    556     def __len__(self):
--> 557         return self.reduction(
    558             len, np.sum, token="len", meta=int, split_every=False
    559         ).compute()
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    281         dask.base.compute
    282         """
--> 283         (result,) = compute(self, traverse=False, **kwargs)
    284         return result
    285 
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    563         postcomputes.append(x.__dask_postcompute__())
    564 
--> 565     results = schedule(dsk, keys, **kwargs)
    566     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    567 
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     74                 pools[thread][num_workers] = pool
     75 
---> 76     results = get_async(
     77         pool.apply_async,
     78         len(pool._pool),
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    485                         _execute_task(task, data)  # Re-execute locally
    486                     else:
--> 487                         raise_exception(exc, tb)
    488                 res, worker_id = loads(res_info)
    489                 state["cache"][key] = res
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in reraise(exc, tb)
    315     if exc.__traceback__ is not tb:
    316         raise exc.with_traceback(tb)
--> 317     raise exc
    318 
    319 
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    220     try:
    221         task, data = loads(task_info)
--> 222         result = _execute_task(task, data)
    223         id = get_id()
    224         result = dumps((result, id))
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <genexpr>(.0)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <listcomp>(.0)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <genexpr>(.0)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in read_parquet_part(fs, func, meta, part, columns, index, kwargs)
    381 
    382     if isinstance(part, list):
--> 383         dfs = [
    384             func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
    385             for (rg, kw) in part
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in <listcomp>(.0)
    382     if isinstance(part, list):
    383         dfs = [
--> 384             func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
    385             for (rg, kw) in part
    386         ]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cudf/io/parquet.py in read_partition(fs, piece, columns, index, categories, partitions, **kwargs)
     59         strings_to_cats = kwargs.get("strings_to_categorical", False)
     60         if cudf.utils.ioutils._is_local_filesystem(fs):
---> 61             df = cudf.read_parquet(
     62                 path,
     63                 engine="cudf",
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/io/parquet.py in read_parquet(filepath_or_buffer, engine, columns, filters, row_groups, skiprows, num_rows, strings_to_categorical, use_pandas_metadata, *args, **kwargs)
    249 
    250     if engine == "cudf":
--> 251         return libparquet.read_parquet(
    252             filepaths_or_buffers,
    253             columns=columns,
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
MemoryError: std::bad_alloc: CUDA error at: /opt/conda/envs/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

3 reactions
rnyak commented on Apr 2, 2021

@karlhigley and @jperez999 I could train the notebook after pulling the recent changes you folks made and reducing the buffer size on a 16GB V100.
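
For context, the buffer being reduced here is the dataloader's device-side read buffer. A minimal sketch of what that adjustment might look like with NVTabular's TF loader; the paths, column lists, batch size, and the exact `buffer_size` value below are illustrative assumptions, not taken from the Outbrain notebook:

```python
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Placeholder inputs; the real notebook derives these from the
# preprocessed Outbrain dataset.
TRAIN_PATHS = ["./data/train/part_0.parquet"]
CATEGORICAL_COLUMNS = ["ad_id", "document_id"]
CONTINUOUS_COLUMNS = ["document_id_views"]

# buffer_size is a fraction of GPU memory the loader may claim for
# its device-side buffer; lowering it trades some throughput for
# memory headroom on a 16GB V100.
train_dataset_tf = KerasSequenceLoader(
    TRAIN_PATHS,
    batch_size=131072,
    label_names=["clicked"],
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    engine="parquet",
    shuffle=True,
    buffer_size=0.02,   # reduced from the default fraction (illustrative)
    parts_per_chunk=1,
)
```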

1 reaction
karlhigley commented on Mar 31, 2021

@rjzamora says:

`self._ddf.partitions[self.indices]` will produce a Dask DataFrame (a subset of the original ddf), and the `len` will probably alias to `len(<ddf>.index)`. So, it will probably try to read from the files to produce an index for each partition. This can probably be optimized to use the metadata for a Parquet file, but I don't think this is being done right now.
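
To make that cost concrete, here is a small illustration (with a hypothetical file path) of why `len` on a Dask DataFrame is expensive: Dask builds a per-partition `len` reduction and computes it, which for Parquet means actually reading the data. That is exactly where the stack trace above bottoms out, via `cudf.read_parquet`:

```python
import dask.dataframe as dd

# Lazily open a parquet dataset; nothing is read yet.
ddf = dd.read_parquet("data/*.parquet")  # hypothetical path

# Selecting partitions is still lazy, just as in the loader code.
subset = ddf.partitions[[0, 1]]

# This triggers real work: a graph that loads each partition, takes
# its length, and sums the results. With dask_cudf, each partition
# load calls cudf.read_parquet, as seen in the traceback.
n_rows = len(subset)
```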

We can probably provide our own optimization until it gets into Dask.

We want to use the metadata in the same way we do in `num_rows`, but we need to map the row-group metadata onto the expected ddf partitions. So, we will need to use the kind of tricky logic used in `_file_partition_map`.

Note that the `_file_partition_map` code generates a map between files and partitions, but something similar could also map row-groups to partition lengths.
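
A rough sketch of the metadata-only approach being discussed, using pyarrow directly. The helper name and the one-file-per-partition assumption are simplifications for illustration; NVTabular's actual `_file_partition_map` logic, and the row-group-to-partition mapping it would need, are more involved:

```python
import pyarrow.parquet as pq

def partition_lengths_from_metadata(paths):
    """Hypothetical helper: per-file row counts from parquet footers
    alone, with no column data read. Assumes one file per partition,
    which real datasets may not satisfy."""
    lengths = []
    for path in paths:
        meta = pq.ParquetFile(path).metadata
        # Row counts live in the footer's row-group metadata, so the
        # total is available without touching the data pages.
        lengths.append(
            sum(meta.row_group(i).num_rows for i in range(meta.num_row_groups))
        )
    return lengths
```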
