[BUG] OOM error while trying to train on Outbrain dataset with TensorFlow
Describe the bug
A V100 16GB GPU runs out of memory when running the Outbrain TF training notebook. This is a new problem that has cropped up in the past few days, so it may be related to the recent TF dataloader changes.
Steps/Code to reproduce bug
Run the Outbrain example notebooks with a single V100 16GB GPU.
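For reference, a condensed sketch of the pattern that fails. Everything except the `fit` call (copied from the traceback below) is an assumption: the paths, column lists, batch size, and loader arguments are placeholders, not the exact notebook code.

```python
import glob

from nvtabular.loader.tensorflow import KerasSequenceLoader

# Placeholder paths and column lists; the real notebook defines its own.
TRAIN_PATHS = sorted(glob.glob("./outbrain/train/*.parquet"))
CATEGORICAL_COLUMNS = ["ad_id", "document_id"]        # illustrative subset
CONTINUOUS_COLUMNS = ["document_id_promo_ctr"]        # illustrative subset
LABEL_COLUMNS = ["clicked"]

train_dataset_tf = KerasSequenceLoader(
    TRAIN_PATHS,
    batch_size=131072,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    engine="parquet",
    shuffle=True,
    buffer_size=0.06,              # fraction of GPU memory used for buffering
    parts_per_chunk=1,
)

# `wide_and_deep_model` is the notebook's compiled Keras model. Keras asks the
# loader for len(train_dataset_tf) before pulling the first batch, which is
# where the traceback below originates.
history = wide_and_deep_model.fit(train_dataset_tf, epochs=1)
```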
Expected behavior
The example should complete successfully without running out of memory.
Environment details (please complete the following information): TBD
Additional context
Stack trace:
--------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-18-b1725d7df4b2> in <module>
5 experimental_run_tf_function=False
6 )
----> 7 history = wide_and_deep_model.fit(train_dataset_tf, epochs=1)
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1048 training_utils.RespectCompiledTrainableState(self):
1049 # Creates a `tf.data.Dataset` and handles batch and epoch iteration.
-> 1050 data_handler = data_adapter.DataHandler(
1051 x=x,
1052 y=y,
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
1098
1099 adapter_cls = select_data_adapter(x, y)
-> 1100 self._adapter = adapter_cls(
1101 x,
1102 y,
/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/data_adapter.py in __init__(self, x, y, sample_weights, shuffle, workers, use_multiprocessing, max_queue_size, model, **kwargs)
896 "`keras.utils.Sequence` as input.")
897
--> 898 self._size = len(x)
899 self._shuffle_sequence = shuffle
900 self._keras_sequence = x
/nvtabular/nvtabular/loader/tensorflow.py in __len__(self)
237 # TODO: what's a better way to do this inheritance
238 # of the appropriate methods? A Metaclass?
--> 239 return DataLoader.__len__(self)
240
241 def __getitem__(self, idx):
/nvtabular/nvtabular/loader/backend.py in __len__(self)
203
204 def __len__(self):
--> 205 return _num_steps(len(self._buff), self.batch_size)
206
207 @property
/nvtabular/nvtabular/loader/backend.py in __len__(self)
61
62 def __len__(self):
---> 63 return len(self.itr)
64
65 @property
/nvtabular/nvtabular/io/dataset.py in __len__(self)
766
767 def __len__(self):
--> 768 return len(self._ddf.partitions[self.indices])
769
770 def __iter__(self):
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/core.py in __len__(self)
3654 return super().__len__()
3655 else:
-> 3656 return len(s)
3657
3658 def __contains__(self, key):
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/core.py in __len__(self)
555
556 def __len__(self):
--> 557 return self.reduction(
558 len, np.sum, token="len", meta=int, split_every=False
559 ).compute()
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
281 dask.base.compute
282 """
--> 283 (result,) = compute(self, traverse=False, **kwargs)
284 return result
285
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
563 postcomputes.append(x.__dask_postcompute__())
564
--> 565 results = schedule(dsk, keys, **kwargs)
566 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
567
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
74 pools[thread][num_workers] = pool
75
---> 76 results = get_async(
77 pool.apply_async,
78 len(pool._pool),
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
485 _execute_task(task, data) # Re-execute locally
486 else:
--> 487 raise_exception(exc, tb)
488 res, worker_id = loads(res_info)
489 state["cache"][key] = res
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in reraise(exc, tb)
315 if exc.__traceback__ is not tb:
316 raise exc.with_traceback(tb)
--> 317 raise exc
318
319
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
220 try:
221 task, data = loads(task_info)
--> 222 result = _execute_task(task, data)
223 id = get_id()
224 result = dumps((result, id))
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <genexpr>(.0)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
113 """
114 if isinstance(arg, list):
--> 115 return [_execute_task(a, cache) for a in arg]
116 elif istask(arg):
117 func, args = arg[0], arg[1:]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <listcomp>(.0)
113 """
114 if isinstance(arg, list):
--> 115 return [_execute_task(a, cache) for a in arg]
116 elif istask(arg):
117 func, args = arg[0], arg[1:]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in <genexpr>(.0)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in read_parquet_part(fs, func, meta, part, columns, index, kwargs)
381
382 if isinstance(part, list):
--> 383 dfs = [
384 func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
385 for (rg, kw) in part
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in <listcomp>(.0)
382 if isinstance(part, list):
383 dfs = [
--> 384 func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
385 for (rg, kw) in part
386 ]
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cudf/io/parquet.py in read_partition(fs, piece, columns, index, categories, partitions, **kwargs)
59 strings_to_cats = kwargs.get("strings_to_categorical", False)
60 if cudf.utils.ioutils._is_local_filesystem(fs):
---> 61 df = cudf.read_parquet(
62 path,
63 engine="cudf",
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/io/parquet.py in read_parquet(filepath_or_buffer, engine, columns, filters, row_groups, skiprows, num_rows, strings_to_categorical, use_pandas_metadata, *args, **kwargs)
249
250 if engine == "cudf":
--> 251 return libparquet.read_parquet(
252 filepaths_or_buffers,
253 columns=columns,
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
MemoryError: std::bad_alloc: CUDA error at: /opt/conda/envs/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
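What the trace shows: Keras' Sequence data adapter calls `len(x)` on the NVTabular loader before training starts, the loader delegates to `len()` on the underlying dask_cudf DataFrame, and dask answers that with a `len`/`sum` reduction that reads parquet partitions into GPU memory via `cudf.read_parquet`. Counting rows therefore materializes real data on the device, which appears to be enough to exhaust a 16 GB card here. For comparison, the row count can be taken from the parquet footers without touching the GPU at all; a minimal sketch, where `parquet_row_count` is a hypothetical helper rather than anything NVTabular ships:

```python
import math
import pyarrow.parquet as pq

def parquet_row_count(paths):
    """Sum num_rows from each file's footer metadata (no data pages are read)."""
    return sum(pq.ParquetFile(p).metadata.num_rows for p in paths)

def expected_steps(paths, batch_size):
    """What len(train_dataset_tf) ought to return, computed CPU-side."""
    return math.ceil(parquet_row_count(paths) / batch_size)
```

This is only a cheap way to check what the length should be; the actual fix inside the loader may look different.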
Top GitHub Comments
@karlhigley and @jperez999 I could train the notebook after pulling the recent changes you folks made and reducing the buffer size on a 16 GB V100.
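In case it helps anyone else hitting this: "reducing the buffer size" presumably means passing a smaller `buffer_size` to the loader. A hypothetical example, reusing the placeholder names from the sketch near the top of the issue; 0.02 is an arbitrary illustrative value, not a maintainer recommendation:

```python
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Same loader call as the sketch above, with only the buffer fraction lowered.
train_dataset_tf = KerasSequenceLoader(
    TRAIN_PATHS,
    batch_size=131072,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    engine="parquet",
    shuffle=True,
    buffer_size=0.02,   # smaller fraction of GPU memory for buffering
    parts_per_chunk=1,
)
```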
@rjzamora says: