Memory runaway when training with Sequence in a loop
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): Yes, a minimal reproduction
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.9.1
- Python version: 3.8.10
- Bazel version (if compiling from source): –
- GPU model and memory: Tried on RTX3090 24GB and A100 40GB, 256GB RAM
- Exact command to reproduce:
Describe the problem.
When training the same model in a loop over multiple utils.Sequence generators, memory usage rapidly explodes. Deleting the model itself and clearing the session does return most (but not all) of the memory (see the workaround sketch below), but this is not a fix, since I need the model to survive the training.
Using multiple generators per session is very natural for our application, since we frequently train models on synthetic data created on demand, with parameters that change at different stages of training.
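For reference, the workaround alluded to above (deleting the model and clearing the session) looks roughly like the sketch below. The function name and the rebuild_model callback are hypothetical; the only real Keras APIs used are get_weights/set_weights and tf.keras.backend.clear_session. The weight round-trip is exactly the reload we would like to avoid.
import gc
import tensorflow as tf

def reset_session_keep_weights(model, rebuild_model):
    # Snapshot the weights, tear down the session, then rebuild and restore.
    weights = model.get_weights()
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    new_model = rebuild_model()  # hypothetical: recreates and compiles the architecture
    new_model.set_weights(weights)
    return new_model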
Describe the current behavior.
Memory is not released when the data used to train the model is deleted.
Describe the expected behavior.
Memory is released when the data used to train the model is deleted. Failing that, a method to manually clear whatever cache is storing the data, without having to reload the model.
Standalone code to reproduce the issue.
import os, psutil
import gc
import tensorflow as tf
import numpy as np
# Arbitrary small network
model = tf.keras.Sequential()
model.add(
    tf.keras.layers.Conv2D(32, 3, padding="same", input_shape=(None, None, 3))
)
model.add(
    tf.keras.layers.Conv2D(16, 3, padding="same")
)
model.add(
    tf.keras.layers.Conv2D(1, 3, padding="same")
)
model.compile(loss="mse")
# Minimal implementation of Sequence (taken from docs)
class Sequence(tf.keras.utils.Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y
process = psutil.Process(os.getpid())
gc.collect()
print("Before\t", process.memory_info().rss)

for idx in range(20):
    x, y = np.random.rand(100, 512, 512, 3), np.random.rand(100, 512, 512, 1)
    generator = Sequence(x, y, 8)
    model.fit(generator, epochs=1, verbose=0)
    del generator
    gc.collect()
    print(idx, "\t", process.memory_info().rss)
======== Output
Before 1374494720
0 4787523584
1 5837004800
2 6831996928
3 7806418944
4 8839163904
5 9804460032
6 10644090880
7 11651334144
8 12608495616
9 13422891008
Top GitHub Comments
Hi @BenjaminMidtvedt,
I have not been able to find a clear culprit for how this memory is sticking around, but something that catches my eye in your repro code is the iterative creation of a Sequence object and the calls to Model.fit in a loop. The Sequence docs state that if you want to update your data between epochs, you should override on_epoch_end. I've put together a gist adapting your code to use this technique, and the memory leak no longer exists. I'm not certain if this will support your use case.
I would also consider checking out TensorFlow Dataset's from_generator method, which more natively supports this type of use case. In general, because there is overhead in calling Model.fit, for this type of use case you should prefer using a from_generator dataset or using Model.train_on_batch.
Hi @ianstenbit
Indeed, it is the recreation of the Sequence object that is the culprit. Sadly, it's not a viable solution for us to use just one Sequence per run. I'll check whether native TensorFlow datasets can support my needs and whether they solve the memory issue. If so, that's an acceptable solution from my side!
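For completeness, a sketch of the tf.data route mentioned above, reusing model and np from the repro; the generator function, its parameters, and the shapes are illustrative, and whether this actually avoids the leak in practice would still need to be verified:
def synthetic_batches(batch_size=8, n_batches=13):
    # Yield freshly generated batches; generation parameters can change
    # between training stages without recreating the model.
    for _ in range(n_batches):
        x = np.random.rand(batch_size, 512, 512, 3).astype(np.float32)
        y = np.random.rand(batch_size, 512, 512, 1).astype(np.float32)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    synthetic_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 512, 512, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 512, 512, 1), dtype=tf.float32),
    ),
)

for stage in range(20):
    model.fit(dataset, epochs=1, verbose=0)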