
Memory runaway when training with Sequence in a loop

See original GitHub issue

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): Yes, a minimal reproduction
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.9.1
  • Python version: 3.8.10
  • Bazel version (if compiling from source): –
  • GPU model and memory: Tried on RTX3090 24GB and A100 40GB, 256GB RAM
  • Exact command to reproduce:

Describe the problem.

When training the same model in a loop over multiple utils.Sequence generators, memory usage rapidly explodes. Deleting the model itself and clearing the session does return most (but not all) of the memory. However, this is not a viable fix, since I need the model to survive between training runs.

Using multiple generators per session is very natural for our application since we’re frequently training models using synthetic data created on-demand with parameters that change at different stages of training.
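
For reference, the session-clearing workaround mentioned above looks roughly like this (a sketch only; having to rebuild the model afterwards is exactly what makes it unusable here):

import gc
import tensorflow as tf

# Releasing most of the memory requires destroying the model as well,
# which defeats the purpose since training has to continue with it.
del model
tf.keras.backend.clear_session()
gc.collect()
# ...the model would now have to be rebuilt or reloaded before training resumes.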

Describe the current behavior.

Memory does not get cleared when the data used to train the model is cleared.

Describe the expected behavior.

Memory gets cleared when the data used to train the model is cleared. Or, if nothing else, a method to manually clear whatever cache is storing the data without reloading the model.

Standalone code to reproduce the issue.

import os, psutil
import gc
import tensorflow as tf
import numpy as np


# Arbitrary small network
model = tf.keras.Sequential()
model.add(
    tf.keras.layers.Conv2D(32, 3, padding="same", input_shape=(None, None, 3))
)
model.add(
    tf.keras.layers.Conv2D(16, 3, padding="same")
)
model.add(
    tf.keras.layers.Conv2D(1, 3, padding="same")
)
model.compile(loss="mse")

# Minimal implementation of Sequence (taken from docs)
class Sequence(tf.keras.utils.Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

# Track resident memory (RSS) of this process (os, psutil and gc are imported above)
process = psutil.Process(os.getpid())

gc.collect()    
print("Before\t", process.memory_info().rss) 

for idx in range(20):
    x, y = np.random.rand(100, 512, 512, 3), np.random.rand(100, 512, 512, 1)
    generator = Sequence(x, y, 8)
    
    model.fit(generator, epochs=1, verbose=0)

    del generator
    gc.collect()    
    print(idx, "\t", process.memory_info().rss) 

======== Output
Before	 1374494720
0 	 4787523584
1 	 5837004800
2 	 6831996928
3 	 7806418944
4 	 8839163904
5 	 9804460032
6 	 10644090880
7 	 11651334144
8 	 12608495616
9 	 13422891008

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:9 (3 by maintainers)

Top GitHub Comments

2 reactions
ianstenbit commented, Sep 16, 2022

Hi @BenjaminMidtvedt,

I have not been able to find a clear culprit for how this memory is sticking around, but something that catches my eye in your repro code is the iterative creation of a Sequence object and calls to Model.fit in a loop.

The Sequence docs state that if you want to update your data between epochs you should override on_epoch_end. I’ve put together a gist adapting your code to use this technique, and the memory leak no longer exists. I’m not certain if this will support your use case.
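
For illustration, a minimal sketch of that approach (this is not the actual gist; RegeneratingSequence and _regenerate are hypothetical names, and it assumes the synthetic data can simply be regenerated in place):

import numpy as np
import tensorflow as tf

class RegeneratingSequence(tf.keras.utils.Sequence):
    """One Sequence instance reused for the whole run; data is refreshed
    in on_epoch_end instead of constructing a new Sequence per fit call."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._regenerate()

    def _regenerate(self):
        # Hypothetical stand-in for the on-demand synthetic data generation.
        self.x = np.random.rand(100, 512, 512, 3)
        self.y = np.random.rand(100, 512, 512, 1)

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

    def on_epoch_end(self):
        # Keras calls this after every epoch; swap in fresh data here.
        self._regenerate()

generator = RegeneratingSequence(batch_size=8)
model.fit(generator, epochs=20, verbose=0)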

I would also consider checking out TensorFlow Dataset’s from_generator method, which more natively supports this type of use case.
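
Roughly, that could look like the following (a sketch only; the generator function and shapes here mirror the repro code and are assumptions):

import numpy as np
import tensorflow as tf

def synthetic_batches():
    # Hypothetical generator: yields freshly created batches on demand.
    while True:
        x = np.random.rand(8, 512, 512, 3).astype("float32")
        y = np.random.rand(8, 512, 512, 1).astype("float32")
        yield x, y

dataset = tf.data.Dataset.from_generator(
    synthetic_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, None, None, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, None, None, 1), dtype=tf.float32),
    ),
)

# The dataset is infinite, so steps_per_epoch bounds each epoch.
model.fit(dataset, steps_per_epoch=13, epochs=20, verbose=0)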

In general, because there is overhead in calling Model.fit, for this type of use case you should prefer using a from_generator dataset or using Model.train_on_batch.
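
A train_on_batch version of the same loop might look like this (again a sketch, with a hypothetical synthetic-data step standing in for the real one):

import numpy as np

for step in range(260):  # e.g. 20 "epochs" of 13 batches each
    x = np.random.rand(8, 512, 512, 3)
    y = np.random.rand(8, 512, 512, 1)
    # train_on_batch skips the per-fit setup and keeps no Sequence object
    # alive between steps.
    loss = model.train_on_batch(x, y)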

1 reaction
BenjaminMidtvedt commented, Sep 16, 2022

Hi @ianstenbit,

Indeed, it is the recreation of the Sequence object that is the culprit. Sadly, it’s not a viable solution for us to use just one Sequence per run. I’ll check if native TensorFlow datasets can support my needs and if it solves the memory issue. If so, that’s an acceptable solution from my side!

