
Out of memory exception when fitting model


To get help from the community, check out our Google group.

TensorFlow.js version

5.6.0

Browser version

Version 67.0.3396.99 (Official Build) (64-bit)

Describe the problem or feature request

When fitting a fairly simple dense model (100-125-75 nodes per layer), we hit an out-of-memory exception. Note: performance also drops noticeably over time.

I’m training multiple models on a small dataset (~100 rows, 16 inputs per row). We get this exception on the second run, at around epoch 1000.

The machine has 16 GB of RAM and an AMD RX 480 video card.

If we continue past the exception, training progresses, but very slowly.

Code to reproduce the bug / link to feature request

I’m training for a lot of epochs. Here is the full training function:

async trainModel(model, Xdata, Ydata) {
    const xs = tf.tensor2d(Xdata);
    const ys = tf.tensor2d(Ydata);

    const yscaled = ys.mul(this.scaleFactor);
    await model.fit(xs, yscaled, {
        batchSize: 25,
        epochs: this.epochs,
        callbacks: {
            onEpochEnd: async (epoch, log) => {
                console.log(`Epoch ${epoch}: loss = ${log.loss}`);
            }
        }
    });

    // Do a quick test on the setting value
    let ypred = model.predict(xs);
    let ypredDescaled = ypred.div(this.scaleFactor);
    let pdata = ypredDescaled.dataSync();
    for (let i = 0; i < pdata.length; i++) {
        // TODO: evaluate differences
    }

    // Tensors are not garbage-collected by JS, so dispose of them manually
    xs.dispose();
    ys.dispose();
    yscaled.dispose();
    ypred.dispose();
    ypredDescaled.dispose();
}
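
As an aside, the prediction block above could be wrapped in tf.tidy(), which automatically disposes every tensor created inside its callback. A minimal sketch, not from the original issue (evaluatePredictions is a hypothetical helper name):

evaluatePredictions(model, xs) {
    // Tensors created inside the callback (ypred, ypredDescaled) are
    // disposed automatically when tidy() returns, so the manual
    // dispose() calls become unnecessary for them.
    return tf.tidy(() => {
        const ypred = model.predict(xs);
        const ypredDescaled = ypred.div(this.scaleFactor);
        // dataSync() copies the values into a plain Float32Array, which
        // survives after tidy() disposes the tensors themselves.
        return ypredDescaled.dataSync();
    });
}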

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

3 reactions
FrozenKiwi commented, Jul 20, 2018

Ok, so not sure if this helps, but I tested my code running on Node.js/Ubuntu as well. I had similar results there, with my entire 12 GB of RAM being eaten while trying to train a moderately sized graph (65K rows, 2000 epochs). Each process would start at around 500 MB of memory, then grow to well over 2 GB (I was running 4 simultaneous trainings).

NOTE: I’m running on the CPU for this test.

It appears that (for Node, anyway) the leak is on the JS side. Periodically printing process.memoryUsage() shows that the heap and RSS grow continuously while training, while external stays fairly constant. I assume tensors refer to external memory rather than anything on the JS heap.
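
For reference, here is a sketch of that kind of per-epoch logging (the helper name is illustrative): tf.memory() reports the tensors tracked by the backend, while process.memoryUsage() reports the Node.js heap and RSS, so comparing the two shows which side is growing.

// Illustrative Node.js-only helper for spotting which side of the
// JS/backend boundary is leaking.
function logMemory(epoch) {
    const mb = (n) => (n / 1024 / 1024).toFixed(1);
    const { rss, heapUsed, external } = process.memoryUsage();
    console.log(
        `epoch ${epoch}: rss=${mb(rss)} MB, heap=${mb(heapUsed)} MB, ` +
        `external=${mb(external)} MB, tensors=${tf.memory().numTensors}`
    );
}
// e.g. from the fit() callbacks: onEpochEnd: async (epoch) => logMemory(epoch)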

Some things I tried:

  • Calling GC() in the epoch callback did not resolve the issue.
  • Running multiple iterations of model.fit, with a GC between iterations, did not resolve the issue.
  • Running multiple iterations of model.fit by saving the model, releasing it, calling GC(), calling tf.disposeVariables(), reloading the model, and continuing training still did not fix the issue.
  • All of the above, plus calling tf.setBackend(tf.getBackend()), FINALLY FIXED IT!

I now do a full flush of the system every 250 epochs. All of this significantly improved training performance and dropped my total memory usage to a nice low 200 MB, even lower than starting conditions.
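
Pulling those steps together, a hedged sketch of that periodic “full flush” sequence, written against the current tf.io in-memory save/load handlers (the 2018-era API differed, and global.gc() is only available when node is started with --expose-gc):

// Sketch of the save / release / GC / backend-reset / reload cycle
// described above; names per current tfjs, not the 2018 API.
async function flushAndReload(model) {
    // 1. Serialize the model's topology and weights to memory.
    let artifacts;
    await model.save(tf.io.withSaveHandler(async (a) => {
        artifacts = a;
        return { modelArtifactsInfo: { dateSaved: new Date(), modelTopologyType: 'JSON' } };
    }));

    // 2. Release the live model, stray variables, and the JS heap.
    model.dispose();
    tf.disposeVariables();
    if (global.gc) global.gc();

    // 3. Reset the backend: the step that finally freed memory here.
    await tf.setBackend(tf.getBackend());

    // 4. Rehydrate the model and resume training.
    return tf.loadLayersModel(tf.io.fromMemory(artifacts));
}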

2 reactions
tafsiri commented, Jul 3, 2018

Going to refer this to @caisq in case this is caused by a memory leak inside model.fit. We did have one that was fixed in https://github.com/tensorflow/tfjs-layers/pull/252, so a fix may be incoming.
