Out of memory exception when fitting model
TensorFlow.js version
5.6.0
Browser version
Version 67.0.3396.99 (Official Build) (64-bit)
Describe the problem or feature request
When fitting a fairly simple dense model (100-125-75 nodes per layer), we hit an out-of-memory exception. Note: performance also drops noticeably over time.
I'm training multiple models on a small dataset (~100 rows, 16 inputs per row). We get the exception on the second run, at around epoch 1000.
The machine has 16 GB of RAM and an AMD RX 480 video card.
If we continue past the exception, training still makes progress, but very slowly.
Code to reproduce the bug / link to feature request
I'm training for a lot of epochs. Here is the full training function:
async trainModel(model, Xdata, Ydata) {
  const xs = tf.tensor2d(Xdata);
  const ys = tf.tensor2d(Ydata);
  const yscaled = ys.mul(this.scaleFactor);
  await model.fit(xs, yscaled, {
    batchSize: 25,
    epochs: this.epochs,
    callbacks: {
      onEpochEnd: async (epoch, log) => {
        console.log(`Epoch ${epoch}: loss = ${log.loss}`);
      }
    }
  });
  // Do a quick test on the setting value
  const ypred = model.predict(xs);
  const ypredDescaled = ypred.div(this.scaleFactor);
  const pdata = ypredDescaled.dataSync();
  for (let i = 0; i < pdata.length; i++) {
    // TODO: evaluate differences
  }
  // As this is an async operation, manually dispose of allocated memory
  xs.dispose();
  ys.dispose();
  yscaled.dispose();
  ypred.dispose();
  ypredDescaled.dispose();
}
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Ok, so not sure if this helps, but I tested my code running on Node.js/Ubuntu as well. I had similar results there, with my entire 12 GB of RAM being eaten when trying to train a moderately sized graph (65K rows, 2000 epochs). Each process would start at around 500 MB of memory, then grow to well over 2 GB (running 4 simultaneous trainings).
NOTE: I’m running on the CPU for this test.
It appears that (for Node anyway) the leak is on the JS side. Periodically printing process.memoryUsage() shows the heap and RSS growing continuously while training, while external stays fairly constant. I assume tensors are counted as external memory, rather than living on the JS heap.
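A minimal way to do that kind of monitoring (plain Node, no TF.js required; formatMemoryUsage is a hypothetical helper) is to format process.memoryUsage() and print it from the onEpochEnd callback every N epochs:

```javascript
// Sketch: report heap/RSS/external so a JS-side leak (growing heapUsed and
// rss with flat external) can be told apart from leaked tensor memory,
// which shows up under external.
function formatMemoryUsage() {
  const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1) + ' MB';
  const { rss, heapUsed, external } = process.memoryUsage();
  return `rss=${mb(rss)} heapUsed=${mb(heapUsed)} external=${mb(external)}`;
}

// Example: call this every N epochs from onEpochEnd.
console.log(formatMemoryUsage());
```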
Some things I tried:
- Calling GC() in the epoch callback did not resolve the issue.
- Running multiple iterations of model.fit, with a GC between iterations, did not resolve the issue.
- Running multiple iterations of model.fit by saving the model, releasing it, calling GC(), calling tf.disposeVariables(), reloading the model, and continuing training still did not fix the issue.
- All of the above, plus calling tf.setBackend(tf.getBackend()), FINALLY FIXED IT!
I now do a full flush of the system every 250 epochs. All of this significantly improved training performance and dropped my total memory usage to a nice low 200 MB - even lower than the starting conditions.
Going to refer this to @caisq in case this is because of a memory leak inside model.fit. We did have one that was fixed in https://github.com/tensorflow/tfjs-layers/pull/252, so a fix may be incoming.