Out of memory exception when fitting model
TensorFlow.js version
5.6.0
Browser version
Version 67.0.3396.99 (Official Build) (64-bit)
Describe the problem or feature request
When fitting a fairly simple dense model (100-125-75 nodes per layer), we hit an out-of-memory exception. Note: performance also drops noticeably over time.
I'm training multiple models on a small dataset (~100 rows, 16 inputs per row). We get the exception on the second run, at around epoch 1000.
The machine has 16 GB of RAM and an AMD RX 480 video card.
If we continue past the exception, training still makes progress, but very slowly.
Code to reproduce the bug / link to feature request
I'm training for a lot of epochs. Here is the full training function:
async trainModel(model, Xdata, Ydata) {
  const xs = tf.tensor2d(Xdata);
  const ys = tf.tensor2d(Ydata);
  const yscaled = ys.mul(this.scaleFactor);
  await model.fit(xs, yscaled, {
    batchSize: 25,
    epochs: this.epochs,
    callbacks: {
      onEpochEnd: async (epoch, log) => {
        console.log(`Epoch ${epoch}: loss = ${log.loss}`);
      }
    }
  });
  // Do a quick test on the setting value
  const ypred = model.predict(xs);
  const ypredDescaled = ypred.div(this.scaleFactor);
  const pdata = ypredDescaled.dataSync();
  for (let i = 0; i < pdata.length; i++) {
    // TODO: evaluate differences
  }
  // As this is an async operation, manually dispose of allocated memory
  xs.dispose();
  ys.dispose();
  yscaled.dispose();
  ypred.dispose();
  ypredDescaled.dispose();
}
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Ok, so not sure if this helps, but I tested my code running on Node.js/Ubuntu as well. I had similar results there, with my entire 12 GB of RAM being eaten when trying to train a moderately sized graph (65K rows, 2000 epochs). Each process would start at around 500 MB of memory, then grow to well over 2 GB (running 4 simultaneous trainings).
NOTE: I’m running on the CPU for this test.
It appears that (for Node anyway) the leak is on the JS side. Periodically printing process.memoryUsage() shows the heap and RSS growing continuously while training, while external stays fairly constant. I assume tensors are counted as external memory, rather than living on the JS heap.
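A minimal way to do that kind of monitoring (plain Node, no TF.js required; formatMemoryUsage is a hypothetical helper) is to format process.memoryUsage() and print it from the onEpochEnd callback every N epochs:

```javascript
// Sketch: report heap/RSS/external so a JS-side leak (growing heapUsed and
// rss with flat external) can be told apart from leaked tensor memory,
// which shows up under external.
function formatMemoryUsage() {
  const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1) + ' MB';
  const { rss, heapUsed, external } = process.memoryUsage();
  return `rss=${mb(rss)} heapUsed=${mb(heapUsed)} external=${mb(external)}`;
}

// Example: call this every N epochs from onEpochEnd.
console.log(formatMemoryUsage());
```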
Some things I tried:
- Calling GC() in the epoch callback did not resolve the issue.
- Running multiple iterations of model.fit, with a GC between iterations, did not resolve the issue.
- Running multiple iterations of model.fit by saving the model, releasing it, calling GC(), calling tf.disposeVariables(), reloading the model, and continuing training still did not fix the issue.
- All of the above, plus calling tf.setBackend(tf.getBackend()), FINALLY FIXED IT!
I now do a full flush of the system every 250 epochs. All of this significantly improved training performance and dropped my total memory usage to a nice low 200 MB - even lower than the starting conditions.
Going to refer this to @caisq in case this is because of a memory leak inside model.fit. We did have one that was fixed in https://github.com/tensorflow/tfjs-layers/pull/252, so a fix may be incoming.