Memory/device management on CPU and GPU
Great work with the DJL package: very nice handling and great performance!
Description
I have a machine with a GPU, but I want to train a model on the CPU only. Setting the devices to CPU works in general, and everything is computed on the CPU. However, GPU memory is still allocated, which I cannot avoid. A small code example, taken from a loop over the batches, is below:
NDList data = batch.getData();
NDList label = batch.getLabels();
Device[] a = trainer.getDevices();
Device b = data.head().getDevice();
Device c = label.head().getDevice();
Device d = trainer.getModel().getNDManager().getDevice();
System.out.println("length a:" + a.length);
System.out.println("a:" + a[0]);
System.out.println("b:" + b);
System.out.println("c:" + c);
System.out.println("d:" + d);
// In this forward step, around 1 GB of GPU RAM is allocated
NDList forw = trainer.forward(data, label);
Expected Behavior
The GPU is not used.
Error Message
There is no error message, but GPU RAM is allocated nonetheless.
What have you tried to solve it?
Set every possible device I could find to CPU with Device.cpu()
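For reference, pinning training to the CPU is typically done through the training configuration. A minimal sketch, assuming the DJL 0.10 API (`DefaultTrainingConfig.optDevices` and `Model.newInstance` with a device argument); the loss function and model name here are placeholders, not from this issue:

```
// Sketch only: requires the DJL dependency on the classpath.
DefaultTrainingConfig config =
        new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                .optDevices(new Device[] {Device.cpu()});

try (Model model = Model.newInstance("mlp", Device.cpu());
     Trainer trainer = model.newTrainer(config)) {
    // NDArrays created by the trainer's manager should now live on the CPU.
    // Note: the MXNet engine may still initialize a CUDA context at load
    // time, which is the GPU allocation observed in this issue.
}
```

Even with all devices set this way, the native engine can reserve GPU memory when it initializes, which matches the behavior described above.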
Environment Info
djl 0.10.0
length a:1
a:cpu()
b:cpu()
c:cpu()
d:cpu()
Thank you and best wishes Thomas
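One engine-level workaround, not mentioned in this issue but commonly used with CUDA-based engines such as MXNet, is to hide the GPU from the process entirely: the CUDA driver honors `CUDA_VISIBLE_DEVICES` at initialization, so the engine cannot allocate GPU memory. The jar name below is a placeholder:

```shell
# Hide all CUDA devices from this process and its children; the CUDA
# driver reads CUDA_VISIBLE_DEVICES when it initializes, so an empty
# value prevents any GPU memory allocation.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
# java -jar my-djl-training-app.jar   # placeholder launch command
```

This must be set before the JVM starts, since the native engine loads the CUDA driver during engine initialization.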
Issue Analytics
- Created: 3 years ago
- Comments: 14 (7 by maintainers)
Top Results From Across the Web
- Introducing Low-Level GPU Virtual Memory Management: CUDA 10.2 introduces a new set of API functions for virtual memory management that enable you to build more efficient dynamic data ...
- Unified Memory: The Final Piece of the GPU Programming ...: Support for unified memory across CPUs and GPUs in accelerated computing ... Unified memory has a profound impact on data management for GPU ...
- Analyzing memory management methods on integrated CPU ...: Memory on CPU/GPU systems is typically managed by a software framework such as OpenCL or CUDA, which includes a runtime library, and ...
- PyTorch 101, Part 4: Memory Management and Using Multiple ...: This article covers PyTorch's advanced GPU management features, how to optimize memory usage, and best practices for debugging memory errors.
- Analyzing Memory Management Methods on ... - People: copying data between the CPU and the GPU, arranging transparent memory sharing between the two devices can carry large overheads. Memory on CPU/GPU ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Ouh, yeah, I am sorry 😃 The example case might be too simple; I created it to show the original error. I think the MXNet model I load produces no non-zero gradients; with an MLP created in DJL it was actually training, if I remember correctly.
@ThomasZiegenhein So the problem seems to come from the model loading. The loaded model needs to have a layer whose parameters have gradients attached. In your case, if we look into the lossValue, the property hasGradient() evaluates to false. That is because the model you imported doesn't have parameters with gradients attached, so there is nothing to learn from. @frankfliu can add more to it if you feel it can be explained better.
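The maintainers' diagnosis can be checked directly. A hedged sketch, assuming the DJL 0.10 API (`Trainer.newGradientCollector`, `NDArray.hasGradient`); the `batch` variable is the one from the loop in the original report:

```
// Sketch only: requires the DJL dependency; mirrors the diagnosis above.
try (GradientCollector collector = trainer.newGradientCollector()) {
    NDList preds = trainer.forward(batch.getData(), batch.getLabels());
    NDArray lossValue =
            trainer.getLoss().evaluate(batch.getLabels(), preds);
    // If this prints false, the imported model has no parameters with
    // gradients attached, and training cannot update anything.
    System.out.println("hasGradient: " + lossValue.hasGradient());
}
```

If `hasGradient()` is false for an imported model, the fix is to attach gradients to (or re-create) the trainable parameters rather than to change devices.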