
Memory/Device management: CPU and GPU

See original GitHub issue

Great work with the djl package, very nice handling, and great performance!

Description

I have a machine with a GPU, but I want to use only the CPU for training a model. Setting the devices to CPU generally works, and everything is computed on the CPU. However, GPU memory is still used, and I cannot avoid it. A small code example, taken from a loop over the batches, is below:

    NDList data = batch.getData();
    NDList label = batch.getLabels();
    Device[] a = trainer.getDevices();
    Device b = data.head().getDevice();
    Device c = label.head().getDevice();
    Device d = trainer.getModel().getNDManager().getDevice();
    System.out.println("length a:" + a.length);
    System.out.println("a:" + a[0]);
    System.out.println("b:" + b);
    System.out.println("c:" + c);
    System.out.println("d:" + d);
    // The forward step below allocates around 1 GB of GPU RAM
    NDList forw = trainer.forward(data, label);
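
For reference, a minimal sketch of pinning the model, manager, and trainer to the CPU, assuming the DJL 0.10.x API (the class name and the identity block are placeholders, not the poster's actual model):

    import ai.djl.Device;
    import ai.djl.Model;
    import ai.djl.nn.Blocks;
    import ai.djl.training.DefaultTrainingConfig;
    import ai.djl.training.Trainer;
    import ai.djl.training.loss.Loss;

    public final class CpuOnlyTraining {
        public static void main(String[] args) {
            // Create the model directly on the CPU device.
            try (Model model = Model.newInstance("example", Device.cpu())) {
                model.setBlock(Blocks.identityBlock()); // placeholder block
                // Restrict the trainer to the CPU as well.
                DefaultTrainingConfig config =
                        new DefaultTrainingConfig(Loss.l2Loss())
                                .optDevices(new Device[] {Device.cpu()});
                try (Trainer trainer = model.newTrainer(config)) {
                    System.out.println(trainer.getDevices()[0]); // expected: cpu()
                }
            }
        }
    }

Even with this configuration, the report below shows that the native engine can still touch GPU memory, which is the behavior at issue.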

Expected Behavior

The GPU is not used.

Error Message

Something is allocated in GPU RAM.

What have you tried to solve it?

Set every possible device I could find to CPU with Device.cpu()
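
One hedged workaround, given that setting devices alone did not help here, is to hide the GPU from the process entirely: either depend on the CPU-only MXNet native artifact instead of the CUDA build, or launch the JVM with no GPUs visible to CUDA (e.g. CUDA_VISIBLE_DEVICES="" java ...). A small sketch to verify what the engine can see, assuming the DJL 0.10.x Engine API:

    import ai.djl.engine.Engine;

    public final class GpuVisibility {
        public static void main(String[] args) {
            // If the GPU is hidden from CUDA, the engine reports zero GPUs
            // and cannot allocate GPU memory behind the scenes.
            System.out.println("GPUs visible to engine: "
                    + Engine.getInstance().getGpuCount());
        }
    }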

Environment Info

djl 0.10.0


    length a:1
    a:cpu()
    b:cpu()
    c:cpu()
    d:cpu()

Thank you and best wishes, Thomas

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
ThomasZiegenhein commented, Apr 1, 2021

Ouh, yeah, I am sorry 😃 The example case might be too simple; I created it to show the original error. I think the example MXModel that is loaded in might produce no non-zero gradients; with an MLP created in DJL it was actually training, if I remember correctly.

0 reactions
aksrajvanshi commented, Apr 16, 2021

@ThomasZiegenhein So the problem seems to come from the model loading. The loaded model needs to have a layer whose params have gradients attached. In your case, if we look into the lossValue, the property hasGradient() evaluates to false. That is because the model you imported doesn't have params with gradients attached, so there is nothing to learn from.

@frankfliu can add more to it if you feel it can be explained better.
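
A minimal sketch of the check described above, assuming the DJL 0.10.x training API (GradientCheck and lossHasGradient are hypothetical names; data and labels come from a batch as in the original snippet):

    import ai.djl.ndarray.NDArray;
    import ai.djl.ndarray.NDList;
    import ai.djl.training.GradientCollector;
    import ai.djl.training.Trainer;

    public final class GradientCheck {
        // Returns true when the loss is connected to trainable parameters,
        // i.e. backward() will actually have something to update.
        public static boolean lossHasGradient(Trainer trainer, NDList data, NDList labels) {
            try (GradientCollector collector = trainer.newGradientCollector()) {
                NDList preds = trainer.forward(data);
                NDArray lossValue = trainer.getLoss().evaluate(labels, preds);
                return lossValue.hasGradient();
            }
        }
    }

For the imported model in this issue, this check returns false, matching the diagnosis above.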

Read more comments on GitHub >

Top Results From Across the Web

Introducing Low-Level GPU Virtual Memory Management
CUDA 10.2 introduces a new set of API functions for virtual memory management that enable you to build more efficient dynamic data ...

Unified Memory: The Final Piece Of The GPU Programming ...
Support for unified memory across CPUs and GPUs in accelerated computing ... Unified memory has a profound impact on data management for GPU...

Analyzing memory management methods on integrated CPU ...
Memory on CPU/GPU systems is typically managed by a software framework such as OpenCL or CUDA, which includes a runtime library, and ...

PyTorch 101, Part 4: Memory Management and Using Multiple ...
This article covers PyTorch's advanced GPU management features, how to optimise memory usage and best practises for debugging memory errors.

Analyzing Memory Management Methods on ... - People
copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices can carry large overheads. Memory on CPU/GPU...
