Memory/device management on CPU and GPU
Great work with the DJL package: very nice handling and great performance!
Description
I have a machine with a GPU, but I want to train a model on the CPU only. Setting the devices to CPU works in general, and everything is computed on the CPU. However, GPU memory is still allocated, which I cannot avoid. A small code example, taken from a loop over the batches, is below:
NDList data = batch.getData();
NDList label = batch.getLabels();
Device[] a = trainer.getDevices();
Device b = data.head().getDevice();
Device c = label.head().getDevice();
Device d = trainer.getModel().getNDManager().getDevice();
System.out.println("length a:" + a.length);
System.out.println("a:" + a[0]);
System.out.println("b:" + b);
System.out.println("c:" + c);
System.out.println("d:" + d);
// In this forward step, around 1 GB of GPU RAM is allocated
NDList forw = trainer.forward(data, label);
Expected Behavior
The GPU is not used.
Error Message
There is no error message, but GPU RAM is allocated nonetheless.
What have you tried to solve it?
Set every possible device I could find to CPU with Device.cpu()
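For reference, pinning training to the CPU is typically done through the training configuration. A minimal sketch, assuming the DJL 0.10 API (`DefaultTrainingConfig.optDevices` and `Model.newInstance` with a device argument); the loss function and model name here are placeholders, not from this issue:

```
// Sketch only: requires the DJL dependency on the classpath.
DefaultTrainingConfig config =
        new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                .optDevices(new Device[] {Device.cpu()});

try (Model model = Model.newInstance("mlp", Device.cpu());
     Trainer trainer = model.newTrainer(config)) {
    // NDArrays created by the trainer's manager should now live on the CPU.
    // Note: the MXNet engine may still initialize a CUDA context at load
    // time, which is the GPU allocation observed in this issue.
}
```

Even with all devices set this way, the native engine can reserve GPU memory when it initializes, which matches the behavior described above.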
Environment Info
djl 0.10.0
length a:1
a:cpu()
b:cpu()
c:cpu()
d:cpu()
Thank you and best wishes Thomas
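One engine-level workaround, not mentioned in this issue but commonly used with CUDA-based engines such as MXNet, is to hide the GPU from the process entirely: the CUDA driver honors `CUDA_VISIBLE_DEVICES` at initialization, so the engine cannot allocate GPU memory. The jar name below is a placeholder:

```shell
# Hide all CUDA devices from this process and its children; the CUDA
# driver reads CUDA_VISIBLE_DEVICES when it initializes, so an empty
# value prevents any GPU memory allocation.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
# java -jar my-djl-training-app.jar   # placeholder launch command
```

This must be set before the JVM starts, since the native engine loads the CUDA driver during engine initialization.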
Issue Analytics
- Created: 3 years ago
- Comments: 14 (7 by maintainers)
Top Results From Across the Web
- Introducing Low-Level GPU Virtual Memory Management: CUDA 10.2 introduces a new set of API functions for virtual memory management that enable you to build more efficient dynamic data ...
- Unified Memory: The Final Piece of the GPU Programming ...: Support for unified memory across CPUs and GPUs in accelerated computing ... Unified memory has a profound impact on data management for GPU ...
- Analyzing memory management methods on integrated CPU ...: Memory on CPU/GPU systems is typically managed by a software framework such as OpenCL or CUDA, which includes a runtime library, and ...
- PyTorch 101, Part 4: Memory Management and Using Multiple ...: This article covers PyTorch's advanced GPU management features, how to optimize memory usage, and best practices for debugging memory errors.
- Analyzing Memory Management Methods on ... - People: copying data between the CPU and the GPU, arranging transparent memory sharing between the two devices can carry large overheads. Memory on CPU/GPU ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Ouh, yeah, I am sorry 😃 The example case might be too simple; I created it to show the original error. I think the MXNet model I load produces no non-zero gradients; with an MLP created in DJL it was actually training, if I remember correctly.
@ThomasZiegenhein So the problem seems to come from the model loading. The loaded model needs to have a layer whose parameters have gradients attached. In your case, if we look into the lossValue, the property hasGradient() evaluates to false. That is because the model you imported doesn't have parameters with gradients attached, so there is nothing to learn from. @frankfliu can add more to it if you feel it can be explained better.
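The maintainers' diagnosis can be checked directly. A hedged sketch, assuming the DJL 0.10 API (`Trainer.newGradientCollector`, `NDArray.hasGradient`); the `batch` variable is the one from the loop in the original report:

```
// Sketch only: requires the DJL dependency; mirrors the diagnosis above.
try (GradientCollector collector = trainer.newGradientCollector()) {
    NDList preds = trainer.forward(batch.getData(), batch.getLabels());
    NDArray lossValue =
            trainer.getLoss().evaluate(batch.getLabels(), preds);
    // If this prints false, the imported model has no parameters with
    // gradients attached, and training cannot update anything.
    System.out.println("hasGradient: " + lossValue.hasGradient());
}
```

If `hasGradient()` is false for an imported model, the fix is to attach gradients to (or re-create) the trainable parameters rather than to change devices.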