Training doesn't utilize GPU
I am training the model on a custom dataset on an AWS EC2 instance (p2.xlarge) with an NVIDIA Tesla K80 GPU. After launching the training script I see full CPU utilization but no GPU utilization, as measured by the output of $ watch -n0.1 nvidia-smi:
Sun Aug 11 23:04:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |     67MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4546      C   python3                                       56MiB |
+-----------------------------------------------------------------------------+
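For reference, one quick sanity check (a sketch, assuming the TensorFlow 1.x releases current at the time of this issue) is to ask TensorFlow itself whether it can see the GPU:

import tensorflow as tf
from tensorflow.python.client import device_lib

# CPU-only 'tensorflow' wheels report False here; 'tensorflow-gpu' reports True.
print(tf.test.is_built_with_cuda())
# True only if a CUDA device is visible and usable by this process.
print(tf.test.is_gpu_available())
# A working setup lists a '/device:GPU:0' entry alongside the CPU.
print(device_lib.list_local_devices())

If is_gpu_available() returns False while nvidia-smi shows the card, the installed TensorFlow build, rather than the driver, is the usual suspect.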
The EC2 instance runs Ubuntu 18.04 with nvidia-driver-430 installed. The config.json file:
{
    "model": {
        "min_input_size": 288,
        "max_input_size": 448,
        "anchors": [0,0, 58,58, 114,193, 116,73, 193,123, 210,270, 303,187, 341,282, 373,367],
        "labels": ["handgun"]
    },
    "train": {
        "train_image_folder": "/home/ubuntu/data/yolo3/handgun/images/",
        "train_annot_folder": "/home/ubuntu/data/yolo3/handgun/annotations/",
        "cache_name": "handgun_train.pkl",
        "train_times": 8,
        "batch_size": 16,
        "learning_rate": 1e-4,
        "nb_epochs": 100,
        "warmup_epochs": 3,
        "ignore_thresh": 0.5,
        "gpus": "0",
        "grid_scales": [1,1,1],
        "obj_scale": 5,
        "noobj_scale": 1,
        "xywh_scale": 1,
        "class_scale": 1,
        "tensorboard_dir": "logs",
        "saved_weights_name": "handgun.h5",
        "debug": true
    },
    "valid": {
        "valid_image_folder": "",
        "valid_annot_folder": "",
        "cache_name": "",
        "valid_times": 1
    }
}
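If it helps with debugging, device-placement logging would show where each op lands (a sketch, assuming the standalone Keras + TensorFlow 1.x stack this project uses; the session must be set before the model is built):

import tensorflow as tf
from keras import backend as K

# Log the device each op is assigned to; GPU-placed ops print '/device:GPU:0'.
config = tf.ConfigProto(log_device_placement=True)
K.set_session(tf.Session(config=config))
# ... build and train the model as usual; placement lines appear on stderr.

If every op logs a CPU placement, the build itself has no CUDA support.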
The output from the training script looks reasonable and the TensorBoard graphs look good (i.e., the loss curves drop steadily). My only concern is that I have misconfigured something and the GPU is not being used, in which case training will take much longer than it should.
Can anyone comment on what I may have done wrong? Thanks in advance for any comments or suggestions.
Maybe uninstalling tensorflow and installing tensorflow-gpu would help?
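A quick way to check which wheel is installed (a sketch; pkg_resources ships with setuptools, and the package names are as published on PyPI for the 1.x line):

import pkg_resources

# In the 1.x line, only the 'tensorflow-gpu' wheel ships CUDA-enabled kernels;
# having plain 'tensorflow' installed would explain the 0% GPU utilization.
for pkg in ("tensorflow", "tensorflow-gpu"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")

If only the plain tensorflow package shows up, uninstalling it and installing tensorflow-gpu (matched to the installed CUDA version) should make the K80 visible to training.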
Hi, @monocongo! I've faced the same problem. I can see your current driver version is 430. Try 410. It helped me. Good luck!