
Training doesn't utilize GPU

See original GitHub issue

I am training the model on a custom dataset on an AWS EC2 instance (p2.xlarge) with an NVIDIA Tesla K80 GPU. After launching the training script I see full CPU utilization but no GPU utilization, as measured by the output of $ watch -n0.1 nvidia-smi.

Sun Aug 11 23:04:01 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |     67MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4546      C   python3                                       56MiB |
+-----------------------------------------------------------------------------+

The EC2 instance is Ubuntu 18.04 with nvidia-driver-430 installed.
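
As a sanity check, it may be worth confirming that TensorFlow can see the GPU at all. Below is a minimal sketch, assuming the training script runs on a TensorFlow/Keras backend (as the python3 process in the nvidia-smi output suggests):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can use; a working setup should show a
# "/device:GPU:0" entry alongside the CPU.
print(device_lib.list_local_devices())

# True only if a CUDA-enabled GPU is present and usable by this build.
print(tf.test.is_gpu_available())

If no GPU device shows up here, the problem lies with the TensorFlow/CUDA installation rather than with the training configuration.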

The config.json file:

{
    "model" : {
        "min_input_size":       288,
        "max_input_size":       448,
        "anchors":              [0,0, 58,58, 114,193, 116,73, 193,123, 210,270, 303,187, 341,282, 373,367],
        "labels":               ["handgun"]
    },

    "train": {
        "train_image_folder":   "/home/ubuntu/data/yolo3/handgun/images/",
        "train_annot_folder":   "/home/ubuntu/data/yolo3/handgun/annotations/",
        "cache_name":           "handgun_train.pkl",

        "train_times":          8,
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epochs":            100,
        "warmup_epochs":        3,
        "ignore_thresh":        0.5,
        "gpus":                 "0",

        "grid_scales":          [1,1,1],
        "obj_scale":            5,
        "noobj_scale":          1,
        "xywh_scale":           1,
        "class_scale":          1,

        "tensorboard_dir":      "logs",
        "saved_weights_name":   "handgun.h5",
        "debug":                true
    },

    "valid": {
        "valid_image_folder":   "",
        "valid_annot_folder":   "",
        "cache_name":           "",

        "valid_times":          1
    }
}
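
Note that the "gpus" entry only has an effect if the training script maps it onto the visible CUDA devices. A hypothetical sketch of that common pattern (the variable names here are illustrative, not taken from the repository):

import os

# Expose only the GPUs listed in config["train"]["gpus"]; this must run
# before TensorFlow is imported for the first time.
gpus = "0"  # value from the "gpus" field above
os.environ["CUDA_VISIBLE_DEVICES"] = gpus

import tensorflow as tf  # TensorFlow will now see only GPU 0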

The output from the training script looks reasonable and the TensorBoard graphs look good (i.e. continuous drops in the loss curves). My only concern is that I haven't set something up correctly to utilize the GPU, so the training will likely take much longer than it should.

Can anyone comment as to what I may have done wrong? Thanks in advance for any comments or suggestions.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
andreasmarxer commented, Sep 26, 2019

Maybe uninstalling tensorflow and installing tensorflow-gpu would help?
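
For what it's worth, a quick way to tell whether the installed build is the CPU-only tensorflow package or the GPU-enabled tensorflow-gpu one (a minimal sketch, assuming TensorFlow 1.x, which was current at the time):

import tensorflow as tf

print(tf.__version__)
# False for the CPU-only "tensorflow" package, True for "tensorflow-gpu".
print(tf.test.is_built_with_cuda())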

1 reaction
ivankunyankin commented, Aug 13, 2019

Hi, @monocongo! I've faced the same problem. I can see that your current driver version is 430. Try 410; it helped me. Good luck!


Top Results From Across the Web

Why doesn't training RNNs use 100% of the GPU? - Quora
One symptom could be the CPU usage, since it is 100%, and that means the stream of data between the GPU and CPU...

Training not using GPU desipte having tensorflow-gpu #2336
Hello guys, I am doing custom object detection and instance segmentation using mask rcnn. Hardware: Nvidia Geforce 1050; Dell G3 15.

tensorflow.keras not utilizing gpu - python - Stack Overflow
The model is getting loaded to GPU. So, it is not related to your GPU utilization issue. It is possible that your train_gen...

Cannot utilize fully all GPUs during network training - MathWorks
Learn more about deep learning, gpu, parallel computing toolbox, ... I see that not all threads of the CPU are in use so...

[D] Why is GPU utilization so bad when training neural ...
Not really. In any case, GPU are not designed with low latency in mind. They use latency hiding techniques to mitigate that. So,...
