
Training doesn't utilize GPU

See original GitHub issue

I am training the model on a custom dataset on an AWS EC2 instance (p2.xlarge) with an NVIDIA Tesla K80 GPU. After launching the training script I see full CPU utilization but no GPU utilization, as measured by the output of $ watch -n0.1 nvidia-smi.

Sun Aug 11 23:04:01 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |     67MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4546      C   python3                                       56MiB |
+-----------------------------------------------------------------------------+

The EC2 instance is Ubuntu 18.04 with nvidia-driver-430 installed.
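
As a sanity check, it may be worth confirming that TensorFlow can see the GPU at all. Below is a minimal sketch, assuming the training script runs on a TensorFlow/Keras backend (as the python3 process in the nvidia-smi output suggests):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can use; a working setup should show a
# "/device:GPU:0" entry alongside the CPU.
print(device_lib.list_local_devices())

# True only if a CUDA-enabled GPU is present and usable by this build.
print(tf.test.is_gpu_available())

If no GPU device shows up here, the problem lies with the TensorFlow/CUDA installation rather than with the training configuration.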

The config.json file:

{
    "model" : {
        "min_input_size":       288,
        "max_input_size":       448,
        "anchors":              [0,0, 58,58, 114,193, 116,73, 193,123, 210,270, 303,187, 341,282, 373,367],
        "labels":               ["handgun"]
    },

    "train": {
        "train_image_folder":   "/home/ubuntu/data/yolo3/handgun/images/",
        "train_annot_folder":   "/home/ubuntu/data/yolo3/handgun/annotations/",
        "cache_name":           "handgun_train.pkl",

        "train_times":          8,
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epochs":            100,
        "warmup_epochs":        3,
        "ignore_thresh":        0.5,
        "gpus":                 "0",

        "grid_scales":          [1,1,1],
        "obj_scale":            5,
        "noobj_scale":          1,
        "xywh_scale":           1,
        "class_scale":          1,

        "tensorboard_dir":      "logs",
        "saved_weights_name":   "handgun.h5",
        "debug":                true
    },

    "valid": {
        "valid_image_folder":   "",
        "valid_annot_folder":   "",
        "cache_name":           "",

        "valid_times":          1
    }
}
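
Note that the "gpus" entry only has an effect if the training script maps it onto the visible CUDA devices. A hypothetical sketch of that common pattern (the variable names here are illustrative, not taken from the repository):

import os

# Expose only the GPUs listed in config["train"]["gpus"]; this must run
# before TensorFlow is imported for the first time.
gpus = "0"  # value from the "gpus" field above
os.environ["CUDA_VISIBLE_DEVICES"] = gpus

import tensorflow as tf  # TensorFlow will now see only GPU 0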

The output from the training script looks reasonable and the TensorBoard graphs look good (i.e. continuous drops in the loss curves). My only concern is that I haven't set something up correctly to utilize the GPU, so the training will likely take much longer than it should.

Can anyone comment as to what I may have done wrong? Thanks in advance for any comments or suggestions.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
andreasmarxer commented, Sep 26, 2019

Maybe uninstalling tensorflow and installing tensorflow-gpu would help?
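
For what it's worth, a quick way to tell whether the installed build is the CPU-only tensorflow package or the GPU-enabled tensorflow-gpu one (a minimal sketch, assuming TensorFlow 1.x, which was current at the time):

import tensorflow as tf

print(tf.__version__)
# False for the CPU-only "tensorflow" package, True for "tensorflow-gpu".
print(tf.test.is_built_with_cuda())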

1 reaction
ivankunyankin commented, Aug 13, 2019

Hi, @monocongo! I've faced the same problem. I can see that your current driver version is 430. Try 410; it helped me. Good luck!


Top Results From Across the Web

Why doesn't training RNNs use 100% of the GPU? - Quora
One symptom could be the CPU usage, since it is 100%, and that means the stream of data between the GPU and CPU...

Training not using GPU desipte having tensorflow-gpu #2336
Hello guys, I am doing custom object detection and instance segmentation using mask rcnn. Hardware: Nvidia Geforce 1050; Dell G3 15.

tensorflow.keras not utilizing gpu - python - Stack Overflow
The model is getting loaded to GPU. So, it is not related to your GPU utilization issue. It is possible that your train_gen...

Cannot utilize fully all GPUs during network training - MathWorks
Learn more about deep learning, gpu, parallel computing toolbox, ... I see that not all threads of the CPU are in use so...

[D] Why is GPU utilization so bad when training neural ...
Not really. In any case, GPU are not designed with low latency in mind. They use latency hiding techniques to mitigate that. So,...
