CIFAR multi-gpu example not actually utilizing GPUs
I am trying to run the multi-GPU CIFAR-10 example on a YARN cluster with InfiniBand. I have followed the provided instructions and launch the job on Spark with the following command:
export NUM_GPU=2
export MEMORY=$((NUM_GPU * 11))
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
--executor-memory ${MEMORY}G \
--py-files ${TFoS_HOME}/tfspark.zip,cifar10.zip \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/cifar10/cifar10_multi_gpu_train.py \
--data_dir ${CIFAR10_DATA} \
--train_dir hdfs:///cifar10_train \
--max_steps 1000 \
--num_gpus ${NUM_GPU} \
--rdma \
--tensorboard
I have NVIDIA Tesla K80 GPUs installed on each node. Each node has two GPUs, and each GPU has approximately 11 GB of memory, which is how I calculate the executor-memory proxy for the GPUs on YARN (NUM_GPU * 11 GB).
To check whether (and how heavily) the GPUs are utilized on each node, I run the `nvidia-smi` command, but surprisingly none of the nodes show any GPU process running. The output of this command on one of the NodeManager nodes is:
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 367.48 Driver Version: 367.48 |
> |-------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
> |===============================+======================+======================|
> | 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
> | N/A 48C P0 60W / 149W | 0MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
> | N/A 36C P0 76W / 149W | 0MiB / 11439MiB | 100% Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes: GPU Memory |
> | GPU PID Type Process name Usage |
> |=============================================================================|
> | No running processes found |
> +-----------------------------------------------------------------------------+
When I run `ps aux` I see the relevant Python daemons running on each of the NodeManager nodes. The `top` command also shows Python consuming quite a lot of CPU cycles:
top - 01:19:04 up 89 days, 10:21, 1 user, load average: 3.07, 3.29, 2.56
Tasks: 555 total, 2 running, 553 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 0.1 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13173779+total, 10681584+free, 3698244 used, 21223708 buff/cache
KiB Swap: 3898308 total, 3835176 free, 63132 used. 12656761+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31404 javed.19 20 0 1425744 37624 6628 R 93.3 0.0 21:41.69 python
31362 javed.19 20 0 2184100 685188 32024 S 26.7 0.5 6:25.82 java
1 root 20 0 41368 3672 2260 S 0.0 0.0 2:42.49 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:02.13 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 146:11.03 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root rt 0 0 0 0 S 0.0 0.0 3:02.66 migration/0
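As an extra check (a rough sketch, assuming the executors use the same Python and the same CUDA library path as in the submit command above), one can ask TensorFlow directly which devices it sees on a NodeManager node:
# rough per-node check: list the devices TensorFlow can see; if no GPU device
# appears in the list, TensorFlow is silently falling back to CPU on that node
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"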
What do you guys think is happening here? Would really appreciate any help.
Top GitHub Comments
I was able to fix the issue. On my system, `echo $CUDA_VISIBLE_DEVICES` prints `NoDevFiles` for some reason. I have two GPUs installed, so setting `export CUDA_VISIBLE_DEVICES=0,1` fixes the issue and TensorFlow is able to use both of them.

Can you try removing the `--rdma` argument to see if it gets further? It looks like the RDMA support might not be properly compiled. If so, we will need help from @junshi15 (who's out this week).
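For anyone hitting the same problem on YARN: since the executors already receive LD_LIBRARY_PATH through spark.executorEnv in the submit command above, the same mechanism should also work for the visibility setting. A minimal, untested sketch (the value 0,1 assumes two GPUs per node, as in this setup):
# sketch: pass CUDA_VISIBLE_DEVICES to the YARN executors the same way
# LD_LIBRARY_PATH is passed above (spark.executorEnv.<VAR> is standard Spark config)
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.CUDA_VISIBLE_DEVICES=0,1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server" \
... (remaining arguments as in the original command)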