
CIFAR multi-gpu example not actually utilizing GPUs


I am trying to run the multi-GPU CIFAR example on a YARN cluster with InfiniBand. I have followed the steps as described in the instructions, and I launch the training on Spark using the following command:

export NUM_GPU=2
export MEMORY=$((NUM_GPU * 11))
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
--executor-memory ${MEMORY}G \
--py-files ${TFoS_HOME}/tfspark.zip,cifar10.zip \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/cifar10/cifar10_multi_gpu_train.py \
--data_dir ${CIFAR10_DATA} \
--train_dir hdfs:///cifar10_train \
--max_steps 1000 \
--num_gpus ${NUM_GPU} \
--rdma \
--tensorboard

I have NVIDIA Tesla K80 GPUs installed on each node: two GPUs per node, each with approximately 11 GB of memory, which is how I calculate the memory proxy for the GPUs for YARN (2 GPUs × 11 GB = 22 GB of executor memory).

In order to check how much (if at all) the GPUs are utilized on each node, I run the nvidia-smi command, but surprisingly none of the nodes shows any GPU process running. The output of this command on one of the NodeManager nodes is shown below (a sketch for polling every node follows the output):


> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
> | N/A   48C    P0    60W / 149W |      0MiB / 11439MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
> | N/A   36C    P0    76W / 149W |      0MiB / 11439MiB |    100%      Default |
> +-------------------------------+----------------------+----------------------+
>                                                                                
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
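For completeness, a minimal sketch of how the GPUs can be polled on every worker while the job runs; the host names are placeholders and passwordless ssh to the NodeManager nodes is assumed:

# Poll GPU utilization and memory on each worker node (hostnames are placeholders)
for node in nodemanager1 nodemanager2 nodemanager3; do
  echo "== ${node} =="
  ssh "${node}" nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader
done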

When I run ps aux, I see the relevant Python daemons running on each of the NodeManager nodes. The top command also shows Python consuming quite a lot of CPU cycles:

top - 01:19:04 up 89 days, 10:21,  1 user,  load average: 3.07, 3.29, 2.56
Tasks: 555 total,   2 running, 553 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  0.1 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13173779+total, 10681584+free,  3698244 used, 21223708 buff/cache
KiB Swap:  3898308 total,  3835176 free,    63132 used. 12656761+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                 
31404 javed.19  20   0 1425744  37624   6628 R  93.3  0.0  21:41.69 python                                                                  
31362 javed.19  20   0 2184100 685188  32024 S  26.7  0.5   6:25.82 java                                                                    
    1 root      20   0   41368   3672   2260 S   0.0  0.0   2:42.49 systemd                                                                 
    2 root      20   0       0      0      0 S   0.0  0.0   0:02.13 kthreadd                                                                
    3 root      20   0       0      0      0 S   0.0  0.0 146:11.03 ksoftirqd/0                                                             
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                            
    8 root      rt   0       0      0      0 S   0.0  0.0   3:02.66 migration/0  
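As an additional sanity check (a generic diagnostic, not something from the example itself), one can look at whether the busy Python worker even has the CUDA runtime mapped into its address space:

# Inspect the libraries loaded by the busy Python worker (PID 31404 from the top output above);
# if no CUDA libraries show up, the process is running on the CPU only.
lsof -p 31404 | grep -i cuda
cat /proc/31404/maps | grep -i cuda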

What do you guys think is happening here? Would really appreciate any help.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
HassebJ commented, May 17, 2017

I was able to fix the issue. On my system, echo $CUDA_VISIBLE_DEVICES returned NoDevFiles for some reason. I have two GPUs installed, so setting export CUDA_VISIBLE_DEVICES=0,1 fixes the issue and TensorFlow is able to use both of them.
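A minimal sketch of the workaround, assuming the two device indices are 0 and 1 as on a two-GPU K80 node; the spark.executorEnv line is an extra, unconfirmed step for reaching the YARN executors, not something stated in the comment:

# Make both GPUs visible (device indices assumed to be 0 and 1)
export CUDA_VISIBLE_DEVICES=0,1

# On YARN the variable may also need to reach the executors, e.g. via spark-submit:
#   --conf spark.executorEnv.CUDA_VISIBLE_DEVICES=0,1

# Quick check that TensorFlow 1.x now enumerates both GPUs
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"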

0 reactions
leewyang commented, May 16, 2017

Can you try removing the --rdma argument to see if it gets further? It looks like the RDMA support might not be compiled properly. If so, we will need help from @junshi15 (who’s out this week).
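For reference, this is simply the tail of the submit command above with the --rdma flag dropped; everything else stays the same:

${TFoS_HOME}/examples/cifar10/cifar10_multi_gpu_train.py \
--data_dir ${CIFAR10_DATA} \
--train_dir hdfs:///cifar10_train \
--max_steps 1000 \
--num_gpus ${NUM_GPU} \
--tensorboard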

