CIFAR multi-gpu example not actually utilizing GPUs
I am trying to run the multi-GPU CIFAR-10 example on a YARN cluster with InfiniBand. I have followed the provided instructions and launch the job on Spark with the following command:
export NUM_GPU=2
export MEMORY=$((NUM_GPU * 11))
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
--executor-memory ${MEMORY}G \
--py-files ${TFoS_HOME}/tfspark.zip,cifar10.zip \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/cifar10/cifar10_multi_gpu_train.py \
--data_dir ${CIFAR10_DATA} \
--train_dir hdfs:///cifar10_train \
--max_steps 1000 \
--num_gpus ${NUM_GPU} \
--rdma \
--tensorboard
I have NVIDIA Tesla K80 GPUs installed on each node. Each node has two GPUs, and each GPU has approximately 11 GB of memory, which is how I calculate the executor-memory proxy for the GPUs on YARN (NUM_GPU * 11 GB).
To check whether (and how heavily) the GPUs are utilized on each node, I run the `nvidia-smi` command, but surprisingly none of the nodes show any GPU process running. The output of this command on one of the NodeManager nodes is:
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 367.48 Driver Version: 367.48 |
> |-------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
> |===============================+======================+======================|
> | 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
> | N/A 48C P0 60W / 149W | 0MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
> | N/A 36C P0 76W / 149W | 0MiB / 11439MiB | 100% Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes: GPU Memory |
> | GPU PID Type Process name Usage |
> |=============================================================================|
> | No running processes found |
> +-----------------------------------------------------------------------------+
When I run `ps aux` I see the relevant Python daemons running on each of the NodeManager nodes. The `top` command also shows Python consuming quite a lot of CPU cycles:
top - 01:19:04 up 89 days, 10:21, 1 user, load average: 3.07, 3.29, 2.56
Tasks: 555 total, 2 running, 553 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 0.1 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13173779+total, 10681584+free, 3698244 used, 21223708 buff/cache
KiB Swap: 3898308 total, 3835176 free, 63132 used. 12656761+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31404 javed.19 20 0 1425744 37624 6628 R 93.3 0.0 21:41.69 python
31362 javed.19 20 0 2184100 685188 32024 S 26.7 0.5 6:25.82 java
1 root 20 0 41368 3672 2260 S 0.0 0.0 2:42.49 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:02.13 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 146:11.03 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root rt 0 0 0 0 S 0.0 0.0 3:02.66 migration/0
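As an extra check (a rough sketch, assuming the executors use the same Python and the same CUDA library path as in the submit command above), one can ask TensorFlow directly which devices it sees on a NodeManager node:
# rough per-node check: list the devices TensorFlow can see; if no GPU device
# appears in the list, TensorFlow is silently falling back to CPU on that node
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"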
What do you guys think is happening here? Would really appreciate any help.
Top GitHub Comments
I was able to fix the issue. On my system, `echo $CUDA_VISIBLE_DEVICES` prints `NoDevFiles` for some reason. I have two GPUs installed, so setting `export CUDA_VISIBLE_DEVICES=0,1` fixes the issue and TensorFlow is able to use both of them.

Can you try removing the `--rdma` argument to see if it gets further? It looks like the RDMA support might not be properly compiled. If so, we will need help from @junshi15 (who's out this week).
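For anyone hitting the same problem on YARN: since the executors already receive LD_LIBRARY_PATH through spark.executorEnv in the submit command above, the same mechanism should also work for the visibility setting. A minimal, untested sketch (the value 0,1 assumes two GPUs per node, as in this setup):
# sketch: pass CUDA_VISIBLE_DEVICES to the YARN executors the same way
# LD_LIBRARY_PATH is passed above (spark.executorEnv.<VAR> is standard Spark config)
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.CUDA_VISIBLE_DEVICES=0,1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server" \
... (remaining arguments as in the original command)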