
No GPU found in remote mode (fixed) & gpuIndices not working properly

See original GitHub issue

Describe the issue:

Environment:

  • NNI version: 2.5
  • Training service (local|remote|pai|aml|etc): remote
  • Server OS (for remote mode only): Ubuntu 18.04
  • Python version: 3.6.9
  • PyTorch/TensorFlow version: pytorch 1.6.0+cu101
  • Is running in Docker?: YES

Configuration:

  • Experiment config (remember to remove secrets!): mnist-pytorch
  • Search space:

Log message:

  • nnimanager.log:

    [2022-01-12 14:41:40] INFO (NNIManager) Trial job v6Q3x status changed from WAITING to RUNNING
    [2022-01-12 14:41:40] INFO (NNIManager) Trial job w1IS6 status changed from WAITING to RUNNING
    [2022-01-12 14:41:40] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
    (this TrialDispatcher line repeats roughly once per second through 14:41:56)

  • dispatcher.log:

    [2022-01-12 14:41:30] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
    [2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001629 seconds
    [2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
    (this tpe_transform / "TPE using 0 trials" pair appears eight times in total, all within the same second)
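The nnimanager.log loop above means the trial dispatcher never sees a free GPU on the remote machine. As a first sanity check, a minimal sketch (assuming the trial environment uses PyTorch, as the mnist-pytorch example does) to confirm the GPUs are visible inside the Docker container on the remote host:

# check_gpus.py - hypothetical helper, not part of the NNI example
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # indices here are CUDA device indices, i.e. zero-based
    print(f"  [{i}] {torch.cuda.get_device_name(i)}")

If this prints the expected devices, the drivers and container are fine and the problem sits in NNI's GPU scheduling instead.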

I am just using the provided example trial, mnist-pytorch. The only change is the config file. Here is my config.yml:

nniManagerIp: 172.17.0.18
experimentName: MNIST
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 2
trialConcurrency: 8
experimentWorkingDirectory: '/app/nni/experiments'
maxTrialNumber: 20
useAnnotation: false
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.10.30.25
      port: 8888
      user: root
      password: masike
      maxTrialNumberPerGpu: 4
      gpuIndices: '1,2'

The web UI keeps showing RUNNING and WAITING without any progress. GPUs no. 1 and 2 are not shown as being used, but I can see the python3 NNI processes through htop. I am planning to test how to use NNI for training on the server GPUs, but this issue happened. I also tried reinstalling NNI from the wheel. Any suggestions would be appreciated!
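Two details in the config above are worth double-checking; both points are assumptions drawn from NNI's configuration reference and should be verified against version 2.5. First, gpuIndices values are CUDA device indices and therefore zero-based, so '1,2' selects the second and third GPUs rather than the first two. Second, by default NNI will not schedule trials onto a GPU that already has another process running on it unless useActiveGpu is set. A sketch of the machineList entry under those assumptions:

trainingService:
  platform: remote
  machineList:
    - host: 10.10.30.25
      port: 8888
      user: root
      password: masike
      maxTrialNumberPerGpu: 4
      # zero-based CUDA indices: '0,1' would select the first two GPUs
      gpuIndices: '1,2'
      # allow scheduling onto GPUs that already run other processes
      useActiveGpu: true

Note also that with trialGpuNumber: 2 and only two GPUs exposed, every trial needs both GPUs at once, so the effective concurrency is bounded by maxTrialNumberPerGpu rather than by trialConcurrency: 8.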

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
bl6g6 commented, Jan 13, 2022

I restarted the server and it works now.

0 reactions
bl6g6 commented, Jan 24, 2022

With the config above, on a 4-GPU server, I ran the following two experiments:

  1. If all four GPUs are idle, each trial takes about 1000 MB on every GPU (each GPU has 11000 MB of memory). All trials eventually finish successfully.
  2. I ran an arbitrary program that fills up one of the GPUs; in that case several trials fail, the rest succeed, and my own program is not killed either.

The bug is that NNI can neither fully utilize the GPUs nor be restricted to the specified GPUs; I hope this can be fixed. My current Docker container was started from an image with docker run exposing two ports, and the environment was installed inside the container.
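The behavior described in this comment matches NNI's default scheduling policy: a GPU with any running compute process is treated as occupied, so trials placed onto it fail rather than wait. The sketch below illustrates that kind of occupancy check with pynvml; it is an illustration only, not NNI's actual implementation, which collects GPU metrics on the remote machine itself.

# gpu_occupancy.py - illustrative only, not NNI's real collector
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # a GPU with any running compute process counts as occupied
        status = "occupied" if procs else "free"
        print(f"GPU {i}: {status}, "
              f"{mem.used // 1024**2} MiB / {mem.total // 1024**2} MiB used")
finally:
    pynvml.nvmlShutdown()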

Read more comments on GitHub.
