No GPU found in remote mode (fixed) & gpuIndices does not work properly
Describe the issue:
Environment:
- NNI version: 2.5
- Training service (local|remote|pai|aml|etc): remote
- Server OS (for remote mode only): Ubuntu 18.04
- Python version: 3.6.9
- PyTorch/TensorFlow version: PyTorch 1.6.0+cu101
- Is running in Docker?: YES
Configuration:
- Experiment config (remember to remove secrets!): Mnist-pytorch
- Search space:
Log message:
- nnimanager.log:
[2022-01-12 14:41:40] INFO (NNIManager) Trial job v6Q3x status changed from WAITING to RUNNING
[2022-01-12 14:41:40] INFO (NNIManager) Trial job w1IS6 status changed from WAITING to RUNNING
[2022-01-12 14:41:40] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:41] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:42] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:43] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:44] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:45] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:45] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:46] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:47] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:48] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:49] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:50] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:51] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:51] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:52] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:53] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:55] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:55] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
[2022-01-12 14:41:56] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.
- dispatcher.log:
[2022-01-12 14:41:30] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001629 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001548 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001542 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001561 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001513 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001584 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001514 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001543 seconds
[2022-01-12 14:41:31] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
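For anyone reading logs like these, the repetition makes it hard to see what is actually happening. A minimal sketch of collapsing repeated messages follows; it assumes only the `[timestamp] LEVEL (source) message` shape visible above, and the names `LOG_LINE` and `summarize` are my own, not part of NNI.

```python
import re
from collections import Counter

# Matches lines shaped like: [2022-01-12 14:41:40] INFO (TrialDispatcher) message
LOG_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)\s+\((?P<source>[^)]+)\)\s+(?P<msg>.*)"
)


def summarize(lines):
    """Count how often each (source, message) pair occurs, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match.group("source"), match.group("msg"))] += 1
    return counts


# Three lines copied from the nnimanager.log excerpt above.
sample = [
    "[2022-01-12 14:41:40] INFO (NNIManager) Trial job v6Q3x status changed from WAITING to RUNNING",
    "[2022-01-12 14:41:40] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.",
    "[2022-01-12 14:41:41] INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.",
]

summary = summarize(sample)
for (source, msg), n in summary.most_common():
    print(f"{n:3d}x ({source}) {msg}")
```

To run it on a real file, pass an open file handle instead of `sample`, for example `summarize(open("nnimanager.log"))`.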
I am just using the provided example trial, mnist-pytorch. The only change is the config file. Here is my config.yml:
nniManagerIp: 172.17.0.18
experimentName: MNIST
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 2
trialConcurrency: 8
experimentWorkingDirectory: '/app/nni/experiments'
maxTrialNumber: 20
useAnnotation: false
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.10.30.25
      port: 8888
      user: root
      password: masike
      maxTrialNumberPerGpu: 4
      gpuIndices: '1,2'
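Before starting the experiment, the GPU-related values in a config like this can be sanity-checked by hand. The sketch below is only an illustration of that check, not NNI's own validation or scheduling logic. The helper names are hypothetical, and the GPU count of 4 is an assumption taken from a commenter's description of a 4-GPU server.

```python
def parse_gpu_indices(value):
    """Turn '1,2', [1, 2] or 1 into a sorted list of unique ints."""
    if isinstance(value, int):
        return [value]
    if isinstance(value, str):
        value = [part for part in value.split(",") if part.strip()]
    return sorted({int(v) for v in value})


def check_gpu_settings(trial_gpu_number, gpu_indices, physical_gpu_count):
    """Return a list of human-readable problems; an empty list means nothing obvious."""
    problems = []
    indices = parse_gpu_indices(gpu_indices)

    # Every requested index should exist on the machine (indices start at 0).
    out_of_range = [i for i in indices if not 0 <= i < physical_gpu_count]
    if out_of_range:
        problems.append(
            f"gpuIndices {out_of_range} not present on a {physical_gpu_count}-GPU machine"
        )

    # A single trial cannot get more GPUs than the machine exposes to NNI.
    if trial_gpu_number > len(indices):
        problems.append(
            f"trialGpuNumber={trial_gpu_number} exceeds the {len(indices)} GPU(s) in gpuIndices"
        )
    return problems


# Values from the config above; physical_gpu_count=4 is an assumption.
problems = check_gpu_settings(trial_gpu_number=2, gpu_indices="1,2", physical_gpu_count=4)
print(problems or "no obvious mismatch")
```

With these values the check finds no mismatch, which is consistent with the problem lying elsewhere (the server state, as it turned out) rather than in the config itself.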
The web UI keeps showing trials as running and waiting, without any progress.
GPUs 1 and 2 do not appear to be in use, but I can see the python3 NNI process in htop.
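One way to confirm which GPUs are actually busy on the remote machine is to query `nvidia-smi` and compare against the indices NNI was told to use. The following is a rough sketch under stated assumptions: the function names and the 50 MiB idle threshold are my own choices, and the sample text is made up for illustration, not output from this server. It avoids `capture_output` so it also runs on Python 3.6.

```python
import subprocess

# Standard nvidia-smi query flags; the output has one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """nvidia-smi may print a non-numeric placeholder for unsupported fields."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_usage(csv_text):
    """Map GPU index -> utilization (%) and used memory (MiB)."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        usage[int(index)] = {"util": _to_int(util), "mem_mib": _to_int(mem)}
    return usage


def read_gpu_usage():
    """Run nvidia-smi on this machine (not called below, so the sketch runs anywhere)."""
    result = subprocess.run(
        QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return parse_gpu_usage(result.stdout)


def idle_gpus(usage, expected, mem_threshold_mib=50):
    """Return the expected indices that show no load (threshold is an arbitrary choice)."""
    idle = []
    for i in expected:
        stats = usage.get(i)
        if stats is None or (
            (stats["util"] or 0) == 0 and (stats["mem_mib"] or 0) < mem_threshold_mib
        ):
            idle.append(i)
    return idle


# Made-up sample output for a 4-GPU machine, for illustration only.
sample = "0, 35, 4021\n1, 0, 3\n2, 0, 3\n3, 12, 1500\n"
usage = parse_gpu_usage(sample)
print(idle_gpus(usage, expected=[1, 2]))
```

On the real server, replace `parse_gpu_usage(sample)` with `read_gpu_usage()`. If the GPUs listed in `gpuIndices` come back idle while trials are marked as running, that matches the symptom described here.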
I am planning to test how to use NNI to train on server GPUs, but ran into this issue. I also tried reinstalling NNI from the wheel.
Any suggestions would be appreciated!
Issue Analytics
- State:
- Created 2 years ago
- Comments:11 (4 by maintainers)
Top GitHub Comments
I restarted the server and it works.
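With the config above, on a 4-GPU server, I ran the following two experiments:
The bug is that NNI can neither fully occupy the GPUs nor be restricted to the specified GPUs; I hope this can be fixed. My current Docker container was created with docker run from an image, with two ports mapped, and the environment was then installed inside the container.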