NNI remote mode is not working as expected
I have run into some strange behavior of NNI in remote mode. In brief, what happened to me is the same as what @guoxiaojie-schinper reported. I am running the demo from the nni repo, /example/trial/mnist-pytorch.
If I run config_remote.yml locally on the remote machine (with the trainingService changed to local, of course), everything is normal. But if the same config_remote.yml is run on my local machine (a MacBook Pro), with the worker being a workstation with an Nvidia GeForce 2080 GPU, it does not work, exactly as @guoxiaojie-schinper described.
In detail:
Environment: NNI on both the local and the remote machine was installed with python3 -m pip install --upgrade nni in a conda environment.
config_remote.yml (if used in remote mode):
```yaml
searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.251
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8
```
config_remote.yml (if used in local mode):
```yaml
searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.230
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
  maxTrialNumberPerGpu: 8
```
Description:
- If I run the code and config (local mode, using the second yml file) directly on the remote machine, everything behaves as expected. The number of tasks on the GPU matches trialConcurrency, the GPU is actually used by NNI, and the trial output speed is as expected.
- If I run the code and config (remote mode, using the first yml file) on my local machine (a MacBook Pro with the latest OS) connecting to the remote machine, some strange phenomena occur. I list them below; a commented sketch of the GPU-related settings being varied follows the list.
-> 2.1 If I set `trialGpuNumber = 1` and `trialCommand = python3 mnist.py`, the phenomenon is the same as @guoxiaojie-schinper's: all the tasks stay in the waiting status forever, and the NNIManager log shows `INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.` Using top and nvidia-smi on the remote machine confirms that the task is not actually running (low CPU usage and no related GPU process). The waiting status can persist for several hours. Whether or not nvidia-smi is added to the trialCommand (as in the commented-out trialCommand above), the phenomenon is the same.
-> 2.2 If I set `trialGpuNumber > 1`, NNI tells me this exceeds the limit and that none of the machines can provide that many GPUs. In fact my remote machine has only one GPU, so this behavior is reasonable.
-> 2.3 If I set `trialGpuNumber = 0` and `trialCommand = python3 mnist.py`, then whether inside or outside the docker, only one task runs and the other 3 keep waiting until the running one finishes, even though `trialConcurrency = 4`. Unlike the cases above, where all the tasks wait forever, in this condition the tasks run one by one, beyond the control of the trialConcurrency argument. The task runs on the CPU (it takes up to 4 minutes for this mnist demo from the nni repo to reach the full 800% usage of my 8-core i7 CPU). It just takes longer than on the GPU; it still runs instead of waiting forever. (I am not very confident about the following point, because I don't remember it exactly and the situation shows up only rarely:) sometimes something randomly uses the GPU in the first task or the first several tasks.
-> 2.4 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the trialCommand, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run outside the docker, the task starts using the GPU after about 4 minutes, which is much slower than the normal case in 2.5. I confirmed the GPU usage both with the nvidia-smi command on the remote machine (a related process shows up in nvidia-smi) and from the output speed of NNI. But it seems that the next task takes even longer before it starts using the GPU.
-> 2.5 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the trialCommand, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run inside the docker, the tasks consistently run on the GPU and behave normally, but still one by one.
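For clarity, the sketch below summarizes the GPU-related settings that scenarios 2.1-2.5 vary. The values are taken from the first yml file above; the comments are my own understanding of what each field controls and are not quoted from the NNI docs:

```yaml
trialGpuNumber: 0      # GPUs requested per trial: 1 in 2.1, >1 in 2.2, 0 in 2.3-2.5
trialConcurrency: 4    # how many trials NNI is supposed to run at the same time
trialCommand: nvidia-smi && which python3 && python3 mnist.py   # 2.4/2.5 variant; 2.1 and 2.3 use plain `python3 mnist.py`
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230           # other machine fields as in the first yml file
      useActiveGpu: true             # treat GPUs that already have running processes as usable
      maxTrialNumberPerGpu: 8        # cap on concurrent trials sharing one GPU
```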
_Originally posted by @OuYaozhong in https://github.com/microsoft/nni/issues/3905#issuecomment-893651120_
Top GitHub Comments
@acured Hi, thanks for your help. Your suggestion helped me solve the problem.
In fact, I found that the problem lies in `reuseMode`, not in `trialConcurrency` or `pythonPath`. If I set `reuseMode: false`, both the docker and the non-docker environment work normally, just like running in local mode: the tasks run simultaneously according to the `trialConcurrency` setting and start using the GPU quickly.
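For reference, the change amounts to something like the minimal sketch below (assuming the NNI v2 remote trainingService schema, where `reuseMode` sits at the trainingService level; the other fields are unchanged from the first yml file above):

```yaml
trainingService:
  platform: remote
  reuseMode: false      # with reuse enabled, the trials stayed in the waiting status forever
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8
```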
But it is strange: why is this related to `reuseMode`? Before I opened this issue, I had noticed this parameter but thought it was not related to the problem, because it seems to be meant to speed up remote training. Could you give some explanation?
Anyway, thanks for your help. That's enough for me.