NNI remote mode is not working as expected
I have run into some strange behavior of NNI in remote mode. In brief, what happened to me is the same as what @guoxiaojie-schinper reported. I am running the demo from the nni repo, /example/trial/mnist-pytorch.
If I run config_remote.yml locally on the remote machine (with the trainingService changed to local, of course), everything is normal. But if the same config_remote.yml is run on my local machine (a MacBook Pro), with the worker being a workstation with an Nvidia GeForce 2080 GPU, it does not work, exactly as @guoxiaojie-schinper described.
In detail:
Environment: NNI on both the local and the remote machine was installed with python3 -m pip install --upgrade nni in a conda environment.
config_remote.yml (if used in remote mode):
```yaml
searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.251
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8
```
config_remote.yml (if used in local mode):
```yaml
searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.230
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
  maxTrialNumberPerGpu: 8
```
Description:
- If I run the code and config (local mode, using the second yml file) directly on the remote machine, everything behaves as expected. The number of tasks on the GPU matches trialConcurrency, the GPU is actually used by NNI, and the trial output speed is as expected.
- If I run the code and config (remote mode, using the first yml file) on my local machine (a MacBook Pro with the latest OS) connecting to the remote machine, some strange phenomena occur. I list them below; a commented sketch of the GPU-related settings being varied follows the list.
-> 2.1 If I set `trialGpuNumber = 1` and `trialCommand = python3 mnist.py`, the phenomenon is the same as @guoxiaojie-schinper's: all the tasks stay in the waiting status forever, and the NNIManager log shows `INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one.` Using top and nvidia-smi on the remote machine confirms that the task is not actually running (low CPU usage and no related GPU process). The waiting status can persist for several hours. Whether or not nvidia-smi is added to the trialCommand (as in the commented-out trialCommand above), the phenomenon is the same.
-> 2.2 If I set `trialGpuNumber > 1`, NNI tells me this exceeds the limit and that none of the machines can provide that many GPUs. In fact my remote machine has only one GPU, so this behavior is reasonable.
-> 2.3 If I set `trialGpuNumber = 0` and `trialCommand = python3 mnist.py`, then whether inside or outside the docker, only one task runs and the other 3 keep waiting until the running one finishes, even though `trialConcurrency = 4`. Unlike the cases above, where all the tasks wait forever, in this condition the tasks run one by one, beyond the control of the trialConcurrency argument. The task runs on the CPU (it takes up to 4 minutes for this mnist demo from the nni repo to reach the full 800% usage of my 8-core i7 CPU). It just takes longer than on the GPU; it still runs instead of waiting forever. (I am not very confident about the following point, because I don't remember it exactly and the situation shows up only rarely:) sometimes something randomly uses the GPU in the first task or the first several tasks.
-> 2.4 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the trialCommand, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run outside the docker, the task starts using the GPU after about 4 minutes, which is much slower than the normal case in 2.5. I confirmed the GPU usage both with the nvidia-smi command on the remote machine (a related process shows up in nvidia-smi) and from the output speed of NNI. But it seems that the next task takes even longer before it starts using the GPU.
-> 2.5 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the trialCommand, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run inside the docker, the tasks consistently run on the GPU and behave normally, but still one by one.
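For clarity, the sketch below summarizes the GPU-related settings that scenarios 2.1-2.5 vary. The values are taken from the first yml file above; the comments are my own understanding of what each field controls and are not quoted from the NNI docs:

```yaml
trialGpuNumber: 0      # GPUs requested per trial: 1 in 2.1, >1 in 2.2, 0 in 2.3-2.5
trialConcurrency: 4    # how many trials NNI is supposed to run at the same time
trialCommand: nvidia-smi && which python3 && python3 mnist.py   # 2.4/2.5 variant; 2.1 and 2.3 use plain `python3 mnist.py`
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230           # other machine fields as in the first yml file
      useActiveGpu: true             # treat GPUs that already have running processes as usable
      maxTrialNumberPerGpu: 8        # cap on concurrent trials sharing one GPU
```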
_Originally posted by @OuYaozhong in https://github.com/microsoft/nni/issues/3905#issuecomment-893651120_
Top GitHub Comments
@acured Hi, thanks for your help. Your suggestion helped me solve the problem.
In fact, I found that the problem lies in `reuseMode`, not in `trialConcurrency` or `pythonPath`. If I set `reuseMode: false`, both the docker and the non-docker environment work normally, just like running in local mode: the tasks run simultaneously according to the `trialConcurrency` setting and start using the GPU quickly.
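For reference, the change amounts to something like the minimal sketch below (assuming the NNI v2 remote trainingService schema, where `reuseMode` sits at the trainingService level; the other fields are unchanged from the first yml file above):

```yaml
trainingService:
  platform: remote
  reuseMode: false      # with reuse enabled, the trials stayed in the waiting status forever
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8
```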
But it is strange: why is this related to `reuseMode`? Before I opened this issue, I had noticed this parameter but thought it was not related to the problem, because it seems to be meant to speed up remote training. Could you give some explanation?
Anyway, thanks for your help. That's enough for me.