NNI is infinitely waiting when running in remote mode

Discussed in https://github.com/microsoft/nni/discussions/4070

<div type='discussions-op-text'>

Originally posted by ZhiyuanChen on August 14, 2021

```
[2021-08-14 10:13:41] INFO (NNIDataStore) Datastore initialization done
[2021-08-14 10:13:41] INFO (RestServer) RestServer start
[2021-08-14 10:13:41] INFO (RestServer) RestServer base port is 8080
[2021-08-14 10:13:41] INFO (main) Rest server listening on: http://0.0.0.0:8080
[2021-08-14 10:13:42] INFO (NNIManager) Starting experiment: VBgChK3z
[2021-08-14 10:13:42] INFO (NNIManager) Setup training service...
[2021-08-14 10:13:42] INFO (TrialDispatcher) TrialDispatcher: GPU scheduler is enabled.
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine1
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine2
[2021-08-14 10:13:42] INFO (NNIManager) Setup tuner...
[2021-08-14 10:13:42] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2021-08-14 10:13:42] INFO (NNIManager) Add event listeners
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: ID, 
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}
[2021-08-14 10:13:44] INFO (RemoteEnvironmentService) ssh connection initialized!
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: started channel: WebCommandChannel
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: copying code and settings.
[2021-08-14 10:13:44] INFO (TrialDispatcher) Initialize environments total number: 2
[2021-08-14 10:13:44] INFO (TrialDispatcher) Assign environment service remote to environment fzYKh
[2021-08-14 10:13:45] INFO (TrialDispatcher) requested environment fzYKh and job id is nni_exp_VBgChK3z_env_fzYKh.
[2021-08-14 10:13:45] INFO (TrialDispatcher) Assign environment service remote to environment D6h8H
[2021-08-14 10:13:46] INFO (TrialDispatcher) requested environment D6h8H and job id is nni_exp_VBgChK3z_env_D6h8H.
[2021-08-14 10:13:46] INFO (TrialDispatcher) TrialDispatcher: run loop started.
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 0,
  hyperParameters: {
    value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 1,
  hyperParameters: {
    value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
```

</div>

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

ZhiyuanChen commented, Aug 16, 2021 (4 reactions)

After digging into the log on the remote machine, I found that this happens because the port on the master machine is not open. Consider raising an error in such a case.

ZhiyuanChen commented, Aug 20, 2021 (2 reactions)

More precisely, this happens because NNI requires an additional port (8081 in my case) for WebSocket communication.

For training services other than local, yes, one more port is needed.

Maybe check whether the port is open first and raise a SystemError?
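
The check suggested above could look roughly like the following sketch, meant to be run from a remote machine before starting the experiment. It is not NNI code: `PortUnreachableError`, `check_port_open`, and `check_master_ports` are hypothetical names, and the assumption that the extra WebSocket port is the base port plus one comes only from this report (base port 8080, extra port 8081).

```python
import socket


class PortUnreachableError(RuntimeError):
    """Hypothetical error type for this sketch; not part of NNI."""


def check_port_open(host: str, port: int, timeout: float = 3.0) -> None:
    """Raise PortUnreachableError if a TCP connection to host:port fails."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        raise PortUnreachableError(
            f"Cannot reach {host}:{port} ({exc}); "
            "is the port open on the master machine?"
        ) from exc


def check_master_ports(host: str, base_port: int = 8080) -> None:
    """Check the REST port and the extra port used for WebSocket traffic.

    Assumption taken from this report only: the extra port is base_port + 1.
    """
    for port in (base_port, base_port + 1):
        check_port_open(host, port)
```

Calling `check_master_ports("master-host", 8080)` on each remote machine (with `master-host` replaced by the real address of the master) would fail fast with a readable message instead of leaving the experiment waiting.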
