NNI is infinitely waiting when running in remote mode
Discussed in https://github.com/microsoft/nni/discussions/4070
<div type='discussions-op-text'>Originally posted by ZhiyuanChen August 14, 2021

```
[2021-08-14 10:13:41] INFO (NNIDataStore) Datastore initialization done
[2021-08-14 10:13:41] INFO (RestServer) RestServer start
[2021-08-14 10:13:41] INFO (RestServer) RestServer base port is 8080
[2021-08-14 10:13:41] INFO (main) Rest server listening on: http://0.0.0.0:8080
[2021-08-14 10:13:42] INFO (NNIManager) Starting experiment: VBgChK3z
[2021-08-14 10:13:42] INFO (NNIManager) Setup training service...
[2021-08-14 10:13:42] INFO (TrialDispatcher) TrialDispatcher: GPU scheduler is enabled.
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine1
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine2
[2021-08-14 10:13:42] INFO (NNIManager) Setup tuner...
[2021-08-14 10:13:42] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2021-08-14 10:13:42] INFO (NNIManager) Add event listeners
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}
[2021-08-14 10:13:44] INFO (RemoteEnvironmentService) ssh connection initialized!
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: started channel: WebCommandChannel
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: copying code and settings.
[2021-08-14 10:13:44] INFO (TrialDispatcher) Initialize environments total number: 2
[2021-08-14 10:13:44] INFO (TrialDispatcher) Assign environment service remote to environment fzYKh
[2021-08-14 10:13:45] INFO (TrialDispatcher) requested environment fzYKh and job id is nni_exp_VBgChK3z_env_fzYKh.
[2021-08-14 10:13:45] INFO (TrialDispatcher) Assign environment service remote to environment D6h8H
[2021-08-14 10:13:46] INFO (TrialDispatcher) requested environment D6h8H and job id is nni_exp_VBgChK3z_env_D6h8H.
[2021-08-14 10:13:46] INFO (TrialDispatcher) TrialDispatcher: run loop started.
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 0,
  hyperParameters: {
    value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 1,
  hyperParameters: {
    value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
```
</div>
Top GitHub Comments
After digging into the log on the remote machine, I found that this happens because the port on the master machine is not open. Consider raising an error in that case.
Maybe check whether the port is open first and raise a SystemError?
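For anyone hitting the same hang, here is a rough sketch of the kind of check suggested above. It is not what NNI does internally; `ensure_port_reachable` is a hypothetical helper, and it raises `ConnectionError` rather than `SystemError` because Python reserves the latter for interpreter-internal errors. The idea is to run it from the remote machine against the master's REST port (8080 in the log above) before waiting on trials.

```python
import socket


def ensure_port_reachable(host: str, port: int, timeout: float = 3.0) -> None:
    """Raise ConnectionError if a TCP connection to host:port cannot be opened.

    Hypothetical helper for illustration only; not part of NNI.
    """
    try:
        # create_connection completes the TCP handshake, so success means
        # something is listening and nothing on the path blocked the connection.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable hosts.
        raise ConnectionError(
            f"Cannot reach {host}:{port}; is the port open on the master machine?"
        ) from exc
```

Calling something like `ensure_port_reachable("<master-ip>", 8080)` on each remote machine would fail fast with a readable message instead of leaving the experiment waiting forever. The host placeholder and the 3-second default timeout are assumptions to adjust for your setup.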