[CLI]: Can't find port file when using wandb.require("service")
See original GitHub issueDescribe the bug
When running the code snippet below using WanDB + PyTorchLightning on SLURM. The code crashes randomly for most if not all runs.
logger = WandbLogger(project="MNIST")
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
wi.setup(kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
self._wl = wandb_setup.setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
ret = _setup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
wl = _WandbSetup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
self._setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
self._setup_manager()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
self._manager = wandb_manager._Manager(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
self._service.start()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
self._launch_server()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
assert ports_found
AssertionError
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
wi.setup(kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
self._wl = wandb_setup.setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
ret = _setup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
wl = _WandbSetup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
self._setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
self._setup_manager()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
self._manager = wandb_manager._Manager(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
self._service.start()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
self._launch_server()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
assert ports_found
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home-mscluster/mfokam/assa/scripts/pretrain_eval.py", line 140, in train_eval
logger = WandbLogger(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 311, in __init__
_ = self.experiment
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 357, in experiment
self._experiment = wandb.init(**self._wandb_init)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1037, in init
raise Exception("problem") from error_seen
Exception: problem
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/__main__.py", line 3, in <module>
cli.cli(prog_name="python -m wandb")
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 96, in wrapper
return func(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 285, in service
server.serve()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 128, in serve
self._inform_used_ports(grpc_port=grpc_port, sock_port=sock_port)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 65, in _inform_used_ports
pf.write(self._port_fname)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/port_file.py", line 25, in write
f = tempfile.NamedTemporaryFile(prefix=bname, dir=dname, mode="w", delete=False)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 540, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mfokam/tmpel7x5eip/port-21750.txti_81fekt'
Additional Files
No response
Environment
WandB version: 0.12.20
OS: Ubuntu 18.04.6 LTS
Python version: 3.8
Versions of relevant libraries: PyTorch: 1.12.0 PyTorch Lightning: 1.6.4
Additional Context
- The code seems to crash when executed on a very slow cluster node
- Tried to reproduce the error locally but I can only get a similar error if I insert a breakpoint point on the tempfile.py method executed (see last 6 lines of the stack trace) and I wait. After a certain period (3 - 5 seconds), I get an error similar to what I have on SLURM.
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:15 (1 by maintainers)
Top Results From Across the Web
[CLI]: Can't find port file when using wandb.require("service")
This error occurs when wandb.init is called. Therefore, the wandb folder is not yet created. In my case, I can confirm that this...
Read more >Wandb fails at init (assert ports_found) - W&B Help
Potentially a duplicate of [CLI]: Can't find port file when using wandb.require("service") · Issue #3911 · wandb/wandb · GitHub
Read more >FAQ - Documentation - Weights & Biases - WandB
Frequently asked questions about setting up privately-hosted versions of our app.
Read more >Command Line Interface - Documentation - Weights & Biases
Command Line Interface ; enabled. Enable W&B. ; init. Configure a directory with Weights & Biases ; launch. Launch or queue a job...
Read more >Troubleshooting - Documentation - Weights & Biases - WandB
3. Try running W&B Private Hosting, which operates on your machine and doesn't sync files to our cloud servers.
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found

Hi, @anmolmann @ArnolFokam , I had the same problem when I ran the code on a very slow cluster node. I found out that the reason for the error is because of the magic number 30 in the code below https://github.com/wandb/wandb/blob/master/wandb/sdk/service/service.py#L41
def _wait_for_ports(self, fname: str, proc: subprocess.Popen = None) -> bool:time_max = time.time() + 30When I increase the max waiting time from 30 to 300, it works.Thanks, the main reason is my network, may the cluster administrator block the connection to wandb service. now my solution is training on the cluster with offline mode, then switching o my PC to upload log files.