question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[CLI]: Can't find port file when using wandb.require("service")

See original GitHub issue

Describe the bug

When running the code snippet below using WanDB + PyTorchLightning on SLURM. The code crashes randomly for most if not all runs.

logger = WandbLogger(project="MNIST")

Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
    wi.setup(kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
    self._wl = wandb_setup.setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
    ret = _setup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
    wl = _WandbSetup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
    self._setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
    self._setup_manager()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
    self._service.start()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
    self._launch_server()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
    assert ports_found
AssertionError
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
    wi.setup(kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
    self._wl = wandb_setup.setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
    ret = _setup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
    wl = _WandbSetup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
    self._setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
    self._setup_manager()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
    self._service.start()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
    self._launch_server()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
    assert ports_found
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home-mscluster/mfokam/assa/scripts/pretrain_eval.py", line 140, in train_eval
    logger = WandbLogger(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 311, in __init__
    _ = self.experiment
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
    return get_experiment() or DummyExperiment()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
    return fn(self)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 357, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1037, in init
    raise Exception("problem") from error_seen
Exception: problem
Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/__main__.py", line 3, in <module>
    cli.cli(prog_name="python -m wandb")
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 285, in service
    server.serve()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 128, in serve
    self._inform_used_ports(grpc_port=grpc_port, sock_port=sock_port)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 65, in _inform_used_ports
    pf.write(self._port_fname)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/port_file.py", line 25, in write
    f = tempfile.NamedTemporaryFile(prefix=bname, dir=dname, mode="w", delete=False)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 540, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mfokam/tmpel7x5eip/port-21750.txti_81fekt'

Additional Files

No response

Environment

WandB version: 0.12.20

OS: Ubuntu 18.04.6 LTS

Python version: 3.8

Versions of relevant libraries: PyTorch: 1.12.0 PyTorch Lightning: 1.6.4

Additional Context

  • The code seems to crash when executed on a very slow cluster node
  • Tried to reproduce the error locally but I can only get a similar error if I insert a breakpoint point on the tempfile.py method executed (see last 6 lines of the stack trace) and I wait. After a certain period (3 - 5 seconds), I get an error similar to what I have on SLURM.

Issue Analytics

  • State:closed
  • Created a year ago
  • Reactions:1
  • Comments:15 (1 by maintainers)

github_iconTop GitHub Comments

8reactions
ShenghaiRongcommented, Aug 4, 2022

Hi, @anmolmann @ArnolFokam , I had the same problem when I ran the code on a very slow cluster node. I found out that the reason for the error is because of the magic number 30 in the code below https://github.com/wandb/wandb/blob/master/wandb/sdk/service/service.py#L41 def _wait_for_ports(self, fname: str, proc: subprocess.Popen = None) -> bool: time_max = time.time() + 30 When I increase the max waiting time from 30 to 300, it works.

1reaction
Adam-lxdcommented, Sep 1, 2022

@Adam-lxd , you’re right this is being caused by network slowness for sure. Please let us know if you see that your network speed is back to normal but you still encounter this issue.

Thanks, the main reason is my network, may the cluster administrator block the connection to wandb service. now my solution is training on the cluster with offline mode, then switching o my PC to upload log files.

Read more comments on GitHub >

github_iconTop Results From Across the Web

[CLI]: Can't find port file when using wandb.require("service")
This error occurs when wandb.init is called. Therefore, the wandb folder is not yet created. In my case, I can confirm that this...
Read more >
Wandb fails at init (assert ports_found) - W&B Help
Potentially a duplicate of [CLI]: Can't find port file when using wandb.require("service") · Issue #3911 · wandb/wandb · GitHub
Read more >
FAQ - Documentation - Weights & Biases - WandB
Frequently asked questions about setting up privately-hosted versions of our app.
Read more >
Command Line Interface - Documentation - Weights & Biases
Command Line Interface ; enabled. Enable W&B. ; init. Configure a directory with Weights & Biases ; launch. Launch or queue a job...
Read more >
Troubleshooting - Documentation - Weights & Biases - WandB
3. Try running W&B Private Hosting, which operates on your machine and doesn't sync files to our cloud servers.
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found