
Logging error on Hydra multiruns using PyTorch Lightning

See original GitHub issue

When I do a multirun with Hydra (with my model using PyTorch Lightning), the program always crashes at the end of the second run. The problem seems to be wandb closing a logger prematurely: if I remove the WandbLogger, the problem disappears.

I don’t have a simple reproducible example (though I can provide more info if necessary), but my main file is very simple:

from omegaconf import DictConfig, OmegaConf, open_dict
import hydra

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    # Build the datamodule and run its preparation/setup hooks.
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")

    # Record the number of training batches in the (otherwise frozen) config.
    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    trainer = get_trainer(hparams)

    trainer.fit(model, data)


if __name__ == "__main__":
    run_train()

Here’s the error log:

--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
  stream.write(msg + self.terminator)
File "/opt/conda/lib/python3.8/site-packages/wandb/lib/redirect.py", line 23, in write
  self.stream.write(data)
ValueError: I/O operation on closed file.
Call stack:
File "hydra_run.py", line 31, in <module>
  run_train()
File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
  _run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 354, in _run_hydra
  run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
  return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
  lambda: hydra.multirun(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
  return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
  results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 76, in launch
  ret = run_job(
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
  ret.return_value = task_function(task_cfg)
File "hydra_run.py", line 24, in run_train
  trainer.fit(model, data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
  result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
  results = self.accelerator_backend.train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
  results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
  self.train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 416, in train
  self.run_training_teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1136, in run_training_teardown
  log.info('Saving latest checkpoint..')
Message: 'Saving latest checkpoint..'
Arguments: ()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 247, in _flush_loggers
  h_weak_ref().flush()
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1065, in flush
  self.stream.flush()
ValueError: I/O operation on closed file.
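
Not part of the original report, but a workaround often suggested for this class of failure is to finish the wandb run explicitly at the end of each Hydra job, so that wandb’s stdout/stderr redirection is torn down cleanly before the next job in the sweep starts. The sketch below adapts the main file above under that assumption: Data, Classifier and get_trainer are the reporter’s own modules, the multirun overrides in the comment are hypothetical, and wandb.finish() assumes a reasonably recent wandb release (older versions used wandb.join()).

import hydra
import wandb
from omegaconf import DictConfig, open_dict

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


# Hypothetical sweep launch, e.g.:
#   python hydra_run.py --multirun batch_size=32,64,128
@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")

    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    trainer = get_trainer(hparams)

    try:
        trainer.fit(model, data)
    finally:
        # Close the wandb run (and its stream redirection) before the next
        # multirun job starts, even if training raised an exception.
        wandb.finish()


if __name__ == "__main__":
    run_train()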

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 23 (5 by maintainers)

Top GitHub Comments

4 reactions
tomsercu commented, Nov 12, 2020

Hi folks, I’m still seeing this issue when doing more than 2 runs with multirun. See this gist, where I adapted @borisdayma’s example to make it a bit more verbose. System: Python 3.7.9, wandb 0.10.10, hydra 1.0.3.

You can see the output is printed in the first 2 runs but disappears for the 3rd (and all subsequent) runs. Output disappears both for plain print statements and for the Hydra-configured Python logger, and a similar error is raised: OSError: [Errno 9] Bad file descriptor. In real runs, it seems that the jobs still run and sync to wandb even when all stdout is gone.
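
For readers who cannot open the gist, a minimal repro of the same shape might look like the sketch below. This is an illustration only, not the linked gist; the project name and the +job override are made up. It is a Hydra task that prints, logs through the Hydra-configured logger, and opens and closes a wandb run, launched with --multirun so that three or more jobs execute in one process.

import logging

import hydra
import wandb
from omegaconf import DictConfig

log = logging.getLogger(__name__)


@hydra.main()
def task(cfg: DictConfig) -> None:
    wandb.init(project="hydra-multirun-repro")  # hypothetical project name
    print("print: starting job with", cfg)      # plain print statement
    log.info("logger: starting job")            # hydra-configured python logger
    wandb.log({"dummy_metric": 0.0})
    wandb.finish()


if __name__ == "__main__":
    # e.g. python repro.py --multirun +job=1,2,3   (hypothetical override)
    task()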

3 reactions
vanpelt commented, Feb 26, 2021

Hey @lucmos, we just added this start method, so we’re still evaluating performance around it. The default start method is spawn, which has better semantics and separation of logic but can interact poorly, especially if your code is leveraging multiprocessing as well. We’re planning to at least detect this case and automatically use thread when appropriate.
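
In code, the setting being discussed looks roughly like the sketch below. The project name is a placeholder, and passing settings through Lightning’s WandbLogger relies on the logger forwarding extra keyword arguments to wandb.init, which is an assumption about the integration rather than something stated in this thread.

import wandb
from pytorch_lightning.loggers import WandbLogger

# Ask wandb to start its background process in a thread rather than with the
# default "spawn" start method mentioned above.
settings = wandb.Settings(start_method="thread")

# Either when calling wandb.init directly ...
run = wandb.init(project="my-project", settings=settings)  # placeholder project
wandb.finish()

# ... or through Lightning's WandbLogger.
logger = WandbLogger(project="my-project", settings=settings)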

Read more comments on GitHub

Top Results From Across the Web

  • Using Hydra + DDP - PyTorch Lightning: I get the following warning Missing logger folder: "1"/logs; I get the following error when running a fast_dev_run at test time: [Errno 2]...
  • raw - Hugging Face: Hydra changes working directory to new logging folder for every executed run, which might not be compatible with the way some libraries...
  • Optuna Sweeper plugin - Hydra: To run optimization, clone the code and run the following command in the plugins/hydra_optuna_sweeper directory: python example/sphere.py --multirun
  • Complete tutorial on how to use Hydra in Machine Learning...: Hydra provides you a way to maintain a log of every run without you having to worry about it. The directory structure after...
  • Offline Sync Stalls after Missing Artefact - W&B Help: I'm using Hydra+PL+WandB (Offline) to log a sweep of runs. env: Python 3.7.11 wandb==0.13.7 pytorch-lightning==1.8.4 hydra-core==1.3.0 ...
