
Logging error on Hydra multiruns using PyTorch Lightning

See original GitHub issue

When I do a multirun with Hydra (with my model using PyTorch Lightning), the program always crashes at the end of the second run. The problem seems to be wandb closing a logger prematurely: if I remove the WandbLogger, the problem disappears.

I don’t have a simple reproducible example (though I can provide more info if necessary), but my main file is very simple:

from omegaconf import DictConfig, OmegaConf, open_dict
import hydra

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    # Build the datamodule and run its preparation/setup hooks.
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")

    # Record the number of training batches in the (otherwise frozen) config.
    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    trainer = get_trainer(hparams)

    trainer.fit(model, data)


if __name__ == "__main__":
    run_train()

Here’s the error log:

--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
  stream.write(msg + self.terminator)
File "/opt/conda/lib/python3.8/site-packages/wandb/lib/redirect.py", line 23, in write
  self.stream.write(data)
ValueError: I/O operation on closed file.
Call stack:
File "hydra_run.py", line 31, in <module>
  run_train()
File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
  _run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 354, in _run_hydra
  run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
  return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
  lambda: hydra.multirun(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
  return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
  results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 76, in launch
  ret = run_job(
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
  ret.return_value = task_function(task_cfg)
File "hydra_run.py", line 24, in run_train
  trainer.fit(model, data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
  result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
  results = self.accelerator_backend.train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
  results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
  self.train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 416, in train
  self.run_training_teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1136, in run_training_teardown
  log.info('Saving latest checkpoint..')
Message: 'Saving latest checkpoint..'
Arguments: ()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 247, in _flush_loggers
  h_weak_ref().flush()
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1065, in flush
  self.stream.flush()
ValueError: I/O operation on closed file.
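
Not part of the original report, but a workaround often suggested for this class of failure is to finish the wandb run explicitly at the end of each Hydra job, so that wandb’s stdout/stderr redirection is torn down cleanly before the next job in the sweep starts. The sketch below adapts the main file above under that assumption: Data, Classifier and get_trainer are the reporter’s own modules, the multirun overrides in the comment are hypothetical, and wandb.finish() assumes a reasonably recent wandb release (older versions used wandb.join()).

import hydra
import wandb
from omegaconf import DictConfig, open_dict

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


# Hypothetical sweep launch, e.g.:
#   python hydra_run.py --multirun batch_size=32,64,128
@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")

    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    trainer = get_trainer(hparams)

    try:
        trainer.fit(model, data)
    finally:
        # Close the wandb run (and its stream redirection) before the next
        # multirun job starts, even if training raised an exception.
        wandb.finish()


if __name__ == "__main__":
    run_train()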

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 23 (5 by maintainers)

Top GitHub Comments

4 reactions
tomsercu commented, Nov 12, 2020

Hi folks, I’m still seeing this issue when doing more than 2 runs with multirun. See this gist, where I adapted @borisdayma’s example to make it a bit more verbose. System: Python 3.7.9, wandb 0.10.10, hydra 1.0.3.

You can see the output is printed in the first 2 runs but disappears for the 3rd (and all subsequent) runs. Output disappears both for plain print statements and for the Hydra-configured Python logger, and a similar error is raised: OSError: [Errno 9] Bad file descriptor. In real runs, it seems that the jobs still run and sync to wandb even when all stdout is gone.
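
For readers who cannot open the gist, a minimal repro of the same shape might look like the sketch below. This is an illustration only, not the linked gist; the project name and the +job override are made up. It is a Hydra task that prints, logs through the Hydra-configured logger, and opens and closes a wandb run, launched with --multirun so that three or more jobs execute in one process.

import logging

import hydra
import wandb
from omegaconf import DictConfig

log = logging.getLogger(__name__)


@hydra.main()
def task(cfg: DictConfig) -> None:
    wandb.init(project="hydra-multirun-repro")  # hypothetical project name
    print("print: starting job with", cfg)      # plain print statement
    log.info("logger: starting job")            # hydra-configured python logger
    wandb.log({"dummy_metric": 0.0})
    wandb.finish()


if __name__ == "__main__":
    # e.g. python repro.py --multirun +job=1,2,3   (hypothetical override)
    task()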

3 reactions
vanpelt commented, Feb 26, 2021

Hey @lucmos, we just added this start method, so we’re still evaluating performance around it. The default start method is spawn, which has better semantics and separation of logic but can interact poorly, especially if your code is leveraging multiprocessing as well. We’re planning to at least detect this case and automatically use thread when appropriate.
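
In code, the setting being discussed looks roughly like the sketch below. The project name is a placeholder, and passing settings through Lightning’s WandbLogger relies on the logger forwarding extra keyword arguments to wandb.init, which is an assumption about the integration rather than something stated in this thread.

import wandb
from pytorch_lightning.loggers import WandbLogger

# Ask wandb to start its background process in a thread rather than with the
# default "spawn" start method mentioned above.
settings = wandb.Settings(start_method="thread")

# Either when calling wandb.init directly ...
run = wandb.init(project="my-project", settings=settings)  # placeholder project
wandb.finish()

# ... or through Lightning's WandbLogger.
logger = WandbLogger(project="my-project", settings=settings)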

Read more comments on GitHub

Top Results From Across the Web

  • Using Hydra + DDP - PyTorch Lightning: I get the following warning Missing logger folder: "1"/logs; I get the following error when running a fast_dev_run at test time: [Errno 2]...
  • raw - Hugging Face: Hydra changes working directory to new logging folder for every executed run, which might not be compatible with the way some libraries...
  • Optuna Sweeper plugin - Hydra: To run optimization, clone the code and run the following command in the plugins/hydra_optuna_sweeper directory: python example/sphere.py --multirun
  • Complete tutorial on how to use Hydra in Machine Learning...: Hydra provides you a way to maintain a log of every run without you having to worry about it. The directory structure after...
  • Offline Sync Stalls after Missing Artefact - W&B Help: I'm using Hydra+PL+WandB (Offline) to log a sweep of runs. env: Python 3.7.11 wandb==0.13.7 pytorch-lightning==1.8.4 hydra-core==1.3.0 ...
