Logging error on Hydra multiruns using PyTorch Lightning
See original GitHub issue

When I do a multirun on Hydra (with my model using PyTorch Lightning), the program always crashes after (at the end of) the second run. The problem seems to be that wandb closes a logger prematurely; if I remove the WandbLogger, the problem disappears.
I don’t have a simple reproducible example (however I can provide more info if necessary), but my main file is very simple:
from omegaconf import DictConfig, OmegaConf, open_dict
import hydra

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")
    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())
    model = Classifier(hparams)
    trainer = get_trainer(hparams)
    trainer.fit(model, data)


if __name__ == "__main__":
    run_train()
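The crash shows up when the script is launched in multirun mode, e.g. python hydra_run.py --multirun batch_size=32,64 (the override is just illustrative). One workaround that is commonly suggested for Hydra + wandb sweeps is to finish the wandb run explicitly at the end of the task function, so the next job starts with a fresh run instead of inheriting a half-closed one. A minimal sketch of the same main file with that change, assuming get_trainer attaches a WandbLogger to the Trainer:

from omegaconf import DictConfig, open_dict
import hydra
import wandb

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")
    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())
    model = Classifier(hparams)
    trainer = get_trainer(hparams)
    trainer.fit(model, data)
    # Close the current wandb run before Hydra launches the next job in
    # the sweep, so its stdout/stderr redirection is torn down cleanly.
    wandb.finish()


if __name__ == "__main__":
    run_train()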
Here’s the error log:
--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
stream.write(msg + self.terminator)
File "/opt/conda/lib/python3.8/site-packages/wandb/lib/redirect.py", line 23, in write
self.stream.write(data)
ValueError: I/O operation on closed file.
Call stack:
File "hydra_run.py", line 31, in <module>
run_train()
File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 354, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
lambda: hydra.multirun(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 76, in launch
ret = run_job(
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
ret.return_value = task_function(task_cfg)
File "hydra_run.py", line 24, in run_train
trainer.fit(model, data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 416, in train
self.run_training_teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1136, in run_training_teardown
log.info('Saving latest checkpoint..')
Message: 'Saving latest checkpoint..'
Arguments: ()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 247, in _flush_loggers
h_weak_ref().flush()
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1065, in flush
self.stream.flush()
ValueError: I/O operation on closed file.
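My reading of the traceback (not an authoritative explanation): wandb replaces sys.stdout/sys.stderr with its own redirect wrappers and closes them when a run ends, but the logging handlers that Hydra and Lightning configured still point at the closed wrapper, so the next emit or flush fails. A hypothetical stop-gap, run at the start of each job, could re-point such handlers at the real stderr; none of the names below come from Hydra or wandb:

import logging
import sys

def repair_closed_stream_handlers() -> None:
    # Hypothetical helper: if a previous job's wandb teardown closed the
    # stream a root StreamHandler writes to, point that handler back at
    # the real stderr so Hydra's final flush doesn't hit
    # "ValueError: I/O operation on closed file".
    for handler in logging.getLogger().handlers:
        if isinstance(handler, logging.StreamHandler) and not isinstance(
            handler, logging.FileHandler
        ):
            stream = getattr(handler, "stream", None)
            if stream is not None and getattr(stream, "closed", False):
                # Assign directly instead of calling handler.setStream(),
                # which would try to flush the already-closed stream.
                handler.stream = sys.stderr

Calling something like this at the top of run_train is only a band-aid; the underlying fix is for the wandb run to be fully finished (or restarted) between Hydra jobs.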
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 23 (5 by maintainers)
Top Results From Across the Web
- Using Hydra + DDP - PyTorch Lightning: I get the following warning Missing logger folder: "1"/logs; I get the following error when running a fast_dev_run at test time: [Errno 2]...
- raw - Hugging Face: Hydra changes working directory to new logging folder for every executed run, which might not be compatible with the way some libraries...
- Optuna Sweeper plugin - Hydra: To run optimization, clone the code and run the following command in the plugins/hydra_optuna_sweeper directory: python example/sphere.py --multirun
- Complete tutorial on how to use Hydra in Machine Learning ...: Hydra provides you a way to maintain a log of every run without you having to worry about it. The directory structure after...
- Offline Sync Stalls after Missing Artefact - W&B Help: I'm using Hydra+PL+WandB (Offline) to log a sweep of runs. env: Python 3.7.11 wandb==0.13.7 pytorch-lightning==1.8.4 hydra-core==1.3.0 ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi folks, I'm still seeing this issue if you do more than 2 runs with multirun. See this gist, where I adapted @borisdayma's example to make it a bit more verbose. System: Python 3.7.9, wandb 0.10.10, hydra 1.0.3. You can see the output is printed in the first 2 runs but disappears for the 3rd (and all subsequent) runs. Output disappears both for plain print statements and for the Hydra-configured Python logger, and a similar error is raised: OSError: [Errno 9] Bad file descriptor. In real runs, it seems the jobs still run and sync to wandb even when all stdout is gone.

Hey @lucmos, we just added this start method, so we're still evaluating performance around it. The default start method is spawn, which has better semantics / logic separation but can interact poorly, especially if your code is also leveraging multiprocessing. We're planning to at least detect this case and automatically use thread when appropriate.
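For reference, the start method mentioned above can be selected when initializing wandb. A minimal sketch (where you call wandb.init depends on how your WandbLogger is set up, and the project name is just a placeholder):

import wandb

# Run wandb's internal machinery in a thread instead of a spawned
# process, which the maintainers suggest can interact better with code
# that already manages its own processes (e.g. Hydra launchers, DDP).
wandb.init(
    project="my-project",  # placeholder
    settings=wandb.Settings(start_method="thread"),
)

Setting the WANDB_START_METHOD=thread environment variable should have the same effect without touching the code.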