tensorboard_logger raises ValueError when training is terminated by terminate_on_nan
See original GitHub issue
🐛 Bug description
tensorboard_logger raises an exception when training is terminated by the terminate_on_nan handler.
I am running multiple learning-rate tests with the hydra package to tune the hyperparameters (e.g. batch size, batch accumulation factor, momentum, weight decay) of my network. I am using an EfficientNet-b3 network, an SGD optimizer and APEX O2 mixed precision, and I use the following code snippet to train my network:
trainer = create_supervised_trainer(model, optimizer, loss, device=device, prepare_batch=prepare_batch,
                                    accumulation_steps=cfg.accumulation_steps, max_norm=cfg.max_norm)  # added gradient accumulation and gradient clipping to create_supervised_trainer
train_evaluator = create_supervised_evaluator(model, device=device,
                                              metrics=metrics, prepare_batch=prepare_batch)
validation_evaluator = create_supervised_evaluator(model, device=device,
                                                   metrics=metrics, prepare_batch=prepare_batch)

# Set up logging
trainer.logger = setup_logger("Trainer")
train_evaluator.logger = setup_logger("Train evaluator")
validation_evaluator.logger = setup_logger("Validation evaluator")


@trainer.on(Events.EPOCH_COMPLETED)
def compute_metrics(engine):
    train_evaluator.run(train_loader)
    validation_evaluator.run(val_loader)


tb_logger = TensorboardLogger(log_dir=os.getcwd())

tb_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval),
    tag="training",
    output_transform=lambda loss: {"batchloss": loss},
    metric_names="all",
)

for tag, evaluator in [("training", train_evaluator), ("validation", validation_evaluator)]:
    tb_logger.attach_output_handler(
        evaluator,
        event_name=Events.EPOCH_COMPLETED,
        tag=tag,
        metric_names=["loss", "AUC", "CLaccuracy"],
        global_step_transform=global_step_from_engine(trainer),
    )

tb_logger.attach_opt_params_handler(trainer, event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval), optimizer=optimizer, param_name='lr')
tb_logger.attach_opt_params_handler(trainer, event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval), optimizer=optimizer, param_name='momentum')
tb_logger.attach(trainer, log_handler=WeightsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=GradsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=cfg.log_interval))


def score_function(engine):
    return engine.state.metrics["AUC"]


model_checkpoint = ModelCheckpoint(
    os.getcwd(),
    n_saved=10,  # save best 10 models
    filename_prefix="best",
    score_function=score_function,
    score_name="validation_AUC",
    global_step_transform=global_step_from_engine(trainer),
    require_empty=cfg.require_empty,
)

ProgressBar(persist=True).attach(trainer, metric_names=['gpu:0 mem(%)', 'gpu:0 util(%)', 'batchloss'] if device.type == 'cuda' else ['batchloss'])
ProgressBar(persist=True).attach(validation_evaluator, metric_names=['AUC', 'batchloss'])


# Clear cuda cache between training/testing such that everything fits on the GPU
@trainer.on(Events.EPOCH_COMPLETED)
@evaluator.on(Events.COMPLETED)
def empty_cuda_cache(engine):
    torch.cuda.empty_cache()
    import gc
    gc.collect()


validation_evaluator.add_event_handler(Events.COMPLETED, model_checkpoint, {"model": model})

trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
# Should these also be terminated??
train_evaluator.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
validation_evaluator.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())

# kick everything off
if cfg.use_lr_finder:
    with lr_finder.attach(trainer, to_save=to_save, end_lr=cfg.lr_finder.end_lr,
                          diverge_th=cfg.lr_finder.diverge_th, smooth_f=cfg.lr_finder.smooth) as trainer_with_lr_finder:
        trainer_with_lr_finder.run(train_loader, max_epochs=cfg.lr_finder.epochs)

    # Get lr_finder results
    log.info("LR_finder results: %r", lr_finder.get_results())

    # Plot lr_finder results (requires matplotlib)
    # lr_finder.plot()

    # get lr_finder suggestion for lr
    log.info("LR_finder suggestion: %r", lr_finder.lr_suggestion())
else:
    if cfg.use_cl_lr:
        cooldown_epoch = round(cfg.max_epoch * cfg.cl_lr.cooldown_perc / 100)
        max_epoch = round(cooldown_epoch / 2)
        lr_scheduler = PiecewiseLinear(
            optimizer, 'lr',
            milestones_values=[(1, cfg.cl_lr.min_lr), (max_epoch, cfg.cl_lr.max_lr),
                               (cooldown_epoch, cfg.cl_lr.min_lr), (max_epoch, cfg.cl_lr.min_lr / 1000)])
        momentum_scheduler = PiecewiseLinear(
            optimizer, 'momentum',
            milestones_values=[(1, cfg.cl_lr.max_momentum), (max_epoch, cfg.cl_lr.min_momentum),
                               (cooldown_epoch, cfg.cl_lr.max_momentum), (max_epoch, cfg.cl_lr.max_momentum)])
        trainer.add_event_handler(Events.ITERATION_STARTED, lr_scheduler)
        trainer.add_event_handler(Events.ITERATION_STARTED, momentum_scheduler)
    trainer.run(train_loader, max_epochs=epochs)

tb_logger.close()
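(Not shown in the snippet: lr_finder and to_save are created earlier in the script. A minimal sketch of that setup, assuming ignite's FastaiLRFinder and the usual contents of to_save:)

from ignite.contrib.handlers import FastaiLRFinder

# assumed setup for the lr_finder / to_save names used above
lr_finder = FastaiLRFinder()
to_save = {"model": model, "optimizer": optimizer}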
The following stack trace is shown when my script terminates:
[2020-08-03 11:53:48,130][ignite.handlers.terminate_on_nan.TerminateOnNan][WARNING] - TerminateOnNan: Output '(tensor([[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan]], device='cuda:0'), tensor([[0.],
[0.],
[0.],
[1.],
[1.],
[1.],
[1.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]], device='cuda:0'))' contains NaN or Inf. Stop training
2020-08-03 11:53:48,131 Validation evaluator INFO: Terminate signaled. Engine will stop after current iteration is finished.
2020-08-03 11:55:34,855 Validation evaluator INFO: Epoch[1] Complete. Time taken: 00:03:24
2020-08-03 11:55:35,415 Validation evaluator INFO: Engine run complete. Time taken: 00:03:25
[2020-08-03 11:56:45,232][root][WARNING] - NaN or Inf found in input tensor.
2020-08-03 11:56:45,349 Trainer ERROR: Engine run is terminating due to exception: The histogram is empty, please file a bug report..
Traceback (most recent call last):
File "./response_prediction/train_model.py", line 396, in <module>
run()
File "/opt/conda/lib/python3.6/site-packages/hydra/main.py", line 37, in decorated_main
strict=strict,
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 261, in run_hydra
lambda: hydra.multirun(
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 185, in run_and_report
func()
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 264, in <lambda>
overrides=args.overrides,
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 135, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 113, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 68, in launch
job_subdir_key="hydra.sweep.subdir",
File "/opt/conda/lib/python3.6/site-packages/hydra/core/utils.py", line 107, in run_job
ret.return_value = task_function(task_cfg)
File "./response_prediction/train_model.py", line 338, in run
trainer_with_lr_finder.run(train_loader, max_epochs=cfg.lr_finder.epochs)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 659, in run
return self._internal_run()
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 723, in _internal_run
self._handle_exception(e)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 438, in _handle_exception
raise e
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 711, in _internal_run
self._fire_event(Events.EPOCH_COMPLETED)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 394, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/opt/conda/lib/python3.6/site-packages/ignite/contrib/handlers/tensorboard_logger.py", line 370, in __call__
tag="{}grads/{}".format(tag_prefix, name), values=p.grad.detach().cpu().numpy(), global_step=global_step
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/writer.py", line 496, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/summary.py", line 200, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/summary.py", line 238, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
NaNs can be expected when the learning rate becomes too large during an LR-finder sweep, which is why I added the terminate_on_nan handler. I was expecting the terminate_on_nan handler to stop training before the NaNs could cause any exceptions, since this exception makes my hyperparameter sweep exit prematurely. I'm not entirely sure why this exception is raised; I suspect the model's state dict might have been deleted before the tensorboard logger can write the results?
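If I read the trace correctly, the immediate trigger is that GradsHistHandler hands an all-NaN gradient tensor to tensorboardX, which then cannot place a single value into a histogram bucket. A minimal, standalone sketch that should reproduce the same ValueError (log dir and tag are arbitrary):

import numpy as np
from tensorboardX import SummaryWriter

writer = SummaryWriter("/tmp/nan_hist_repro")  # arbitrary log dir
# Every NaN falls outside every finite bucket, so the histogram ends up with
# zero counts and tensorboardX raises:
#   ValueError: The histogram is empty, please file a bug report.
writer.add_histogram("grads/demo", np.full(16, np.nan), global_step=0)
writer.close()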
Environment
Installed packages: monai==0.1.0 numpy==1.18.5 nibabel==2.5.1 torchio==0.17.10 click==7.1.2 torchvision==0.5.0 pytorch_ignite==0.4rc.0.post1 torch==1.4.0 matplotlib==3.2.2 pandas==1.0.5 niwidgets==0.2.2 python-dotenv==0.13.0 SimpleITK==1.2.4 msgpack==0.5.6 xlrd==1.2.0 tensorboardx==2.0 pynvml==8.0.4 tensorboard==2.2.2 torchsummary==1.5.1 adabound==0.0.5 hydra-core==1.0.0rc1 git+https://github.com/shijianjian/EfficientNet-PyTorch-3D.git@c33efbba18970cb45f481601494379ff91d4b850#egg=efficientnet-pytorch-3d
OS: Linux (Ubuntu). Everything is run within a docker container: nvcr.io/nvidia/pytorch:19.10-py3
Ignite has been installed within the docker container using pip
Top GitHub Comments
I'll check that tonight.
Yes it is indeed the validation evaluator that catches the NaNs.
So, thinking about your suggestion that the model weights already contain NaNs: I think this is possible because of APEX mixed-precision training. From what I understand, the gradient scaling performed by APEX often produces Infs and/or NaNs early in training; when that happens, APEX detects it and skips the gradient update step for that iteration. Looking at the gradients, in this particular case the NaNs occur in the last iteration of an epoch, just before the validation evaluator is called. Perhaps the NaNs are not handled correctly in this edge case, which results in NaNs in the validation evaluator.
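To check this, a small handler along these lines (an untested sketch, reusing trainer, model and Events from my snippet above) could report which parameters end up with non-finite gradients on each iteration:

import torch
from ignite.engine import Events

@trainer.on(Events.ITERATION_COMPLETED)
def report_nonfinite_grads(engine):
    # With APEX O2 an occasional non-finite scaled gradient is expected and the
    # optimizer step is skipped; this just makes those iterations visible.
    bad = [name for name, p in model.named_parameters()
           if p.grad is not None and not torch.isfinite(p.grad).all()]
    if bad:
        engine.logger.warning("Iteration %d: non-finite grads in %s",
                              engine.state.iteration, bad[:5])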
What is weird though is that for those iterations that contain NaNs I do have a valid loss value, so I guess that's the reason why the loss doesn't seem to have diverged in the code snippet that you pointed to. I think APEX somehow handles this internally.
To make things weirder, I have just rerun the LR finder and again got NaNs in the gradients of the last iteration, resulting in NaNs in the predictions of the validation evaluator, which in turn triggers TerminateOnNan. However, this time no exception is raised and everything finishes fine… I will try to do some re-runs with and without the validation, and with and without the tb logging, to see if I can somehow reproduce it.
As for your question of why I was logging the weights/grads etc.: I initially programmed my training script and added the LR finder sweep later. I didn't see any pressing reason to remove the logging, so I left it in. Also, it now allows me to check the gradients for NaNs, so I guess it can come in handy for debugging.
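On that note: if the histogram handlers keep tripping over NaN gradients, a possible workaround (untested sketch, subclassing ignite's GradsHistHandler) would be to skip the histogram dump whenever any gradient is non-finite, and attach it in place of GradsHistHandler in the snippet above:

import torch
from ignite.contrib.handlers.tensorboard_logger import GradsHistHandler

class SafeGradsHistHandler(GradsHistHandler):
    # Skip the whole histogram dump whenever any gradient is non-finite, so that
    # TerminateOnNan can stop the run without the tensorboard logger raising.
    def __call__(self, engine, logger, event_name):
        grads = [p.grad for p in self.model.parameters() if p.grad is not None]
        if all(torch.isfinite(g).all() for g in grads):
            super().__call__(engine, logger, event_name)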