tensorboard_logger raises ValueError when training is terminated by terminate_on_nan
See original GitHub issue
🐛 Bug description
tensorboard_logger raises an exception when training is terminated by the terminate_on_nan handler.
I am running multiple learning-rate tests with the hydra package to tune the hyperparameters (e.g. batch size, batch accumulation factor, momentum, weight decay) of my network. I am using an EfficientNet-b3 network, an SGD optimizer and APEX O2 mixed precision, and I use the following code snippet to train my network:
trainer = create_supervised_trainer(model, optimizer, loss, device=device, prepare_batch=prepare_batch,
                                    accumulation_steps=cfg.accumulation_steps, max_norm=cfg.max_norm)  # added gradient accumulation and gradient clipping to create_supervised_trainer
train_evaluator = create_supervised_evaluator(model, device=device,
                                              metrics=metrics, prepare_batch=prepare_batch)
validation_evaluator = create_supervised_evaluator(model, device=device,
                                                   metrics=metrics, prepare_batch=prepare_batch)

# Set up logging
trainer.logger = setup_logger("Trainer")
train_evaluator.logger = setup_logger("Train evaluator")
validation_evaluator.logger = setup_logger("Validation evaluator")


@trainer.on(Events.EPOCH_COMPLETED)
def compute_metrics(engine):
    train_evaluator.run(train_loader)
    validation_evaluator.run(val_loader)


tb_logger = TensorboardLogger(log_dir=os.getcwd())

tb_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval),
    tag="training",
    output_transform=lambda loss: {"batchloss": loss},
    metric_names="all",
)

for tag, evaluator in [("training", train_evaluator), ("validation", validation_evaluator)]:
    tb_logger.attach_output_handler(
        evaluator,
        event_name=Events.EPOCH_COMPLETED,
        tag=tag,
        metric_names=["loss", "AUC", "CLaccuracy"],
        global_step_transform=global_step_from_engine(trainer),
    )

tb_logger.attach_opt_params_handler(trainer, event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval), optimizer=optimizer, param_name='lr')
tb_logger.attach_opt_params_handler(trainer, event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval), optimizer=optimizer, param_name='momentum')
tb_logger.attach(trainer, log_handler=WeightsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=GradsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=cfg.log_interval))
tb_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=cfg.log_interval))


def score_function(engine):
    return engine.state.metrics["AUC"]


model_checkpoint = ModelCheckpoint(
    os.getcwd(),
    n_saved=10,  # save best 10 models
    filename_prefix="best",
    score_function=score_function,
    score_name="validation_AUC",
    global_step_transform=global_step_from_engine(trainer),
    require_empty=cfg.require_empty,
)

ProgressBar(persist=True).attach(trainer, metric_names=['gpu:0 mem(%)', 'gpu:0 util(%)', 'batchloss'] if device.type == 'cuda' else ['batchloss'])
ProgressBar(persist=True).attach(validation_evaluator, metric_names=['AUC', 'batchloss'])


# Clear cuda cache between training/testing such that everything fits on the GPU
@trainer.on(Events.EPOCH_COMPLETED)
@evaluator.on(Events.COMPLETED)
def empty_cuda_cache(engine):
    torch.cuda.empty_cache()
    import gc
    gc.collect()


validation_evaluator.add_event_handler(Events.COMPLETED, model_checkpoint, {"model": model})

trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
# Should these also be terminated??
train_evaluator.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
validation_evaluator.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())

# kick everything off
if cfg.use_lr_finder:
    with lr_finder.attach(trainer, to_save=to_save, end_lr=cfg.lr_finder.end_lr,
                          diverge_th=cfg.lr_finder.diverge_th, smooth_f=cfg.lr_finder.smooth) as trainer_with_lr_finder:
        trainer_with_lr_finder.run(train_loader, max_epochs=cfg.lr_finder.epochs)

    # Get lr_finder results
    log.info("LR_finder results: %r", lr_finder.get_results())

    # Plot lr_finder results (requires matplotlib)
    # lr_finder.plot()

    # get lr_finder suggestion for lr
    log.info("LR_finder suggestion: %r", lr_finder.lr_suggestion())
else:
    if cfg.use_cl_lr:
        cooldown_epoch = round(cfg.max_epoch * cfg.cl_lr.cooldown_perc / 100)
        max_epoch = round(cooldown_epoch / 2)
        lr_scheduler = PiecewiseLinear(
            optimizer, 'lr',
            milestones_values=[(1, cfg.cl_lr.min_lr), (max_epoch, cfg.cl_lr.max_lr),
                               (cooldown_epoch, cfg.cl_lr.min_lr), (max_epoch, cfg.cl_lr.min_lr / 1000)])
        momentum_scheduler = PiecewiseLinear(
            optimizer, 'momentum',
            milestones_values=[(1, cfg.cl_lr.max_momentum), (max_epoch, cfg.cl_lr.min_momentum),
                               (cooldown_epoch, cfg.cl_lr.max_momentum), (max_epoch, cfg.cl_lr.max_momentum)])
        trainer.add_event_handler(Events.ITERATION_STARTED, lr_scheduler)
        trainer.add_event_handler(Events.ITERATION_STARTED, momentum_scheduler)
    trainer.run(train_loader, max_epochs=epochs)

tb_logger.close()
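(Not shown in the snippet: lr_finder and to_save are created earlier in the script. A minimal sketch of that setup, assuming ignite's FastaiLRFinder and the usual contents of to_save:)

from ignite.contrib.handlers import FastaiLRFinder

# assumed setup for the lr_finder / to_save names used above
lr_finder = FastaiLRFinder()
to_save = {"model": model, "optimizer": optimizer}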
The following stack trace is shown when my script terminates:
[2020-08-03 11:53:48,130][ignite.handlers.terminate_on_nan.TerminateOnNan][WARNING] - TerminateOnNan: Output '(tensor([[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan]], device='cuda:0'), tensor([[0.],
[0.],
[0.],
[1.],
[1.],
[1.],
[1.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]], device='cuda:0'))' contains NaN or Inf. Stop training
2020-08-03 11:53:48,131 Validation evaluator INFO: Terminate signaled. Engine will stop after current iteration is finished.
2020-08-03 11:55:34,855 Validation evaluator INFO: Epoch[1] Complete. Time taken: 00:03:24
2020-08-03 11:55:35,415 Validation evaluator INFO: Engine run complete. Time taken: 00:03:25
[2020-08-03 11:56:45,232][root][WARNING] - NaN or Inf found in input tensor.
2020-08-03 11:56:45,349 Trainer ERROR: Engine run is terminating due to exception: The histogram is empty, please file a bug report..
Traceback (most recent call last):
File "./response_prediction/train_model.py", line 396, in <module>
run()
File "/opt/conda/lib/python3.6/site-packages/hydra/main.py", line 37, in decorated_main
strict=strict,
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 261, in run_hydra
lambda: hydra.multirun(
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 185, in run_and_report
func()
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/utils.py", line 264, in <lambda>
overrides=args.overrides,
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 135, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 113, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.6/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 68, in launch
job_subdir_key="hydra.sweep.subdir",
File "/opt/conda/lib/python3.6/site-packages/hydra/core/utils.py", line 107, in run_job
ret.return_value = task_function(task_cfg)
File "./response_prediction/train_model.py", line 338, in run
trainer_with_lr_finder.run(train_loader, max_epochs=cfg.lr_finder.epochs)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 659, in run
return self._internal_run()
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 723, in _internal_run
self._handle_exception(e)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 438, in _handle_exception
raise e
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 711, in _internal_run
self._fire_event(Events.EPOCH_COMPLETED)
File "/opt/conda/lib/python3.6/site-packages/ignite/engine/engine.py", line 394, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/opt/conda/lib/python3.6/site-packages/ignite/contrib/handlers/tensorboard_logger.py", line 370, in __call__
tag="{}grads/{}".format(tag_prefix, name), values=p.grad.detach().cpu().numpy(), global_step=global_step
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/writer.py", line 496, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/summary.py", line 200, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/opt/conda/lib/python3.6/site-packages/tensorboardX/summary.py", line 238, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
NaNs can be expected when the learning rate becomes too large during an LR-finder sweep, which is why I added the terminate_on_nan handler. I was expecting the terminate_on_nan handler to stop training before the NaNs could cause any exceptions, since this exception makes my hyperparameter sweep exit prematurely. I'm not entirely sure why this exception is raised; I suspect the model's state dict might have been deleted before the tensorboard logger can write the results?
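If I read the trace correctly, the immediate trigger is that GradsHistHandler hands an all-NaN gradient tensor to tensorboardX, which then cannot place a single value into a histogram bucket. A minimal, standalone sketch that should reproduce the same ValueError (log dir and tag are arbitrary):

import numpy as np
from tensorboardX import SummaryWriter

writer = SummaryWriter("/tmp/nan_hist_repro")  # arbitrary log dir
# Every NaN falls outside every finite bucket, so the histogram ends up with
# zero counts and tensorboardX raises:
#   ValueError: The histogram is empty, please file a bug report.
writer.add_histogram("grads/demo", np.full(16, np.nan), global_step=0)
writer.close()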
Environment
Installed packages: monai==0.1.0 numpy==1.18.5 nibabel==2.5.1 torchio==0.17.10 click==7.1.2 torchvision==0.5.0 pytorch_ignite==0.4rc.0.post1 torch==1.4.0 matplotlib==3.2.2 pandas==1.0.5 niwidgets==0.2.2 python-dotenv==0.13.0 SimpleITK==1.2.4 msgpack==0.5.6 xlrd==1.2.0 tensorboardx==2.0 pynvml==8.0.4 tensorboard==2.2.2 torchsummary==1.5.1 adabound==0.0.5 hydra-core==1.0.0rc1 git+https://github.com/shijianjian/EfficientNet-PyTorch-3D.git@c33efbba18970cb45f481601494379ff91d4b850#egg=efficientnet-pytorch-3d
OS: Linux (Ubuntu). Everything is run within a docker container: nvcr.io/nvidia/pytorch:19.10-py3
Ignite has been installed within the docker container using pip
Top GitHub Comments
I'll check that tonight.
Yes it is indeed the validation evaluator that catches the NaNs.
So, thinking about your suggestion that the model weights already contain NaNs: I think this is possible because of APEX mixed-precision training. From what I understand, the gradient scaling performed by APEX often produces Infs and/or NaNs early in training; when that happens, APEX detects it and skips the gradient update step for that iteration. Looking at the gradients, in this particular case the NaNs occur in the last iteration of an epoch, just before the validation evaluator is called. Perhaps the NaNs are not handled correctly in this edge case, which results in NaNs in the validation evaluator.
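To check this, a small handler along these lines (an untested sketch, reusing trainer, model and Events from my snippet above) could report which parameters end up with non-finite gradients on each iteration:

import torch
from ignite.engine import Events

@trainer.on(Events.ITERATION_COMPLETED)
def report_nonfinite_grads(engine):
    # With APEX O2 an occasional non-finite scaled gradient is expected and the
    # optimizer step is skipped; this just makes those iterations visible.
    bad = [name for name, p in model.named_parameters()
           if p.grad is not None and not torch.isfinite(p.grad).all()]
    if bad:
        engine.logger.warning("Iteration %d: non-finite grads in %s",
                              engine.state.iteration, bad[:5])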
What is weird though is that for those iterations that contain NaNs I do have a valid loss value, so I guess that's the reason why the loss doesn't seem to have diverged in the code snippet that you pointed to. I think APEX somehow handles this internally.
To make things weirder, I have just rerun the LR finder and again got NaNs in the gradients of the last iteration, resulting in NaNs in the predictions of the validation evaluator, which in turn triggers TerminateOnNan. However, this time no exception is raised and everything finishes fine… I will try to do some re-runs with and without the validation, and with and without the tb logging, to see if I can somehow reproduce it.
As for your question of why I was logging the weights/grads etc.: I initially programmed my training script and added the LR finder sweep later. I didn't see any pressing reason to remove the logging, so I left it in. Also, it now allows me to check the gradients for NaNs, so I guess it can come in handy for debugging.
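On that note: if the histogram handlers keep tripping over NaN gradients, a possible workaround (untested sketch, subclassing ignite's GradsHistHandler) would be to skip the histogram dump whenever any gradient is non-finite, and attach it in place of GradsHistHandler in the snippet above:

import torch
from ignite.contrib.handlers.tensorboard_logger import GradsHistHandler

class SafeGradsHistHandler(GradsHistHandler):
    # Skip the whole histogram dump whenever any gradient is non-finite, so that
    # TerminateOnNan can stop the run without the tensorboard logger raising.
    def __call__(self, engine, logger, event_name):
        grads = [p.grad for p in self.model.parameters() if p.grad is not None]
        if all(torch.isfinite(g).all() for g in grads):
            super().__call__(engine, logger, event_name)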