[trainer] loss = NaN with label_smoothing and full-fp16 eval
It looks like the Trainer's --label_smoothing_factor feature doesn't handle fp16 well. It's a problem for the DeepSpeed ZeRO-3 integration I'm working on right now, since it evals in fp16, but it can also be reproduced with the recently added --fp16_full_eval trainer option.
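For illustration only, here is a minimal standalone sketch (not the Trainer's actual label-smoothing code, and the shapes/vocab size are just placeholders) of one way fp16 arithmetic can break this computation: the uniform-smoothing term sums log-probabilities over the whole vocabulary, and that sum easily falls outside the fp16 range (~65504) and collapses to -inf; once -inf or NaN values enter the loss arithmetic, the reduced loss comes out non-finite.

```python
import torch
import torch.nn.functional as F

# Standalone sketch; the shapes and the t5-small vocab size are only illustrative.
torch.manual_seed(0)
vocab_size = 32128                            # t5-small vocabulary size
logits = torch.randn(1, 4, vocab_size) * 10   # stand-in for model outputs

# Compute log-probs in fp32, then cast to fp16 as a full-fp16 eval would hold them.
log_probs = F.log_softmax(logits, dim=-1).to(torch.float16)

# The uniform-smoothing term of label smoothing reduces log-probs over the vocab dim.
smoothed_fp16 = log_probs.sum(dim=-1)                       # result must fit in fp16
smoothed_fp32 = log_probs.sum(dim=-1, dtype=torch.float32)  # upcast the reduction

print(smoothed_fp16)  # -inf everywhere: the sum is far outside the fp16 range
print(smoothed_fp32)  # large but finite values (around -1.5e6 here)
```

In the real run the fp16 activations/logits themselves can also overflow (T5 checkpoints are known to be prone to fp16 overflow), in which case the loss is NaN no matter how it is reduced; the sketch only illustrates the reduction side of the problem.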
To reproduce:
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 python examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --evaluation_strategy=steps --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --val_max_target_length 128 --warmup_steps 500 --max_val_samples 500 --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --fp16_full_eval
***** eval metrics *****
eval_bleu = 24.1257
eval_gen_len = 39.554
eval_loss = nan
eval_mem_cpu_alloc_delta = 56MB
eval_mem_cpu_peaked_delta = 0MB
eval_mem_gpu_alloc_delta = 116MB
eval_mem_gpu_peaked_delta = 374MB
eval_runtime = 25.3246
eval_samples = 500
eval_samples_per_second = 19.744
init_mem_cpu_alloc_delta = 2MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 0MB
init_mem_gpu_peaked_delta = 0MB
If someone in the community would like to have a look at solving this puzzle, please refer to the discussion in this issue.
Basically, we would like to find a way to perform label smoothing under full fp16 while handling NaNs, so that the final loss is not NaN.
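One direction that could be explored (a rough sketch under stated assumptions, not a vetted fix and not the Trainer's actual code): do the numerically fragile parts of the loss in fp32, zero out padded positions before reducing, and neutralize any remaining non-finite entries so a few bad positions can't poison the batch loss. The function name below is hypothetical, and the -100 ignore index just mirrors the usual padding-label convention in the examples.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss_fp16_safe(logits, labels, epsilon=0.1, ignore_index=-100):
    """Illustrative label-smoothed NLL that tries to stay finite with fp16 logits."""
    log_probs = F.log_softmax(logits.float(), dim=-1)   # upcast before the softmax
    padding_mask = labels.eq(ignore_index)
    safe_labels = labels.clamp(min=0).unsqueeze(-1)     # gather() needs valid indices

    nll_loss = -log_probs.gather(dim=-1, index=safe_labels).squeeze(-1)
    smoothed_loss = -log_probs.sum(dim=-1) / log_probs.size(-1)

    # Zero out padded positions and any non-finite entries before reducing.
    nll_loss = torch.where(padding_mask | ~torch.isfinite(nll_loss),
                           torch.zeros_like(nll_loss), nll_loss)
    smoothed_loss = torch.where(padding_mask | ~torch.isfinite(smoothed_loss),
                                torch.zeros_like(smoothed_loss), smoothed_loss)

    num_active = (~padding_mask).sum().clamp(min=1)
    nll = nll_loss.sum() / num_active
    smooth = smoothed_loss.sum() / num_active
    return (1.0 - epsilon) * nll + epsilon * smooth

# Usage sketch: logits straight from a full-fp16 forward pass, labels padded with -100.
logits = torch.randn(2, 5, 32128).half()
labels = torch.randint(0, 32128, (2, 5))
labels[:, -1] = -100
print(label_smoothed_loss_fp16_safe(logits, labels))
```

Upcasting only the loss computation keeps the memory benefit of fp16 activations while doing the fragile reductions in fp32; whether that gets close enough to the full-fp32 result is exactly what the reference run described next is for.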
And for a reference value, running the same script without --fp16_full_eval should give you the "golden" eval_loss, i.e. ideally the result with --fp16_full_eval should be about the same (if that's possible).
Thank you!
Top GitHub Comments
I’m interested in taking a stab at this!
I think this simple solution makes sense to apply, as it is also generic enough to cover all the cases. I doubt a better approach can be found in a short time, and it doesn't appear necessary to spend more time on this (at least for now). Feel free to open the PR as you offered!