`"histogram_cpu" not implemented for 'BFloat16'` when using deepspeed and reporting to wandb
Environment info
- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-20-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: Deepspeed
Who can help
Information
Model I am using (Bert, XLNet …): bart-large
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I’m using a training script adapted from the run_summarization.py example with a model using bart-large architecture and a custom tokenizer. I’m working locally on my workstation with two RTX 3090s. I had been training using deepspeed and fp16, but I saw that the latest transformers update added bf16 support to the deepspeed integration, so I wanted to try that in order to reduce the constant overflow errors I had been getting.
But when using deepspeed, bf16, and reporting to wandb, my training crashes.
I’m able to reproduce the error using the example scripts:
```bash
deepspeed run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --dataset_name cnn_dailymail --dataset_config_name 3.0.0 \
    --do_train --per_device_train_batch_size 4 --bf16 \
    --overwrite_output_dir --output_dir models/text_summarization \
    --deepspeed config/deepspeed_config-zero2-bf16.json
```
with the deepspeed config being:
```json
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
After 500 steps (when saving the first checkpoint), wandb throws this error:
```
RuntimeError: "histogram_cpu" not implemented for 'BFloat16'
```
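The failing op is easy to reproduce outside the Trainer. Here is a minimal sketch, under the assumption (suggested by the error message) that wandb’s logging calls `Tensor.histc()` on parameters it has copied to the CPU:

```python
import torch

# PyTorch 1.11: the CPU histogram kernel has no bfloat16 implementation,
# so histc() on a bf16 CPU tensor raises the same error wandb reports.
t = torch.randn(1000).to(torch.bfloat16)  # CPU tensor, like the ones wandb logs

print(torch.histc(t.float(), bins=64))  # fine after upcasting to fp32
print(torch.histc(t, bins=64))          # RuntimeError: "histogram_cpu" not implemented for 'BFloat16'
```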
The error doesn’t occur if I run the same script without deepspeed, and no error is thrown if I use deepspeed but don’t report to wandb.
A very similar issue was reported to wandb last month. The wandb developers say it’s a PyTorch issue rather than a wandb one, but since everything works without deepspeed, maybe there’s something different about how the deepspeed integration reports to wandb?
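My guess at why only the deepspeed run crashes (an assumption on my part, not verified against deepspeed internals): plain `--bf16` runs under autocast and leaves the model parameters in fp32, while deepspeed’s bf16 engine casts the module itself, so the tensors wandb tries to histogram are genuinely bf16. A toy illustration:

```python
import torch

# fp32 parameters, as in the non-deepspeed run: histogramming works
layer = torch.nn.Linear(4, 4)
print(layer.weight.dtype)  # torch.float32

# roughly what a bf16 engine does to the module: now histc fails on CPU
layer = layer.to(torch.bfloat16)
print(layer.weight.dtype)  # torch.bfloat16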
Expected behavior
The training should continue without crashing and report as much info to wandb as possible (I’m not sure whether bf16 limits what can be logged).
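If the guess above is right, a possible mitigation (not a confirmed fix) is to stop the integration from histogramming parameters and gradients at all: transformers’ `WandbCallback` checks the `WANDB_WATCH` environment variable and skips `wandb.watch()` when it is set to `"false"`:

```python
import os

# Possible mitigation, not a confirmed fix: transformers' WandbCallback
# skips wandb.watch() (the source of the histogram calls) when the
# WANDB_WATCH environment variable is "false". Set it before training
# starts, or export it in the shell that launches deepspeed.
os.environ["WANDB_WATCH"] = "false"
```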
Late update: this seems to be fixed with the release of PyTorch 1.12.
Got it! Thanks for the super clear explanation.