`"histogram_cpu" not implemented for 'BFloat16'` when using deepspeed and reporting to wandb
Environment info
- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-20-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: Deepspeed
Who can help
Information
Model I am using (Bert, XLNet …): bart-large
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I’m using a training script adapted from the run_summarization.py example with a model using bart-large architecture and a custom tokenizer. I’m working locally on my workstation with two RTX 3090s. I had been training using deepspeed and fp16, but I saw that the latest transformers update added bf16 support to the deepspeed integration, so I wanted to try that in order to reduce the constant overflow errors I had been getting.
But when using deepspeed, bf16, and reporting to wandb, my training crashes.
I’m able to reproduce the error using the example scripts:
```bash
deepspeed run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --dataset_name cnn_dailymail --dataset_config_name 3.0.0 \
    --do_train --per_device_train_batch_size 4 --bf16 \
    --overwrite_output_dir --output_dir models/text_summarization \
    --deepspeed config/deepspeed_config-zero2-bf16.json
```
with the deepspeed config being:
```json
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
After 500 steps (when saving the first checkpoint), wandb throws this error:
```
RuntimeError: "histogram_cpu" not implemented for 'BFloat16'
```
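The failing op is easy to reproduce outside the Trainer. Here is a minimal sketch, under the assumption (suggested by the error message) that wandb’s logging calls `Tensor.histc()` on parameters it has copied to the CPU:

```python
import torch

# PyTorch 1.11: the CPU histogram kernel has no bfloat16 implementation,
# so histc() on a bf16 CPU tensor raises the same error wandb reports.
t = torch.randn(1000).to(torch.bfloat16)  # CPU tensor, like the ones wandb logs

print(torch.histc(t.float(), bins=64))  # fine after upcasting to fp32
print(torch.histc(t, bins=64))          # RuntimeError: "histogram_cpu" not implemented for 'BFloat16'
```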
The error doesn’t occur if I run the same script without deepspeed, and no error is thrown if I use deepspeed but don’t report to wandb.
A very similar issue was reported to wandb last month. The wandb developers say it’s a PyTorch issue rather than a wandb one, but since everything works without deepspeed, maybe there’s something different about how the deepspeed integration reports to wandb?
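My guess at why only the deepspeed run crashes (an assumption on my part, not verified against deepspeed internals): plain `--bf16` runs under autocast and leaves the model parameters in fp32, while deepspeed’s bf16 engine casts the module itself, so the tensors wandb tries to histogram are genuinely bf16. A toy illustration:

```python
import torch

# fp32 parameters, as in the non-deepspeed run: histogramming works
layer = torch.nn.Linear(4, 4)
print(layer.weight.dtype)  # torch.float32

# roughly what a bf16 engine does to the module: now histc fails on CPU
layer = layer.to(torch.bfloat16)
print(layer.weight.dtype)  # torch.bfloat16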
Expected behavior
The training should continue without crashing and report as much info to wandb as possible (I’m not sure whether bf16 limits what can be logged).
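If the guess above is right, a possible mitigation (not a confirmed fix) is to stop the integration from histogramming parameters and gradients at all: transformers’ `WandbCallback` checks the `WANDB_WATCH` environment variable and skips `wandb.watch()` when it is set to `"false"`:

```python
import os

# Possible mitigation, not a confirmed fix: transformers' WandbCallback
# skips wandb.watch() (the source of the histogram calls) when the
# WANDB_WATCH environment variable is "false". Set it before training
# starts, or export it in the shell that launches deepspeed.
os.environ["WANDB_WATCH"] = "false"
```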
Late update: this seems to be fixed with the release of PyTorch 1.12.
Got it! Thanks for the super clear explanation.