
`"histogram_cpu" not implemented for 'BFloat16'` when using deepspeed and reporting to wandb

See original GitHub issue

Environment info

  • transformers version: 4.18.0
  • Platform: Linux-5.13.0-20-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • Huggingface_hub version: 0.5.1
  • PyTorch version (GPU?): 1.11.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: Deepspeed

Who can help

@stas00

Information

Model I am using (Bert, XLNet …): bart-large

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I’m using a training script adapted from the run_summarization.py example with a model using bart-large architecture and a custom tokenizer. I’m working locally on my workstation with two RTX 3090s. I had been training using deepspeed and fp16, but I saw that the latest transformers update added bf16 support to the deepspeed integration, so I wanted to try that in order to reduce the constant overflow errors I had been getting.

But when using deepspeed, bf16, and reporting to wandb, my training crashes.

I’m able to reproduce the error using the example scripts:

deepspeed run_summarization.py \
      --model_name_or_path facebook/bart-large \
      --dataset_name cnn_dailymail --dataset_config_name 3.0.0 \
      --do_train --per_device_train_batch_size 4 --bf16  \
      --overwrite_output_dir --output_dir models/text_summarization \
      --deepspeed config/deepspeed_config-zero2-bf16.json 

with the deepspeed config being:

{
  "bf16": {
     "enabled": true
   },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "cpu_offload": false
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

After 500 steps (when saving the first checkpoint), wandb throws this error: `RuntimeError: "histogram_cpu" not implemented for 'BFloat16'`

The error doesn’t occur if I run the same script without deepspeed. And no other error gets thrown if I use deepspeed and don’t report to wandb.

A very similar issue was reported to wandb last month. The wandb people say it’s an issue with pytorch and not wandb, but since everything is working without deepspeed, maybe there’s something different about how the deepspeed integration is reporting to wandb?
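As a hedged illustration (this repro is my assumption, not something from the issue thread): the crash is consistent with wandb building a histogram directly over raw bf16 tensors on CPU, where PyTorch 1.11's histogram kernel had no BFloat16 implementation. A minimal sketch of the usual workaround, assuming it's acceptable to cast before histogramming:

```python
import torch

# Hypothetical illustration: on PyTorch < 1.12, histogramming a
# bfloat16 tensor on CPU raised: "histogram_cpu" not implemented
# for 'BFloat16'. Casting to float32 first sidesteps the missing kernel.
t = torch.randn(8, dtype=torch.bfloat16)

counts = torch.histc(t.float(), bins=4)  # cast up, then histogram
assert int(counts.sum().item()) == t.numel()  # every element lands in a bin
```

If the cast were applied where the integration hands tensors to wandb, logging would work regardless of the training dtype; the cost is a temporary float32 copy of each logged tensor.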

Expected behavior

The training should continue without crashing, and should report as much info to wandb as possible (I'm not sure whether bf16 introduces any limits on that).

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
jncasey commented, Jun 28, 2022

Late update: this seems to be fixed with the release of PyTorch 1.12.

1 reaction
jncasey commented, Apr 11, 2022

Got it! Thanks for the super clear explanation.
