
accelerate test fails with deepspeed and fp16 enabled in config

See original GitHub issue

Hi there,

First, thanks for the great work.

I wanted to give accelerate a spin and followed the docs to set up a configuration file with both deepspeed and fp16 enabled. Here's the resulting yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  offload_optimizer_device: cpu
  zero_stage: 3
distributed_type: DEEPSPEED
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

I then tried to test the setup using: accelerate test --config_file ./my_config.yaml

This throws AttributeError: 'DeepSpeedPlugin' object has no attribute 'fp16', which seems to stem from line 232 of accelerate/state.py: use_fp16 = self.deepspeed_plugin.fp16 if self.distributed_type == DistributedType.DEEPSPEED else self.use_fp16

Let me know if you need any more information 😃
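
For context on where this blows up, the snippet below is a minimal sketch (not from the issue) that mirrors the failing attribute access outside of accelerate test. The DeepSpeedPlugin arguments are assumed to match the yaml above and may be named differently on other accelerate releases.

from accelerate import DeepSpeedPlugin

# Plugin options assumed to mirror the reporter's yaml config; exact field
# names can differ between accelerate releases.
plugin = DeepSpeedPlugin(
    gradient_accumulation_steps=2,
    zero_stage=3,
    offload_optimizer_device="cpu",
)

# On the affected release the DeepSpeedPlugin dataclass defines no `fp16`
# field, so the same lookup performed in accelerate/state.py raises
# AttributeError: 'DeepSpeedPlugin' object has no attribute 'fp16'.
print(plugin.fp16)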

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
nsmdgr commented, Nov 11, 2021

Sure. I have version 0.5.5 installed today via pip install deepspeed.

0 reactions
chris-opendata commented, May 25, 2022

This is fixed in the latest release.

Thank you for your update.
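
Since the fix shipped in a later accelerate release, here is a rough sketch (not from the thread) of the equivalent programmatic setup on a current version, where fp16 is requested through the Accelerator rather than read from the plugin; treat the exact keyword arguments as assumptions, since they have shifted between releases (older versions used fp16=True instead of mixed_precision).

from accelerate import Accelerator, DeepSpeedPlugin

# DeepSpeed options mirroring the yaml config from the issue.
plugin = DeepSpeedPlugin(
    gradient_accumulation_steps=2,
    zero_stage=3,
    offload_optimizer_device="cpu",
)

# On recent releases, mixed precision is passed to the Accelerator itself.
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=plugin)

# Typical usage: wrap model, optimizer and dataloader before the training loop.
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)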

Read more comments on GitHub >

Top Results From Across the Web

  • DeepSpeed Integration - Hugging Face: Integration of the core DeepSpeed features via Trainer. This is an everything-done-for-you type of integration - just supply your custom config file or...
  • DeepSpeed Configuration JSON: Contents: Batch Size Related Parameters; Optimizer Parameters; Scheduler Parameters; Communication options; FP16 training ...
  • Deploy BLOOM-176B and OPT-30B on Amazon SageMaker ...: Throughput reflects the number of tokens produced per second for each test. For Hugging Face Accelerate, we used the library's default loading ...
  • Train 1 trillion+ parameter models - PyTorch Lightning: Check out this amazing video explaining model parallelism and how it works behind the scenes: ... Below is a summary of all the...
  • Accelerate Stable Diffusion inference with DeepSpeed ...: Note: You need a machine with a GPU and a compatible CUDA installed. You can check this by running nvidia-smi in your terminal....
