Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs
Hi - I'm getting a new error while trying to train a model on an 8 x V100 box. I'm using PyTorch Lightning, but I don't think that should make much of a difference.
Sys config:
- PyTorch 1.8
- CUDA 10.2
- Ubuntu 18.04
- DeepSpeed 0.3.14
- Triton 0.2.3
- Apex (master branch)
- PyTorch Lightning 1.3.0rc1
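The issue doesn't include the DeepSpeed configuration that was used. For context, a minimal ZeRO Stage 3 + CPU offload config looks roughly like the sketch below; the key names follow the current DeepSpeed docs and may differ slightly in 0.3.14, and all values are placeholders.

```python
# Hypothetical minimal ZeRO Stage 3 offload config, written as a Python dict
# (it would normally be saved as a ds_config.json file or handed to the
# trainer's DeepSpeed plugin). Values below are placeholders, not the
# reporter's actual settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},  # optimizer states -> host RAM
        "offload_param": {"device": "cpu"},      # parameters -> host RAM
        "overlap_comm": True,
    },
    "fp16": {"enabled": True},
}
```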
Error trace:
Epoch 0: 0%| | 0/564 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 488, in fit
self.dispatch()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in dispatch
self.accelerator.start_training(self)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 142, in start_training
self._results = trainer.run_stage()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in run_stage
self.run_train()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
self.train_loop.run_training_epoch()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 422, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 575, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1414, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 301, in optimizer_step
self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 47, in pre_optimizer_step
lambda_closure()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 570, in train_step_and_backward_closure
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 709, in backward
result.closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 284, in backward
self.lightning_module, closure_loss, optimizer, optimizer_idx, should_accumulate, *args, **kwargs
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 73, in backward
deepspeed_engine.backward(closure_loss, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1020, in backward
self.allreduce_gradients()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 940, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1393, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1295, in independent_gradient_partition_epilogue
self.partition_previous_reduced_grads()
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1657, in partition_previous_reduced_grads
param.partition_gradients(partition_buffers=self.temp_grad_gpu_buffer)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 460, in partition_gradients
accumulate=accumulate)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 794, in _partition_gradients
accumulate=accumulate)
File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 862, in _partition_gradient
param.grad.data = dest_tensor_full_buffer.data
UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment
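The final UnboundLocalError suggests that inside DeepSpeed's _partition_gradient, dest_tensor_full_buffer is presumably only assigned on some code paths, and on this run none of those paths executed before the variable was read. A stripped-down illustration of that general Python failure mode (not DeepSpeed's actual code) is:

```python
# Minimal illustration of the failure mode behind the traceback above: a local
# variable bound only inside a conditional branch is read afterwards. This is
# NOT DeepSpeed's code, just the pattern that produces
# "local variable ... referenced before assignment".
def partition_gradient(rank: int, world_size: int):
    if rank < world_size:  # imagine this condition unexpectedly evaluating False
        dest_tensor_full_buffer = [0.0] * 4
    # If the branch above was skipped, the next line raises UnboundLocalError.
    return dest_tensor_full_buffer

partition_gradient(rank=8, world_size=8)  # raises UnboundLocalError
```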
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Okay - I'm now finding that it sometimes works and sometimes doesn't. This is getting really weird.
I'll run it once with some settings and it works. Then I run it again and, boom, I get this error.
It could be the dataloader. Let me turn shuffle off and drop the last batch.
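For reference, turning off shuffling and dropping the ragged final batch is a plain torch.utils.data.DataLoader change; the dataset below is a placeholder standing in for the reporter's own data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the issue this would be the reporter's own dataset.
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# shuffle=False removes run-to-run ordering differences; drop_last=True avoids
# a smaller final batch, which can behave differently on 4 vs 8 GPUs when the
# dataset size isn't divisible by the global batch size.
loader = DataLoader(dataset, batch_size=16, shuffle=False, drop_last=True)
```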
@SantoshGuptaML, can you clarify the exact error you are seeing, since multiple issues were involved here?
To your question about printing actual tensor values, you need to use the Gather API, as follows:
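The snippet that originally followed this comment isn't reproduced above. A minimal sketch of the API the maintainer is referring to, DeepSpeed's deepspeed.zero.GatheredParameters context manager, would look roughly like this; `engine` and the specific parameter accessed are assumptions for illustration.

```python
import deepspeed

# Sketch only: `engine` is assumed to be an already-initialized DeepSpeed
# ZeRO Stage 3 engine, and `engine.module.layer.weight` is a placeholder for
# whichever parameter you want to inspect.
def print_param(engine):
    param = engine.module.layer.weight  # placeholder parameter
    # Under ZeRO Stage 3 each rank only holds a shard of the parameter;
    # GatheredParameters temporarily all-gathers the full tensor so its
    # values can be printed.
    with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
        print(param.data)
    # On exiting the context the parameter is re-partitioned automatically.
```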