Crash with CPU offload


Hi there! I have been using this configuration:

{
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e6,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e6,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-6,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 5e-5,
            "warmup_num_steps": 10000
        }
    }
}

I am using it to train a modified XLNet model (with the transformers library) on four 1080 Ti GPUs.
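For context, a config like this is handed to DeepSpeed when the engine is built. The sketch below is not the actual launch code from this issue; it assumes a recent DeepSpeed release where deepspeed.initialize accepts the config as a path or dict (older releases read it from args.deepspeed_config), and XLNetLMHeadModel plus a placeholder dataloader stand in for the modified model and data pipeline.

import deepspeed
from transformers import XLNetLMHeadModel

# Stand-in for the modified XLNet model described above.
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

# DeepSpeed builds the AdamW optimizer and WarmupLR scheduler itself,
# because both are declared in the JSON config shown above.
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # the JSON above, saved to disk
)

# fp16 loss scaling and ZeRO-2 CPU offload are handled inside
# engine.backward() / engine.step(); `dataloader` and labelled batches
# are assumed to exist.
for batch in dataloader:
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()

Across the four GPUs this would typically be started with the deepspeed launcher, e.g. deepspeed --num_gpus=4 train.py.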

However, after roughly 20 iterations, once the loss scale has settled and training is under way, it crashes in this function from deepspeed/runtime/zero/stage2.py:

def complete_grad_norm_calculation_for_cpu_offload(self, params):
    total_norm = 0.0
    norm_type = 2.0
    for p in params:
        if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
            param_id = self.get_param_id(p)
            # the KeyError is raised here when param_id has no recorded norm
            param_norm = self.norm_for_param_grads[param_id]
            total_norm += param_norm.item()**2

It fails with a KeyError on self.norm_for_param_grads[param_id].

I just sidestepped this by wrapping the lookup in a try/except:

try:
    param_norm = self.norm_for_param_grads[param_id]
    total_norm += param_norm.item()**2
except KeyError:
    pass

and it continues to train. Would anyone know what is happening?
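One quick way to narrow this down (the answers below point at unused parameters as the cause) is to run a single forward/backward pass with DeepSpeed disabled and list the parameters that never receive a gradient. A minimal sketch, with model and batch standing in for the actual XLNet setup:

# One step in plain PyTorch, before any optimizer step has run: parameters
# that play no part in producing the loss keep .grad == None after backward().
loss = model(**batch).loss  # assumes a transformers-style model fed labels
loss.backward()

unused = [name for name, p in model.named_parameters()
          if p.requires_grad and p.grad is None]
print(f"{len(unused)} trainable parameters received no gradient:")
for name in unused:
    print("   ", name)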

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

4 reactions
ghosthamlet commented, Mar 15, 2021

@mrgjbd is right; here is a more detailed explanation. The KeyError is caused by unused parameters. If you disable DeepSpeed and use torch.nn.parallel.DistributedDataParallel with find_unused_parameters=False, you may see this error message instead:

    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. 
This error indicates that your module has parameters that were not used in producing loss. 
You can enable unused parameter detection by 
(1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; 
(2) making sure all `forward` function outputs participate in calculating loss.
 If you already have done the above two steps, then the distributed data parallel module wasn't able to locate 
the output tensors in the return value of your module's `forward` function. 
Please include the loss function and the structure of the return value of `forward` of your module when 
reporting this issue (e.g. list, dict, iterable).

These errors happen when the model has trainable parameters that are skipped during training. Skipped parameters never go through backward, so the hooks registered for them in self.create_reduce_and_remove_grad_hooks() of ZeRO stage 2 never run, and they end up with no entry in norm_for_param_grads. If skipping them is intentional, the try/except hack by @pedrocolon93 above is a valid workaround, or better:

if param_id in self.norm_for_param_grads: 
    param_norm = self.norm_for_param_grads[param_id] 
    total_norm += param_norm.item()**2 
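For reference, this is roughly where that guard sits inside the loop quoted in the original post (a paraphrase, not a verbatim patch; the rest of the method is unchanged):

def complete_grad_norm_calculation_for_cpu_offload(self, params):
    total_norm = 0.0
    for p in params:
        if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
            param_id = self.get_param_id(p)
            # Skipped parameters never ran their backward hook, so they have
            # no recorded norm; treating the missing entry as zero is safe.
            if param_id in self.norm_for_param_grads:
                param_norm = self.norm_for_param_grads[param_id]
                total_norm += param_norm.item() ** 2
    # ... cross-rank reduction and the final norm computation follow as in
    # the original method.

If editing the installed package is not an option, the try/except workaround from the original post has the same effect.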
2 reactions
HHousen commented, Feb 25, 2021

I am getting this same error, and I am not using model parallelism (the is_model_parallel_parameter function still returns True because of deepspeed/runtime/pipe/module.py line 246). https://github.com/huggingface/transformers/pull/9622 fixed a similar crash caused by gradient accumulation steps (https://github.com/microsoft/DeepSpeed/issues/671). For me it happens every time after exactly 20 steps. I am using pytorch-lightning with a huggingface/transformers model.

Here is the portion of the traceback involving DeepSpeed:

  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 30, in pre_optimizer_step
    deepspeed_engine.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py", line 959, in step
    self._take_model_step(lr_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py", line 914, in _take_model_step
    self.optimizer.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/stage2.py", line 1379, in step
    self.params_in_partition[i]))
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/stage2.py", line 881, in complete_grad_norm_calculation_for_cpu_offload
    param_norm = self.norm_for_param_grads[param_id]
KeyError: 8