Crash with cpu offload
Hi there! I have been using this configuration:
{
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e6,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e6,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-6,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-5,
      "warmup_num_steps": 10000
    }
  }
}
to train a modified XLNet model (using the transformers library) on four GTX 1080 Ti GPUs.
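For reference, here is a minimal sketch of how a config like this is typically wired up, assuming it is saved as ds_config.json; the model choice, train_loader, and batch keys are placeholders of mine, not details from the issue:

import deepspeed
from transformers import XLNetLMHeadModel

model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

# Newer DeepSpeed releases accept the JSON path via the `config=` keyword;
# older ones expect it through the --deepspeed_config command-line argument.
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch in train_loader:  # train_loader: placeholder DataLoader yielding dicts of tensors
    outputs = engine(input_ids=batch["input_ids"], labels=batch["labels"])
    engine.backward(outputs.loss)  # fp16 loss scaling is handled by the engine
    engine.step()                  # AdamW step plus WarmupLR scheduler step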
However, after roughly 20 iterations, once the loss scale has settled and training is under way, it crashes in this function:
def complete_grad_norm_calculation_for_cpu_offload(self, params):
    total_norm = 0.0
    norm_type = 2.0
    for p in params:
        if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
            param_id = self.get_param_id(p)
            # KeyError is raised here when param_id has no recorded gradient norm
            param_norm = self.norm_for_param_grads[param_id]
            total_norm += param_norm.item()**2
It fails with a KeyError on self.norm_for_param_grads[param_id].
I just sidestepped this with a

try:
    param_norm = self.norm_for_param_grads[param_id]
    total_norm += param_norm.item()**2
except:
    pass

and it continues to train. Would anyone know what is happening?
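For what it's worth, the comments below trace this KeyError to trainable parameters that never take part in backward. A tiny, self-contained illustration of that situation (my own toy example, not code from the issue):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(16, 16)
        self.unused = nn.Linear(16, 16)   # trainable, but never called in forward()

    def forward(self, x):
        return self.used(x)

model = ToyModel()
loss = model(torch.randn(2, 16)).sum()
loss.backward()
print(model.used.weight.grad is not None)   # True
print(model.unused.weight.grad is None)     # True: no gradient ever reaches this parameter

Under ZeRO stage 2 with CPU offload, a parameter like unused above never fires its gradient hook, so it never gets an entry in norm_for_param_grads.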
Top GitHub Comments
@mrgjbd is right; here is a more detailed explanation. The KeyError is caused by unused parameters. If you disable DeepSpeed and use torch.nn.parallel.DistributedDataParallel with find_unused_parameters=False, you may see a related error message. These errors happen when the model has trainable parameters that are skipped during training: the skipped parameters never go through backward, so the backward hooks registered in self.create_reduce_and_remove_grad_hooks() of ZeRO stage 2 never run for them, and they end up with no entry in norm_for_param_grads. If skipping those parameters is what you want, then the hack by @pedrocolon93 is the right way:

try:
    param_norm = self.norm_for_param_grads[param_id]
    total_norm += param_norm.item()**2
except:
    pass

or better:
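A stricter variant of that hack (a sketch of my own, not necessarily what the commenter had in mind) keeps the method's structure but tests for the key instead of catching every exception, so unrelated errors still surface:

def complete_grad_norm_calculation_for_cpu_offload(self, params):
    total_norm = 0.0
    norm_type = 2.0
    for p in params:
        if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
            param_id = self.get_param_id(p)
            # The entry is absent for parameters that were skipped in backward this step
            if param_id in self.norm_for_param_grads:
                param_norm = self.norm_for_param_grads[param_id]
                total_norm += param_norm.item()**2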
I am getting this same error. I am not using model parallelism (the is_model_parallel_parameter function still returns True because of deepspeed/runtime/pipe/module.py line 246). https://github.com/huggingface/transformers/pull/9622 fixed a similar crash caused by gradient accumulation steps (https://github.com/microsoft/DeepSpeed/issues/671). For me it happens every time after exactly 20 steps. I am using pytorch-lightning with a huggingface/transformers model. Here is the portion of the traceback involving DeepSpeed: