[BUG] RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
Describe the bug
I'm trying to get DeepSpeed ZeRO-Infinity to run with NVMe offloading. I initially got an assertion error which I believe is similar to this AsyncIO error. I followed the guidelines in that thread and reduced max_in_cpu to a multiple of 512, which made that error go away; however, I now receive the following error:
File "run_summarization.py", line 799, in <module>
main()
File "run_summarization.py", line 677, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1422, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 2027, in training_step
loss = self.deepspeed.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 1667, in backward
self.optimizer.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1774, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2049, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1810, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1868, in __reduce_and_partition_ipg_grads
Traceback (most recent call last):
File "run_summarization.py", line 799, in <module>
self.__partition_grads(self.__params_in_ipg_bucket, grad_partitions)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1984, in __partition_grads
grad_partition.numel())
RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
main()
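For what it's worth, the message matches what torch.Tensor.narrow raises when the requested slice is longer than the underlying buffer, so my guess is that the flat buffer __partition_grads narrows into has somehow ended up with only a single element. A standalone sketch that reproduces the same message (my own snippet, not the actual DeepSpeed code path):

# Reproduce the error shape (not DeepSpeed code): narrowing a 1-element
# buffer with a 1048576-element window raises the same RuntimeError.
import torch

flat_buffer = torch.zeros(1)           # degenerate 1-element buffer
try:
    flat_buffer.narrow(0, 0, 1048576)  # dim=0, start=0, length=1048576 > size 1
except RuntimeError as err:
    print(err)  # start (0) + length (1048576) exceeds dimension size (1).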
To Reproduce
Steps to reproduce the behavior:
- git clone https://github.com/huggingface/transformers.git
- huggingface-cli login
- sed -i 's/load_optimizer_states=True/load_optimizer_states=False/g' ../transformers/src/transformers/trainer.py
- sed -i 's/load_lr_scheduler_states=True/load_lr_scheduler_states=False/g' ../transformers/src/transformers/trainer.py
- create a JSON file called ds_config_zero3.json with the following DeepSpeed settings (a small sanity-check sketch follows the config):
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "../../workspace",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "../../workspace",
"max_in_cpu": 99876864
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
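For completeness, a quick sanity check over the config (my own helper sketch, not part of DeepSpeed or transformers) that the file parses, the NVMe offload path exists, and max_in_cpu keeps the 512-element alignment mentioned above:

# Sanity-check sketch for ds_config_zero3.json (my own helper, not DeepSpeed code):
# verifies the JSON parses, the NVMe offload path exists, and max_in_cpu stays a
# multiple of 512 (the alignment workaround described at the top of this issue).
import json
import os

with open("ds_config_zero3.json") as f:
    cfg = json.load(f)

zero = cfg["zero_optimization"]
nvme_path = zero["offload_param"]["nvme_path"]
assert os.path.isdir(nvme_path), "offload_param.nvme_path does not exist: %s" % nvme_path
assert zero["offload_param"]["max_in_cpu"] % 512 == 0, "max_in_cpu is not a multiple of 512"
print("config looks sane")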
- run the following command:
deepspeed transformers/examples/pytorch/summarization/run_summarization.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path allenai/led-large-16384 \
--per_device_train_batch_size 2 \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--predict_with_generate \
--report_to wandb \
--load_best_model_at_end True \
--greater_is_better True \
--evaluation_strategy steps \
--metric_for_best_model rouge_average \
--pad_to_max_length True \
--max_source_length 1024 \
--generation_max_length 512 \
--save_steps 1200 \
--eval_steps 400 \
--logging_steps 400 \
--dataset_name kaizan/amisum_v1 \
--learning_rate 0.00005 \
--num_train_epochs 10 \
--weight_decay 0.5
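(Side note on the launcher: if I understand correctly, the deepspeed launcher uses all locally visible GPUs by default; to pin it to four explicitly one can pass --num_gpus 4 before the script path, e.g. deepspeed --num_gpus 4 transformers/examples/pytorch/summarization/run_summarization.py with the same arguments as above.)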
Expected behavior
Expected to download the model, parallelise across 4 GPUs, and then start training whilst offloading parameters to NVMe storage.
ds_report output
[2022-06-08 20:49:19,034] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']
torch version .................... 1.8.0
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2, hip 0.0
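Regarding the _all_gather_base warning at the top of the ds_report output: as far as I can tell it just means this torch build predates that collective, so DeepSpeed falls back to the slower all_gather. A one-liner (my own snippet) to confirm:

# Check whether this torch build exposes the faster _all_gather_base collective
# (absent on this torch 1.8.0 install, hence the fallback warning above).
import torch.distributed as dist
print(hasattr(dist, "_all_gather_base"))  # prints False here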
System info (please complete the following information):
- OS = Linux
- GPU count = 4x Tesla V100S
- Python = 3.6.9
Launcher context
deepspeed launcher
Docker context N/A
Additional context N/A
Top GitHub Comments
Please try #2011. FYI, you will likely run into another error after getting past this. The new failure has to do with offload buffer management. I am looking into it.
Thanks @tjruwase, this seems to be working now! Weirdly, it also works on the main branch, which doesn't have these changes; I can't explain what fixed it. Nonetheless, I really appreciate all your help with this!