[BUG] RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
Describe the bug
I'm trying to get DeepSpeed ZeRO-Infinity to run with NVMe offloading. I initially got an assertion error which I believe is similar to this AsyncIO error. I followed the guidelines in that thread and reduced max_in_cpu to a multiple of 512, which made that error go away; however, I now receive the following error:
File "run_summarization.py", line 799, in <module>
main()
File "run_summarization.py", line 677, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1422, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 2027, in training_step
loss = self.deepspeed.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 1667, in backward
self.optimizer.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1774, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2049, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1810, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1868, in __reduce_and_partition_ipg_grads
Traceback (most recent call last):
File "run_summarization.py", line 799, in <module>
self.__partition_grads(self.__params_in_ipg_bucket, grad_partitions)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1984, in __partition_grads
grad_partition.numel())
RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
main()
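For what it's worth, the message matches what torch.Tensor.narrow raises when the requested slice is longer than the underlying buffer, so my guess is that the flat buffer __partition_grads narrows into has somehow ended up with only a single element. A standalone sketch that reproduces the same message (my own snippet, not the actual DeepSpeed code path):

# Reproduce the error shape (not DeepSpeed code): narrowing a 1-element
# buffer with a 1048576-element window raises the same RuntimeError.
import torch

flat_buffer = torch.zeros(1)           # degenerate 1-element buffer
try:
    flat_buffer.narrow(0, 0, 1048576)  # dim=0, start=0, length=1048576 > size 1
except RuntimeError as err:
    print(err)  # start (0) + length (1048576) exceeds dimension size (1).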
To Reproduce
Steps to reproduce the behavior:
- git clone https://github.com/huggingface/transformers.git
- huggingface-cli login
- sed -i 's/load_optimizer_states=True/load_optimizer_states=False/g' ../transformers/src/transformers/trainer.py
- sed -i 's/load_lr_scheduler_states=True/load_lr_scheduler_states=False/g' ../transformers/src/transformers/trainer.py
- create a JSON file called ds_config_zero3.json with the following DeepSpeed settings (a small sanity-check sketch follows the config):
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "../../workspace",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "../../workspace",
"max_in_cpu": 99876864
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
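For completeness, a quick sanity check over the config (my own helper sketch, not part of DeepSpeed or transformers) that the file parses, the NVMe offload path exists, and max_in_cpu keeps the 512-element alignment mentioned above:

# Sanity-check sketch for ds_config_zero3.json (my own helper, not DeepSpeed code):
# verifies the JSON parses, the NVMe offload path exists, and max_in_cpu stays a
# multiple of 512 (the alignment workaround described at the top of this issue).
import json
import os

with open("ds_config_zero3.json") as f:
    cfg = json.load(f)

zero = cfg["zero_optimization"]
nvme_path = zero["offload_param"]["nvme_path"]
assert os.path.isdir(nvme_path), "offload_param.nvme_path does not exist: %s" % nvme_path
assert zero["offload_param"]["max_in_cpu"] % 512 == 0, "max_in_cpu is not a multiple of 512"
print("config looks sane")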
- run the following command:
deepspeed transformers/examples/pytorch/summarization/run_summarization.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path allenai/led-large-16384 \
--per_device_train_batch_size 2 \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--predict_with_generate \
--report_to wandb \
--load_best_model_at_end True \
--greater_is_better True \
--evaluation_strategy steps \
--metric_for_best_model rouge_average \
--pad_to_max_length True \
--max_source_length 1024 \
--generation_max_length 512 \
--save_steps 1200 \
--eval_steps 400 \
--logging_steps 400 \
--dataset_name kaizan/amisum_v1 \
--learning_rate 0.00005 \
--num_train_epochs 10 \
--weight_decay 0.5
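(Side note on the launcher: if I understand correctly, the deepspeed launcher uses all locally visible GPUs by default; to pin it to four explicitly one can pass --num_gpus 4 before the script path, e.g. deepspeed --num_gpus 4 transformers/examples/pytorch/summarization/run_summarization.py with the same arguments as above.)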
Expected behavior
Expected to download the model, parallelise across 4 GPUs, and then start training whilst offloading parameters to NVMe storage.
ds_report output
[2022-06-08 20:49:19,034] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']
torch version .................... 1.8.0
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2, hip 0.0
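Regarding the _all_gather_base warning at the top of the ds_report output: as far as I can tell it just means this torch build predates that collective, so DeepSpeed falls back to the slower all_gather. A one-liner (my own snippet) to confirm:

# Check whether this torch build exposes the faster _all_gather_base collective
# (absent on this torch 1.8.0 install, hence the fallback warning above).
import torch.distributed as dist
print(hasattr(dist, "_all_gather_base"))  # prints False here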
System info (please complete the following information):
- OS = Linux
- GPU count = 4x Tesla V100S
- Python = 3.6.9
Launcher context
deepspeed launcher
Docker context N/A
Additional context N/A
Top GitHub Comments
Please try #2011. FYI, you will likely run into another error after getting past this. The new failure has to do with offload buffer management. I am looking into it.
Thanks @tjruwase, this seems to be working now! Weirdly, it also works on the main branch, which doesn't have these changes; I can't explain what fixed it. Nonetheless, I really appreciate all your help with this!