
Caffe2 error in forward method when using FSDP


System Info

- `Accelerate` version: 0.11.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.8.5
- Numpy version: 1.23.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'min_num_params': 2000, 'offload_params': False, 'sharding_strategy': 1}
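
For reference, a minimal sketch of roughly what the SIZE_BASED_WRAP / min_num_params: 2000 setting above maps to in raw PyTorch FSDP, assuming the size_based_auto_wrap_policy helper from torch.distributed.fsdp.wrap (available in recent PyTorch releases); the commented-out FSDP call is illustrative only:

import functools

from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Any submodule whose not-yet-wrapped parameter count exceeds min_num_params is
# given its own FSDP unit; a threshold of 2000 wraps almost every block separately.
auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=2000,  # value taken from the config above
)

# Building the sharded model needs an initialized process group, so only sketched here:
# sharded_model = torch.distributed.fsdp.FullyShardedDataParallel(
#     model, auto_wrap_policy=auto_wrap_policy
# )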

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I attach a full error paste here: https://carperai.notion.site/RuntimeError-The-tensor-has-a-non-zero-number-of-elements-but-its-data-is-not-allocated-yet-Caffe-1cde9ba4104e47c2be65377c6c742f3d

I am using Accelerate to implement distributed PPO training of GPT-J via the trl library (see the linked repository). To reproduce: install the repo, switch to the 'neo-updates' branch at the 'accelerate example config update' commit, and run `accelerate launch test_trl_accelerate` with the pasted accelerate config.

I have verified that the example NLP script works. Thanks for any help!

Expected behavior

No caffe2 allocation error is thrown.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
pacman100 commented, Jul 27, 2022

Hello,

I spent a major part of today diving deep into this. I'm observing very weird behaviour, but I got a small script to work.

  1. The code (dist_gen.py) is below:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from accelerate import Accelerator


def main():
    accelerator = Accelerator()
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.config.pad_token_id = model.config.eos_token_id
    model = accelerator.prepare(model)
    model.to(accelerator.device)
    print(accelerator.state)
    accelerator.print(model)

    # Give each of the 2 launched processes its own prompt.
    rank = torch.distributed.get_rank()
    if rank == 0:
        text_in = "The purpose of life is "
    elif rank == 1:
        text_in = "Are you human? "

    batch = tokenizer(text_in, return_tensors="pt").to(accelerator.device)

    # Had to run this once at the start, otherwise a device mismatch error was raised.
    # So, before calling `model.generate` directly, pass a batch of dummy data through the model.
    outputs = model(**batch)

    print(batch)
    gen_kwargs = {
        "max_length": 64,
        "num_beams": 10,
        "min_length": 20,
        "length_penalty": False,
        "no_repeat_ngram_size": 3,
        "repetition_penalty": 1.2,
    }
    with torch.no_grad():
        unwrapped_model = accelerator.unwrap_model(model)
        # synced_gpus=True was necessary, otherwise generation hung indefinitely.
        outputs = unwrapped_model.generate(batch["input_ids"], synced_gpus=True, **gen_kwargs)

    text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nrank{rank}:\n   in={text_in}\n  out={text_out}")


if __name__ == "__main__":
    main()
  2. Accelerate config (predict_fsdp.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: GPT2Block
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
  3. Command run:
accelerate launch --config_file ~/predict_fsdp.yaml ~/dist_gen.py
  4. Output: screenshot of the run's console output (image not reproduced here)

  5. Peculiar behaviour:
     a. Had to run model(**dummy_batch) once at the start, otherwise a device mismatch error was raised. So, before calling model.generate directly, pass a batch of dummy data through the model.
     b. synced_gpus=True had to be passed to model.generate, otherwise generation hung indefinitely.

Please try these changes and let me know if they fix the issue.

1 reaction
pacman100 commented, Aug 2, 2022

Hello @Dahoas, shared embedding layers should belong to the same FSDP unit, and SIZE_BASED_WRAP puts them in different units, which leads to an error. Hence, for transformers models, TRANSFORMER_BASED_WRAP should be used. For the model you are using from trl, look at the name of the repeating transformer/attention blocks and pass it via fsdp_transformer_layer_cls_to_wrap to overcome the "Exception: Could not find the transformer layer class to wrap in the model." error. I looked at it, and you should change GPT2Block to Block.

After all these changes I tried running the code you shared and I am getting a new error; as it is unrelated to the Accelerate integration, it would be better to create an issue in the PyTorch repo and follow it there. The NO_WRAP policy has the least advantage in reducing memory, and would be almost like using DDP without parameter sharding.
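
For illustration only, here is a minimal sketch of the kind of auto-wrap policy that TRANSFORMER_BASED_WRAP corresponds to, assuming PyTorch's transformer_auto_wrap_policy helper from torch.distributed.fsdp.wrap. GPT2Block is the repeating layer class for a plain transformers GPT-2 model; per the comment above, the trl value-head model would use its Block class instead. The commented-out FSDP call is illustrative, not Accelerate's internals.

import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

# Each GPT2Block becomes its own FSDP unit, while everything left over (the shared
# wte/wpe embeddings, ln_f, lm_head, value head) stays together in the root unit,
# so tied/shared weights are not split across units.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPT2Block},  # for the trl model, use its `Block` class here
)

# Building the sharded model needs an initialized process group, so only sketched here:
# sharded_model = torch.distributed.fsdp.FullyShardedDataParallel(
#     model, auto_wrap_policy=auto_wrap_policy
# )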

output logs:

(11): FullyShardedDataParallel(
            (_fsdp_wrapped_module): FlattenParamsWrapper(                                           
              (_fpw_module): Block(                                                                 
                (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)                       
                (attn): Attention(                                                                  
                  (c_attn): Conv1D()                                                                
                  (c_proj): Conv1D()
                  (attn_dropout): Dropout(p=0.1, inplace=False)
                  (resid_dropout): Dropout(p=0.1, inplace=False)
                )
                (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MLP(
                  (c_fc): Conv1D()
                  (c_proj): Conv1D()
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
      (v_head): ValueHead(
        (summary): Linear(in_features=768, out_features=1, bias=True)
        (activation): Identity()
        (first_dropout): Dropout(p=0.1, inplace=False)
        (last_dropout): Identity()
        (flatten): Flatten(start_dim=1, end_dim=-1)
      )
    )
)                                                                                                 
)
Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

tensor([[ 464, 4007,  286, 1204,  318,  220]], device='cuda:0')
tensor([[8491,  345, 1692,   30,  220]], device='cuda:1')


rank0:
   in=The purpose of life is 
  out=The purpose of life is  to live. It is to live in the world. It is to live in the world. It is to live in the world. It is to live in the world. It is to live in the world. It is to live in the world. It is to live in the
Traceback (most recent call last):
  File "/home/sourab/dist_gen_training.py", line 122, in <module>
Traceback (most recent call last):
  File "/home/sourab/dist_gen_training.py", line 122, in <module>
    main()
  File "/home/sourab/dist_gen_training.py", line 50, in main
    main()
  File "/home/sourab/dist_gen_training.py", line 63, in main
    response_tensors = unwrapped_model.generate(query_tensors, synced_gpus=True, **gen_kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    logits, _, v = model(input_ids)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return func(*args, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/transformers/generation_utils.py", line 905, in generate
    return self.greedy_search(
  File "/home/sourab/dev/lib/python3.8/site-packages/transformers/generation_utils.py", line 1173, in greedy_search
    outputs = self(
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/trl/gpt2.py", line 109, in forward
    return forward_call(*input, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2272, in forward
    transformer_outputs = self.transformer(
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 747, in forward
    outputs = block(
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    self._rebuild_full_params()
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3048, in _rebuild_full_params
    return forward_call(*input, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2272, in forward
    self._rebuild_full_params()
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3048, in _rebuild_full_params
    self._check_rebuild_full_params(p)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3232, in _check_rebuild_full_params
    self._check_rebuild_full_params(p)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3232, in _check_rebuild_full_params
    raise RuntimeError(
RuntimeError: Forward order differs across ranks: rank 0 is rebuilding full parameters in `forward()` for ['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.ln_f.weight', 'transformer.ln_f.bias', 'v_head.summary.weight', 'v_head.summary.bias'] while rank 1 is rebuilding full parameters in `forward()` for ['transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias', 'transformer.h.0.attn.c_attn.weight', 'transformer.h.0.attn.c_attn.bias', 'transformer.h.0.attn.c_proj.weight', 'transformer.h.0.attn.c_proj.bias', 'transformer.h.0.ln_2.weight', 'transformer.h.0.ln_2.bias', 'transformer.h.0.mlp.c_fc.weight', 'transformer.h.0.mlp.c_fc.bias', 'transformer.h.0.mlp.c_proj.weight', 'transformer.h.0.mlp.c_proj.bias']
    raise RuntimeError(
RuntimeError: Forward order differs across ranks: rank 0 is rebuilding full parameters in `forward()` for ['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.ln_f.weight', 'transformer.ln_f.bias', 'v_head.summary.weight', 'v_head.summary.bias'] while rank 1 is rebuilding full parameters in `forward()` for ['transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias', 'transformer.h.0.attn.c_attn.weight', 'transformer.h.0.attn.c_attn.bias', 'transformer.h.0.attn.c_proj.weight', 'transformer.h.0.attn.c_proj.bias', 'transformer.h.0.ln_2.weight', 'transformer.h.0.ln_2.bias', 'transformer.h.0.mlp.c_fc.weight', 'transformer.h.0.mlp.c_fc.bias', 'transformer.h.0.mlp.c_proj.weight', 'transformer.h.0.mlp.c_proj.bias']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1351488) of binary: /home/sourab/dev/bin/python
Traceback (most recent call last):
  File "/home/sourab/dev/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: