caffe2 error in forward method when using fsdp
System Info
- `Accelerate` version: 0.11.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.8.5
- Numpy version: 1.23.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'min_num_params': 2000, 'offload_params': False, 'sharding_strategy': 1}
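
For reference, a rough sketch (not taken from the report) of what the SIZE_BASED_WRAP policy with min_num_params: 2000 above roughly corresponds to in raw PyTorch 1.12; the function and argument names come from torch.distributed.fsdp and may differ in other versions:

```python
# Illustrative only: size-based auto-wrapping as configured above.
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def wrap_size_based(model: nn.Module) -> FSDP:
    # Any submodule holding at least 2000 parameters becomes its own FSDP unit.
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=2000)
    return FSDP(model, auto_wrap_policy=policy)
```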
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I attach a full error paste here: https://carperai.notion.site/RuntimeError-The-tensor-has-a-non-zero-number-of-elements-but-its-data-is-not-allocated-yet-Caffe-1cde9ba4104e47c2be65377c6c742f3d
I am using Accelerate to implement distributed PPO training of GPT-J via the trl library (see here). To reproduce: install the repo, switch to the ‘neo-updates’ branch at the ‘accelerate example config update’ commit, and run accelerate launch test_trl_accelerate with the pasted accelerate config.
I have verified that the example NLP script works. Thanks for any help!
Expected behavior
No caffe2 allocation error should be thrown.
Top GitHub Comments
Hello,
Spent a major part of today diving deep into this. Observing very weird behaviour but got a small script to work.
Output:
Peculiar behaviour:
a. I had to run model(**dummy_batch) once at the start, otherwise I got a device mismatch error. So, before directly using model.generate, pass a batch of dummy data through the model.
b. synced_gpus=True had to be passed to model.generate, otherwise generation hung indefinitely.
Please try these changes and let me know if that fixes the issue.
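
A minimal sketch of the two workarounds above, assuming a generic GPT-2 model and tokenizer rather than the trl policy model from the issue (model name, prompt, and generation arguments are placeholders):

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()  # launched via `accelerate launch` with the FSDP config above
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = accelerator.prepare(model)

dummy_batch = tokenizer(["hello world"], return_tensors="pt").to(accelerator.device)

# (a) run one ordinary forward pass before generation;
#     skipping this produced the device mismatch error mentioned above
with torch.no_grad():
    model(**dummy_batch)

# (b) synced_gpus=True keeps ranks that finish generating early in lockstep with the
#     others, avoiding the indefinite hang mentioned above
outputs = model.generate(**dummy_batch, max_new_tokens=20, synced_gpus=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```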
Hello @Dahoas, shared embedding layers should belong to the same FSDP unit, and SIZE_BASED_WRAP puts them in different units, leading to an error. Hence, for transformers, TRANSFORMER_BASED_WRAP should be used. For the model you are using from trl, look at the name of its attention blocks and pass it to fsdp_transformer_layer_cls_to_wrap to overcome the Exception: Could not find the transformer layer class to wrap in the model. error. I looked at it, and you should change it from GPT2Block to Block. After all these changes, I tried running the code you shared; I am getting a new error, and as it is unrelated to the Accelerate integration, it would be better to create an issue in the PyTorch repo and follow it there. The NO_WRAP policy will have the least advantage in reducing memory and as such would be almost like using DDP without parameter sharding.
Output logs:
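
For illustration, a hedged sketch of what TRANSFORMER_BASED_WRAP with fsdp_transformer_layer_cls_to_wrap set to Block roughly corresponds to in raw PyTorch; the Block class and its import path belong to the trl model and are assumed here:

```python
# Illustrative only: transformer-based auto-wrapping keeps shared/tied embeddings
# in the root FSDP unit while each transformer block becomes its own unit.
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def wrap_transformer_based(model, block_cls):
    # block_cls is the model's attention/transformer layer class,
    # e.g. the `Block` class mentioned above (import path depends on the trl repo).
    policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={block_cls}
    )
    return FSDP(model, auto_wrap_policy=policy)
```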