FSDP - TypeError: load_state_dict() got an unexpected keyword argument 'strict'
See original GitHub issueSystem Info
- `transformers` version: 4.22.0.dev0
- Platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
- Python version: 3.7.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examples
folder (such as GLUE/SQuAD, …) - My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour:
- Clone transformers -
git clone https://github.com/huggingface/transformers.git
- move to transformers folder -
cd transformers
- Install from source -
pip install .
- Move to image-classification example -
cd examples/pytorch/image-classification
- Train the model using fsdp
torchrun --nproc_per_node=4 run_image_classification.py --dataset_name beans --output_dir ./beans_outputs/ --remove_unused_columns False --do_train --do_eval --learning_rate 2e-5 --num_train_epochs 5 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --logging_strategy steps --logging_steps 10 --evaluation_strategy epoch --save_strategy epoch --load_best_model_at_end True --save_total_limit 3 --seed 1337 --fsdp "full_shard auto_wrap"
Expected behavior
Model should get finetuned and saved successfully.
However, the following error is produced
[INFO|trainer.py:1949] 2022-08-07 08:35:00,771 >> Loading best model from ./beans_outputs/checkpoint-165 (score: 0.19044387340545654).
Traceback (most recent call last):
File "run_image_classification.py", line 384, in <module>
main()
File "run_image_classification.py", line 358, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "run_image_classification.py", line 384, in <module>
File "run_image_classification.py", line 384, in <module>
File "run_image_classification.py", line 384, in <module>
main()main()
File "run_image_classification.py", line 358, in main
File "run_image_classification.py", line 358, in main
main()
File "run_image_classification.py", line 358, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1509, in train
ignore_keys_for_eval=ignore_keys_for_eval,ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1867, in _inner_training_loop
self._load_best_model()self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
self._load_best_model()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1992, in _load_best_model
load_result = model.load_state_dict(state_dict, strict=False)load_result = model.load_state_dict(state_dict, strict=False)
TypeErrorTypeError: : load_state_dict() got an unexpected keyword argument 'strict'load_state_dict() got an unexpected keyword argument 'strict'
load_result = model.load_state_dict(state_dict, strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
Full example log - fsdp_error.txt
Torch environment details:
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.10
Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1072-aws-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mlflow-torchserve==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] numpydoc==1.1.0
[pip3] pytorch-kfp-components==0.1.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.6.0
[pip3] torch-optimizer==0.1.0
[pip3] torch-workflow-archiver==0.2.4b20220511
[pip3] torchdata==0.4.0
[pip3] torchmetrics==0.7.3
[pip3] torchserve==0.6.0
[pip3] torchtext==0.13.0
[pip3] torchvision==0.13.0
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py37he8ac12f_0
[conda] mkl_fft 1.2.1 py37h54f3939_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] mlflow-torchserve 0.2.0 pypi_0 pypi
[conda] numpy 1.21.6 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch-kfp-components 0.1.0 pypi_0 pypi
[conda] pytorch-lightning 1.6.5 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torch-model-archiver 0.6.0 pypi_0 pypi
[conda] torch-optimizer 0.1.0 pypi_0 pypi
[conda] torch-workflow-archiver 0.2.4b20220511 pypi_0 pypi
[conda] torchdata 0.4.0 pypi_0 pypi
[conda] torchmetrics 0.7.3 pypi_0 pypi
[conda] torchserve 0.6.0 pypi_0 pypi
[conda] torchtext 0.13.0 pypi_0 pypi
[conda] torchvision 0.13.0 pypi_0 pypi
the issue seems to be appearing after this commit .
Issue Analytics
- State:
- Created a year ago
- Comments:6 (2 by maintainers)
Top Results From Across the Web
`strict` keyword doesn't exist in the optimizer's `load_state_dict`
... gives me "TypeError: load_state_dict() got an unexpected keyword argument 'strict'" opt.load_state_dict(checkpoint['opt_state_dict'], ...
Read more >load_state_dict() got an unexpected keyword argument 'strict ...
When I try to finetune a model, I set the 'strict' argument to load the state_dict, however, an error appears: “TypeError: load_state_dict() ...
Read more >Unexpected keyword argument 'strict' - python - Stack Overflow
From the code below (Python), I get error "TypeError: __init__() got an unexpected keyword argument 'strict'". What is the problem?
Read more >Saving and Loading Models — PyTorch Tutorials 1.0.0 ...
Module model is contained in the model's parameters (accessed with model.parameters() ). A state_dict is simply a Python dictionary object that maps each...
Read more >pytorch __init__() got an unexpected keyword argument 'train'
The error implies that it couldn't recognize the train parameter, in the ImageFolder class. ImageFolder() does not have a parameter train, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Also, when using
auto_wrap
please specify either--fsdp_transformer_layer_cls_to_wrap <value>
or--fsdp_min_num_params <number>
as part of cmd arguments. This is what enables sharding of parameters, gradients and optimizer state across GPUs so that peak memory usage is further decreased drastically and you get the most out of using FSDP. For more details, please refer https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html and https://pytorch.org/docs/1.12/fsdp.html?highlight=fsdp#module-torch.distributed.fsdp.🤗 Trainer FSDP integration doc is being updated to reflect the recent updates in this PR https://github.com/huggingface/transformers/pull/18521. Please refer it for more details.
@rohan-varma Thank you very much. I applied the fix as given in the screenshot and compiled from source. The model is gettting saved in the fsdp mode.
Attached image and logs for the same
vit_fsdp_with_fix.txt