zero_optimization.cpu_offload: true leads to a silent crash
I’m experimenting with various zero_optimization config options and I noticed that when I flip zero_optimization.cpu_offload to true, the application exits without crashing or doing any training.
{
    "train_batch_size": 20,
    "steps_per_print": 2000,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 0,
        "allgather_partitions": true,
        "allgather_bucket_size": 500000000,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 500000000,
        "contiguous_gradients": false,
        "cpu_offload": false
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "wall_clock_breakdown": false
}
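For context, ds_config.json above is picked up via the --deepspeed_config flag and handed to deepspeed.initialize(). Here is a minimal sketch of that standard pattern, assuming a toy model and hard-coded arguments (this is not the actual finetune_trainer.py code path, just the generic DeepSpeed setup the config feeds into):

import argparse
import deepspeed
import torch

# Placeholder model standing in for the real MBart model.
model = torch.nn.Linear(10, 10)

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed / --deepspeed_config
args = parser.parse_args(["--deepspeed", "--deepspeed_config", "ds_config.json"])

# deepspeed.initialize reads ds_config.json (fp16, zero_optimization, optimizer,
# scheduler) and returns an engine wrapping the model, optimizer and LR scheduler.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)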
With zero_optimization.cpu_offload flipped to true, this config leads to a silent exit without doing any training:
Full log
export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=…/…/src USE_TF=0 deepspeed ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json rm: cannot remove ‘output_dir’: No such file or directory [2020-12-18 19:42:37,871] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2020-12-18 19:42:37,897] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json [2020-12-18 19:42:38,631] [INFO] [launch.py:78:main] WORLD INFO DICT: {‘localhost’: [0, 1]} [2020-12-18 19:42:38,631] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0 [2020-12-18 19:42:38,631] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class ‘list’>, {‘localhost’: [0, 1]}) [2020-12-18 19:42:38,631] [INFO] [launch.py💯main] dist_world_size=2 [2020-12-18 19:42:38,631] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1 [‘–deepspeed’, ‘–deepspeed_config’, ‘ds_config.json’] 1 2020-12-18 19:42:40 | WARNING | main | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False 2020-12-18 19:42:40 | INFO | main | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir=‘output_dir’, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: ‘no’>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir=‘runs/Dec18_19-42-40_hope’, logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=‘O1’, local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name=‘output_dir’, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend=‘auto’, sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, 
predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler=‘linear’) [‘–deepspeed’, ‘–deepspeed_config’, ‘ds_config.json’] 0 2020-12-18 19:42:40 | WARNING | main | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False 2020-12-18 19:42:40 | INFO | main | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir=‘output_dir’, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: ‘no’>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir=‘runs/Dec18_19-42-40_hope’, logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=‘O1’, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name=‘output_dir’, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend=‘auto’, sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler=‘linear’) [INFO|configuration_utils.py:431] 2020-12-18 19:42:41,139 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e [INFO|configuration_utils.py:467] 2020-12-18 19:42:41,141 >> Model config MBartConfig { “_num_labels”: 3, “activation_dropout”: 0.0, “activation_function”: “gelu”, “add_bias_logits”: false, “add_final_layer_norm”: true, “architectures”: [ “BartForConditionalGeneration” ], “attention_dropout”: 0.0, “bos_token_id”: 0, “classif_dropout”: 0.0, “classifier_dropout”: 0.0, “d_model”: 1024, “decoder_attention_heads”: 16, “decoder_ffn_dim”: 4096, “decoder_layerdrop”: 0.0, “decoder_layers”: 4, “decoder_start_token_id”: 250020, “do_blenderbot_90_layernorm”: false, “dropout”: 0.1, “encoder_attention_heads”: 16, “encoder_ffn_dim”: 4096, “encoder_layerdrop”: 0.0, “encoder_layers”: 12, “eos_token_id”: 2, “extra_pos_embeddings”: 2, “force_bos_token_to_be_generated”: false, “id2label”: { “0”: “LABEL_0”, “1”: “LABEL_1”, “2”: “LABEL_2” }, “init_std”: 0.02, “is_encoder_decoder”: true, “label2id”: { “LABEL_0”: 0, “LABEL_1”: 1, “LABEL_2”: 2 }, “max_length”: 1000, “max_position_embeddings”: 1024, “model_type”: “mbart”, “normalize_before”: true, “normalize_embedding”: true, “num_beams”: 5, “num_hidden_layers”: 12, “output_past”: true, “pad_token_id”: 1, “save_step”: 7, “scale_embedding”: true, “static_position_embeddings”: false, “use_cache”: true, “variant”: “prelayernorm”, “vocab_size”: 250027 }
[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,415 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e [INFO|configuration_utils.py:467] 2020-12-18 19:42:41,417 >> Model config MBartConfig { “_num_labels”: 3, “activation_dropout”: 0.0, “activation_function”: “gelu”, “add_bias_logits”: false, “add_final_layer_norm”: true, “architectures”: [ “BartForConditionalGeneration” ], “attention_dropout”: 0.0, “bos_token_id”: 0, “classif_dropout”: 0.0, “classifier_dropout”: 0.0, “d_model”: 1024, “decoder_attention_heads”: 16, “decoder_ffn_dim”: 4096, “decoder_layerdrop”: 0.0, “decoder_layers”: 4, “decoder_start_token_id”: 250020, “do_blenderbot_90_layernorm”: false, “dropout”: 0.1, “encoder_attention_heads”: 16, “encoder_ffn_dim”: 4096, “encoder_layerdrop”: 0.0, “encoder_layers”: 12, “eos_token_id”: 2, “extra_pos_embeddings”: 2, “force_bos_token_to_be_generated”: false, “id2label”: { “0”: “LABEL_0”, “1”: “LABEL_1”, “2”: “LABEL_2” }, “init_std”: 0.02, “is_encoder_decoder”: true, “label2id”: { “LABEL_0”: 0, “LABEL_1”: 1, “LABEL_2”: 2 }, “max_length”: 1000, “max_position_embeddings”: 1024, “model_type”: “mbart”, “normalize_before”: true, “normalize_embedding”: true, “num_beams”: 5, “num_hidden_layers”: 12, “output_past”: true, “pad_token_id”: 1, “save_step”: 7, “scale_embedding”: true, “static_position_embeddings”: false, “use_cache”: true, “variant”: “prelayernorm”, “vocab_size”: 250027 }
[INFO|tokenization_utils_base.py:1718] 2020-12-18 19:42:41,418 >> Model name ‘sshleifer/distill-mbart-en-ro-12-4’ not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming ‘sshleifer/distill-mbart-en-ro-12-4’ is a path, a model identifier, or url to a directory containing tokenizer files. [INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/62ed1799c9b9a3c199222637281d38762ae87e00165a2613e31c93b3673f08b8.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8 [INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/added_tokens.json from cache at None [INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/9423d956f3dd4d8fd97112a8d3f87081f6256ce54ccfecd27938c48e294b8aa8.72fa8565f9c8b5dc27e7ac070020aec80359d9da2e5628b3f313f41bf44d322c [INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/f5629ec54e86b66e2e9879777df84ce24ede4c93495e6ce9f9161011260c5344.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8 [INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer.json from cache at None [INFO|tokenization_utils_base.py:925] 2020-12-18 19:42:43,989 >> Assigning [‘ar_AR’, ‘cs_CZ’, ‘de_DE’, ‘en_XX’, ‘es_XX’, ‘et_EE’, ‘fi_FI’, ‘fr_XX’, ‘gu_IN’, ‘hi_IN’, ‘it_IT’, ‘ja_XX’, ‘kk_KZ’, ‘ko_KR’, ‘lt_LT’, ‘lv_LV’, ‘my_MM’, ‘ne_NP’, ‘nl_XX’, ‘ro_RO’, ‘ru_RU’, ‘si_LK’, ‘tr_TR’, ‘vi_VN’, ‘zh_CN’] to the additional_special_tokens key of the tokenizer [INFO|modeling_utils.py:1024] 2020-12-18 19:42:44,314 >> loading weights file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d2a7ade93d629fb16e06233407ab8aa0e70af5532c66c3b38ce2ff905743bf78.fa8ebf3af9c5dec8982ce624e74de87e85c9a944e776b79b8e8bd65126ed2073 Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: [‘lm_head.weight’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|modeling_utils.py:1045] 2020-12-18 19:43:06,939 >> load time=0.8602 [2020-12-18 19:43:07,280] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master [2020-12-18 19:43:07,280] [INFO] [engine.py:147:init] Initializing torch distributed with backend: nccl [INFO|modeling_utils.py:1145] 2020-12-18 19:43:07,318 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.
[WARNING|modeling_utils.py:1147] 2020-12-18 19:43:07,318 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: [‘lm_head.weight’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [2020-12-18 19:43:07,512] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master [2020-12-18 19:43:07,512] [INFO] [engine.py:147:init] Initializing torch distributed with backend: nccl [2020-12-18 19:43:11,225] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2 [2020-12-18 19:43:11,229] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2 Adam Optimizer #0 is created with AVX2 arithmetic capability. Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1 [2020-12-18 19:43:13,258] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale Adam Optimizer #0 is created with AVX2 arithmetic capability. Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1 [2020-12-18 19:43:13,262] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer [2020-12-18 19:43:13,262] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam ( Parameter Group 0 amsgrad: False betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 weight_decay: 3e-07 ) [2020-12-18 19:43:13,262] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale [2020-12-18 19:43:13,262] [INFO] [unfused_optimizer.py:36:init] Fused Lamb Legacy : False group 0 param 0 = 1048576 group 0 param 0 = 1048576
If I flip zero_optimization.cpu_offload to false, everything works:
Full log
export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json rm: cannot remove 'output_dir': No such file or directory [2020-12-18 20:29:55,608] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2020-12-18 20:29:55,634] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json [2020-12-18 20:29:56,371] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]} [2020-12-18 20:29:56,372] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0 [2020-12-18 20:29:56,372] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]}) [2020-12-18 20:29:56,372] [INFO] [launch.py:100:main] dist_world_size=2 [2020-12-18 20:29:56,372] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1 ['--deepspeed', '--deepspeed_config', 'ds_config.json'] 1 2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False 2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, 
predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear') ['--deepspeed', '--deepspeed_config', 'ds_config.json'] 0 2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False 2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear') [INFO|configuration_utils.py:431] 2020-12-18 20:29:58,890 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0 [INFO|configuration_utils.py:467] 2020-12-18 20:29:58,892 >> Model config MBartConfig { "_num_labels": 3, "activation_dropout": 0.0, "activation_function": "gelu", "add_bias_logits": false, "add_final_layer_norm": true, "architectures": [ "BartForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 0, "classif_dropout": 0.0, "classifier_dropout": 0.0, "d_model": 2, "decoder_attention_heads": 1, "decoder_ffn_dim": 4, "decoder_layerdrop": 0.0, "decoder_layers": 2, "do_blenderbot_90_layernorm": false, "dropout": 0.1, "encoder_attention_heads": 1, "encoder_ffn_dim": 4, "encoder_layerdrop": 0.0, "encoder_layers": 2, "eos_token_id": 2, "extra_pos_embeddings": 2, "force_bos_token_to_be_generated": false, "id2label": { "0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2" }, "init_std": 0.02, "is_encoder_decoder": true, "label2id": { "LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2 }, "max_position_embeddings": 1024, "model_type": "mbart", "normalize_before": true, "normalize_embedding": true, "num_beams": 2, "num_hidden_layers": 2, "output_past": true, "pad_token_id": 1, "scale_embedding": true, "static_position_embeddings": false, "use_cache": true, "vocab_size": 250027 }[INFO|configuration_utils.py:431] 2020-12-18 20:29:59,191 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at 
/home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0 [INFO|configuration_utils.py:467] 2020-12-18 20:29:59,192 >> Model config MBartConfig { “_num_labels”: 3, “activation_dropout”: 0.0, “activation_function”: “gelu”, “add_bias_logits”: false, “add_final_layer_norm”: true, “architectures”: [ “BartForConditionalGeneration” ], “attention_dropout”: 0.0, “bos_token_id”: 0, “classif_dropout”: 0.0, “classifier_dropout”: 0.0, “d_model”: 2, “decoder_attention_heads”: 1, “decoder_ffn_dim”: 4, “decoder_layerdrop”: 0.0, “decoder_layers”: 2, “do_blenderbot_90_layernorm”: false, “dropout”: 0.1, “encoder_attention_heads”: 1, “encoder_ffn_dim”: 4, “encoder_layerdrop”: 0.0, “encoder_layers”: 2, “eos_token_id”: 2, “extra_pos_embeddings”: 2, “force_bos_token_to_be_generated”: false, “id2label”: { “0”: “LABEL_0”, “1”: “LABEL_1”, “2”: “LABEL_2” }, “init_std”: 0.02, “is_encoder_decoder”: true, “label2id”: { “LABEL_0”: 0, “LABEL_1”: 1, “LABEL_2”: 2 }, “max_position_embeddings”: 1024, “model_type”: “mbart”, “normalize_before”: true, “normalize_embedding”: true, “num_beams”: 2, “num_hidden_layers”: 2, “output_past”: true, “pad_token_id”: 1, “scale_embedding”: true, “static_position_embeddings”: false, “use_cache”: true, “vocab_size”: 250027 }
[INFO|tokenization_utils_base.py:1718] 2020-12-18 20:29:59,192 >> Model name ‘sshleifer/tiny-mbart’ not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming ‘sshleifer/tiny-mbart’ is a path, a model identifier, or url to a directory containing tokenizer files. [INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/13a2c62c1dabc5357bc38b0694f5829f3db0708d51f1a0f07734f62cc0a825a0.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8 [INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/added_tokens.json from cache at None [INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/33fa7894ab257a74cede3060dca6d2fc609918785e80160f6c057723ece47292.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8 [INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/e9c580e6446c42ed20fb148206f2a9bd75a825278ffa029df063682077d45bb6.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8 [INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer.json from cache at None [INFO|tokenization_utils_base.py:925] 2020-12-18 20:30:01,779 >> Assigning [‘ar_AR’, ‘cs_CZ’, ‘de_DE’, ‘en_XX’, ‘es_XX’, ‘et_EE’, ‘fi_FI’, ‘fr_XX’, ‘gu_IN’, ‘hi_IN’, ‘it_IT’, ‘ja_XX’, ‘kk_KZ’, ‘ko_KR’, ‘lt_LT’, ‘lv_LV’, ‘my_MM’, ‘ne_NP’, ‘nl_XX’, ‘ro_RO’, ‘ru_RU’, ‘si_LK’, ‘tr_TR’, ‘vi_VN’, ‘zh_CN’] to the additional_special_tokens key of the tokenizer Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: [‘lm_head.weight’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|modeling_utils.py:1024] 2020-12-18 20:30:02,107 >> loading weights file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d6eec704737db03a21a794f08b07fcbb71d855562a992cfb1be6193b37a7ff68.61ce63751e40ea882dd1a22b6c9303b954b81ec69d631ab0541750fd856720be [INFO|modeling_utils.py:1045] 2020-12-18 20:30:02,150 >> load time=0.0017 [INFO|modeling_utils.py:1145] 2020-12-18 20:30:02,152 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.
[WARNING|modeling_utils.py:1147] 2020-12-18 20:30:02,152 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: [‘lm_head.weight’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [2020-12-18 20:30:02,195] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master [2020-12-18 20:30:02,195] [INFO] [engine.py:147:init] Initializing torch distributed with backend: nccl [2020-12-18 20:30:02,339] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master [2020-12-18 20:30:02,339] [INFO] [engine.py:147:init] Initializing torch distributed with backend: nccl [2020-12-18 20:30:05,642] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2 [2020-12-18 20:30:05,645] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2 [2020-12-18 20:30:05,674] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer [2020-12-18 20:30:05,674] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam ( Parameter Group 0 betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 weight_decay: 3e-07 ) [2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale [2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale [2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam ( Parameter Group 0 betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 step: 1 weight_decay: 3e-07 ) [2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam ( Parameter Group 0 betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 step: 1 weight_decay: 3e-07 ) [2020-12-18 20:30:05,680] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {‘dynamic_loss_scale’: True, ‘cur_scale’: 4294967296, ‘cur_iter’: 0, ‘last_overflow_iter’: -1, ‘scale_factor’: 2, ‘scale_window’: 1000, ‘optimizer_state_dict’: {‘state’: {0: {‘exp_avg’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0.], device=‘cuda:1’), ‘exp_avg_sq’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device=‘cuda:1’)}}, ‘param_groups’: [{‘lr’: 3e-05, ‘bias_correction’: True, ‘betas’: [0.8, 0.999], ‘eps’: 1e-08, ‘weight_decay’: 3e-07, ‘step’: 1, ‘params’: [0]}]}, ‘fp32_groups_flat’: [tensor([-3.6163e-02, -1.1017e-02, 1.9646e-03, -9.6741e-03, 0.0000e+00, 0.0000e+00, 1.9623e-02, 1.2726e-02, -4.2610e-03, -8.0185e-03, 0.0000e+00, 0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02, 3.1891e-02, 0.0000e+00, 0.0000e+00, 1.1742e-02, 2.5101e-02, -1.1864e-02, -7.1220e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 2.5635e-02, 1.0338e-02, -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02, 3.1799e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.6591e-02, 6.4888e-03, 2.2934e-02, -1.4061e-02, -4.8256e-03, 1.2184e-02, -2.0172e-02, -1.9394e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.2901e-02, 4.0054e-03, 8.0338e-03, -1.1307e-02, 0.0000e+00, 0.0000e+00, 2.8641e-02, 4.8184e-04, -1.0582e-02, 1.1536e-02, 0.0000e+00, 0.0000e+00, -1.0925e-02, -7.4043e-03, 9.5320e-04, 3.4504e-03, 0.0000e+00, 0.0000e+00, 1.7471e-02, 2.3289e-03, 2.1545e-02, 2.8915e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -3.9185e-02, -1.3550e-02, 2.9087e-03, 9.9945e-04, 2.0447e-02, -2.4887e-02, 1.3676e-03, 4.8523e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -4.0253e-02, -1.5764e-03, -4.0039e-02, -2.2980e-02, 1.1307e-02, 4.4373e-02, 1.8646e-02, -2.0630e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.5434e-02, 4.0321e-03, 9.0714e-03, 1.0330e-02, 0.0000e+00, 0.0000e+00, -4.5776e-03, -3.0075e-02, 8.6670e-03, -2.1652e-02, 0.0000e+00, 0.0000e+00, -2.4200e-02, 1.8417e-02, -2.5970e-02, 9.2010e-03, 0.0000e+00, 0.0000e+00, -8.5220e-03, -6.2332e-03, -1.0139e-02, -8.6823e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.4549e-02, -2.5162e-02, -1.4793e-02, 1.6220e-02, 0.0000e+00, 0.0000e+00, -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03, 0.0000e+00, 0.0000e+00, 1.1015e-03, -1.5366e-02, 3.3813e-02, -1.7052e-03, 0.0000e+00, 0.0000e+00, 2.7100e-02, 7.7667e-03, -3.0640e-02, -2.1133e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 6.5536e-03, -1.3023e-02, 
-7.0572e-04, -1.0208e-02, 6.4087e-03, 5.1575e-03, 1.9257e-02, 2.7344e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -3.2867e-02, 2.7817e-02, -2.0920e-02, 2.7580e-03, -1.8356e-02, -2.4857e-02, -1.5450e-02, -1.2680e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 8.5144e-03, -1.6571e-02, -5.7106e-03, -2.2568e-02, 0.0000e+00, 0.0000e+00, 3.8319e-03, -1.2337e-02, -1.1345e-02, -4.2847e-02, 0.0000e+00, 0.0000e+00, -5.4741e-03, -2.9114e-02, 8.7662e-03, 2.9564e-03, 0.0000e+00, 0.0000e+00, 1.7075e-02, 1.0483e-02, -2.0325e-02, 3.5675e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.4648e-02, -2.5375e-02, 1.4200e-03, -5.0621e-03, 0.0000e+00, 0.0000e+00, 2.5284e-02, 1.3382e-02, 5.9319e-03, -1.9791e-02, 0.0000e+00, 0.0000e+00, 4.7821e-02, 2.8944e-04, -3.6407e-02, 2.6886e-02, 0.0000e+00, 0.0000e+00, -3.4424e-02, 8.2550e-03, -1.9302e-02, 3.7476e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0750e-02, -3.7804e-03, 3.7689e-02, -1.9821e-02, -1.4641e-02, 1.4755e-02, -3.3321e-03, 2.1469e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -6.6643e-03, -8.9407e-05, 1.4587e-02, 2.7637e-03, 9.8190e-03, 2.0325e-02, -4.8950e-02, -2.8954e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00], device=‘cuda:1’, requires_grad=True)], ‘clip_grad’: 0.0} FusedAdam ( Parameter Group 0 betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 step: 1 weight_decay: 3e-07 ) <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fee4132d5e0> [2020-12-18 20:30:05,681] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {‘dynamic_loss_scale’: True, ‘cur_scale’: 4294967296, ‘cur_iter’: 0, ‘last_overflow_iter’: -1, ‘scale_factor’: 2, ‘scale_window’: 1000, ‘optimizer_state_dict’: {‘state’: {0: {‘exp_avg’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device=‘cuda:0’), ‘exp_avg_sq’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device=‘cuda:0’)}}, ‘param_groups’: [{‘lr’: 3e-05, ‘bias_correction’: True, ‘betas’: [0.8, 0.999], ‘eps’: 1e-08, ‘weight_decay’: 3e-07, ‘step’: 1, ‘params’: [0]}]}, ‘fp32_groups_flat’: [tensor([-3.6163e-02, -1.1017e-02, 1.9646e-03, -9.6741e-03, 0.0000e+00, 0.0000e+00, 1.9623e-02, 1.2726e-02, -4.2610e-03, -8.0185e-03, 0.0000e+00, 0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02, 3.1891e-02, 0.0000e+00, 0.0000e+00, 1.1742e-02, 2.5101e-02, -1.1864e-02, -7.1220e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 2.5635e-02, 1.0338e-02, -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02, 3.1799e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.6591e-02, 6.4888e-03, 2.2934e-02, -1.4061e-02, -4.8256e-03, 1.2184e-02, -2.0172e-02, -1.9394e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.2901e-02, 4.0054e-03, 8.0338e-03, -1.1307e-02, 0.0000e+00, 0.0000e+00, 2.8641e-02, 4.8184e-04, -1.0582e-02, 1.1536e-02, 0.0000e+00, 0.0000e+00, -1.0925e-02, -7.4043e-03, 9.5320e-04, 3.4504e-03, 0.0000e+00, 0.0000e+00, 1.7471e-02, 2.3289e-03, 2.1545e-02, 2.8915e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -3.9185e-02, -1.3550e-02, 2.9087e-03, 9.9945e-04, 2.0447e-02, -2.4887e-02, 1.3676e-03, 4.8523e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -4.0253e-02, -1.5764e-03, -4.0039e-02, -2.2980e-02, 1.1307e-02, 4.4373e-02, 1.8646e-02, -2.0630e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.5434e-02, 4.0321e-03, 9.0714e-03, 1.0330e-02, 0.0000e+00, 0.0000e+00, -4.5776e-03, -3.0075e-02, 8.6670e-03, -2.1652e-02, 0.0000e+00, 0.0000e+00, -2.4200e-02, 1.8417e-02, -2.5970e-02, 9.2010e-03, 0.0000e+00, 0.0000e+00, -8.5220e-03, -6.2332e-03, -1.0139e-02, -8.6823e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.4549e-02, -2.5162e-02, -1.4793e-02, 1.6220e-02, 0.0000e+00, 0.0000e+00, -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03, 0.0000e+00, 0.0000e+00, 1.1015e-03, -1.5366e-02, 3.3813e-02, -1.7052e-03, 0.0000e+00, 0.0000e+00, 2.7100e-02, 7.7667e-03, -3.0640e-02, -2.1133e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 6.5536e-03, -1.3023e-02, -7.0572e-04, -1.0208e-02, 6.4087e-03, 5.1575e-03, 1.9257e-02, 2.7344e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -3.2867e-02, 2.7817e-02, -2.0920e-02, 2.7580e-03, -1.8356e-02, -2.4857e-02, -1.5450e-02, -1.2680e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 8.5144e-03, -1.6571e-02, -5.7106e-03, -2.2568e-02, 0.0000e+00, 0.0000e+00, 3.8319e-03, -1.2337e-02, -1.1345e-02, -4.2847e-02, 0.0000e+00, 0.0000e+00, -5.4741e-03, -2.9114e-02, 8.7662e-03, 
2.9564e-03, 0.0000e+00, 0.0000e+00, 1.7075e-02, 1.0483e-02, -2.0325e-02, 3.5675e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, -1.4648e-02, -2.5375e-02, 1.4200e-03, -5.0621e-03, 0.0000e+00, 0.0000e+00, 2.5284e-02, 1.3382e-02, 5.9319e-03, -1.9791e-02, 0.0000e+00, 0.0000e+00, 4.7821e-02, 2.8944e-04, -3.6407e-02, 2.6886e-02, 0.0000e+00, 0.0000e+00, -3.4424e-02, 8.2550e-03, -1.9302e-02, 3.7476e-02, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0750e-02, -3.7804e-03, 3.7689e-02, -1.9821e-02, -1.4641e-02, 1.4755e-02, -3.3321e-03, 2.1469e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -6.6643e-03, -8.9407e-05, 1.4587e-02, 2.7637e-03, 9.8190e-03, 2.0325e-02, -4.8950e-02, -2.8954e-03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00], device=‘cuda:0’, requires_grad=True)], ‘clip_grad’: 0.0} [2020-12-18 20:30:05,681] [INFO] [engine.py:457:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR [2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640> [2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]] [2020-12-18 20:30:05,681] [INFO] [config.py:644:print] DeepSpeedEngine configuration: [2020-12-18 20:30:05,681] [INFO] [config.py:648:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f303160db50> [2020-12-18 20:30:05,681] [INFO] [config.py:648:print] allreduce_always_fp32 … False [2020-12-18 20:30:05,681] [INFO] [config.py:648:print] amp_enabled … False [2020-12-18 20:30:05,681] [INFO] [config.py:648:print] amp_params … False [2020-12-18 20:30:05,681] [INFO] [config.py:648:print] disable_allgather … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] dump_state … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] dynamic_loss_scale_args … {‘init_scale’: 4294967296, ‘scale_window’: 1000, ‘delayed_shift’: 2, ‘min_scale’: 1} [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] fp16_enabled … True [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] global_rank … 0 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] gradient_accumulation_steps … 1 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] gradient_clipping … 0.0 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] gradient_predivide_factor … 1.0 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] initial_dynamic_scale … 4294967296 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] loss_scale … 0 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] memory_breakdown … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] optimizer_legacy_fusion … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] optimizer_name … adam [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] optimizer_params … {‘lr’: 3e-05, ‘betas’: [0.8, 0.999], ‘eps’: 1e-08, ‘weight_decay’: 3e-07, ‘adam_w_mode’: True} [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] pipeline … {‘stages’: ‘auto’, ‘partition’: ‘best’, ‘seed_layers’: False, ‘activation_checkpoint_interval’: 0} [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] pld_enabled … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] pld_params … False [2020-12-18 20:30:05,682] [INFO] 
[config.py:648:print] prescale_gradients … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] scheduler_name … WarmupLR [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] scheduler_params … {‘warmup_min_lr’: 0, ‘warmup_max_lr’: 3e-05, ‘warmup_num_steps’: 500} 2020-12-18 20:30:05 | INFO | main | *** Train *** [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] sparse_attention … None [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] sparse_gradients_enabled … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] steps_per_print … 2000 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] tensorboard_enabled … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] tensorboard_job_name … DeepSpeedJobName [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] tensorboard_output_path … [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] train_batch_size … 20 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] train_micro_batch_size_per_gpu 10 2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to
Seq2SeqTrainer, --lr_scheduler
arg is ignored. [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] wall_clock_breakdown … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] world_size … 2 [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] zero_allow_untested_optimizer False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] zero_config … { “allgather_bucket_size”: 500000000, “allgather_partitions”: true, “contiguous_gradients”: true, “cpu_offload”: false, “elastic_checkpoint”: true, “load_from_fp32_weights”: true, “overlap_comm”: false, “reduce_bucket_size”: 500000000, “reduce_scatter”: false, “stage”: 0 } [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] zero_enabled … False [2020-12-18 20:30:05,682] [INFO] [config.py:648:print] zero_optimization_stage … 0 [2020-12-18 20:30:05,682] [INFO] [config.py:650:print] json = { “fp16”:{ “enabled”:true, “hysteresis”:2, “loss_scale”:0, “loss_scale_window”:1000, “min_loss_scale”:1 }, “optimizer”:{ “params”:{ “adam_w_mode”:true, “betas”:[ 0.8, 0.999 ], “eps”:1e-08, “lr”:3e-05, “weight_decay”:3e-07 }, “type”:“Adam” }, “scheduler”:{ “params”:{ “warmup_max_lr”:3e-05, “warmup_min_lr”:0, “warmup_num_steps”:500 }, “type”:“WarmupLR” }, “steps_per_print”:2000, “train_batch_size”:20, “wall_clock_breakdown”:false, “zero_optimization”:{ “allgather_bucket_size”:500000000, “allgather_partitions”:true, “contiguous_gradients”:true, “cpu_offload”:false, “overlap_comm”:false, “reduce_bucket_size”:500000000, “reduce_scatter”:false, “stage”:0 } } FusedAdam ( Parameter Group 0 betas: [0.8, 0.999] bias_correction: True eps: 1e-08 lr: 3e-05 step: 1 weight_decay: 3e-07 ) <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640> 2020-12-18 20:30:05 | INFO | main | *** Train *** 2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed toSeq2SeqTrainer
, --lr_scheduler
arg is ignored. [INFO|trainer.py:723] 2020-12-18 20:30:05,688 >> ***** Running training ***** [INFO|trainer.py:724] 2020-12-18 20:30:05,688 >> Num examples = 500 [INFO|trainer.py:725] 2020-12-18 20:30:05,688 >> Num Epochs = 1 [INFO|trainer.py:726] 2020-12-18 20:30:05,688 >> Instantaneous batch size per device = 20 [INFO|trainer.py:727] 2020-12-18 20:30:05,688 >> Total train batch size (w. parallel, distributed & accumulation) = 40 [INFO|trainer.py:728] 2020-12-18 20:30:05,688 >> Gradient Accumulation steps = 1 [INFO|trainer.py:729] 2020-12-18 20:30:05,688 >> Total optimization steps = 13 {‘loss’: inf, ‘learning_rate’: 0.0, ‘epoch’: 0.07692307692307693} 92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 12/13 [00:02<00:00, 5.65it/s][INFO|trainer.py:883] 2020-12-18 20:30:08,588 >>Training completed. Do not forget to share your model on huggingface.co/models =)
{‘epoch’: 1.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00, 5.95it/s] [INFO|trainer.py:1247] 2020-12-18 20:30:08,589 >> Saving model checkpoint to output_dir [INFO|trainer.py:1251] 2020-12-18 20:30:08,589 >> Trainer.model is not a
PreTrainedModel, only saving its state dict. 2020-12-18 20:30:08 | INFO | main | ***** train metrics ***** 2020-12-18 20:30:08 | INFO | main | train_samples_per_second = 172.096 2020-12-18 20:30:08 | INFO | main | train_runtime = 2.9054 2020-12-18 20:30:08 | INFO | main | train_n_ojbs = 500
I know I haven’t provided reproduction info, as I haven’t quite finished working on the integration with HF transformers, but it should be ready soon. I was hoping you could tell from the logs what went wrong. If they aren’t enough, I will update this issue with reproduction details once I have a transformers branch you can experiment with.
Top GitHub Comments
I think I see the issue, based on your stack trace.
Can you please call model.backward() instead of loss.backward()? I assume that model is the return value of deepspeed.initialize().

We very much appreciate you offering to support our DS integration process as well, @g-karthik!
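For reference, a minimal sketch of the training loop that comment describes, where model_engine is what deepspeed.initialize() returns (the toy model, config dict, and random data are placeholders, not code from the issue; this is meant to be run under the deepspeed launcher):

import deepspeed
import torch

# Toy stand-in for the real model.
net = torch.nn.Linear(10, 2)

ds_config = {
    "train_batch_size": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 3e-5}},
}

# deepspeed.initialize wraps the model in an engine; its backward()/step()
# replace loss.backward() and optimizer.step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config_params=ds_config,
)

for _ in range(3):
    x = torch.randn(4, 10, device=model_engine.device)
    y = torch.randint(0, 2, (4,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # not loss.backward()
    model_engine.step()          # optimizer step, LR scheduler, gradient zeroing

Calling plain loss.backward() bypasses the engine’s loss scaling and gradient handling, which is why the suggestion above is to use model.backward() instead.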