Segfault when training large GPT2 models on single GPU
I'm trying to use DeepSpeed to fine-tune GPT2 models on a single RTX 3090 GPU. Using the scripts included with huggingface-transformers, I have been able to get it working up through the 774M model, and the ZeRO optimizations let me double the batch size. However, the CPU Adam optimizer segfaults when I try to train the 1558M model. I am using Ubuntu 20.04, CUDA 11.2, Nvidia drivers 460.32.03, and current git master versions of PyTorch, Transformers, and DeepSpeed.
Here is the script I used:
export BATCH_SIZE=1
export CUDA_VISIBLE_DEVICES=0
export CUDA_HOME=/usr/local/cuda-11.2
export TOKENIZERS_PARALLELISM=false
export MP_SIZE=1
export NUM_WORKERS=1
export NUM_GPUS_PER_WORKER=1
rm -r test_output
USE_TF=0 deepspeed --num_gpus=1 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size $BATCH_SIZE --per_device_eval_batch_size $BATCH_SIZE --fp16 --deepspeed ds_config.json
pofo-corpus.txt is the Poetry Foundation collection in a single text file (around 18MB). Here is the config file:
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 100,
    "hysteresis": 2,
    "min_loss_scale": 1e-24,
    "initial_scale_power": -2
  },
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1.8e7,
    "reduce_scatter": true,
    "reduce_bucket_size": 1.8e7,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 1e-6,
      "warmup_max_lr": 5e-5,
      "warmup_num_steps": 500
    }
  }
}
I’ve messed around with a bunch of the settings, but none of them seem to affect the issue. Here is the output:
rm: cannot remove 'test_output': No such file or directory
[2021-01-18 14:10:29,800] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-18 14:10:29,815] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --fp16 --deepspeed ds_config.json
[2021-01-18 14:10:30,261] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]}
[2021-01-18 14:10:30,261] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-01-18 14:10:30,261] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-01-18 14:10:30,261] [INFO] [launch.py:100:main] dist_world_size=1
[2021-01-18 14:10:30,261] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-01-18 14:10:31,069] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
Using custom data configuration default
Reusing dataset text (/home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,547 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,547 >> Model config GPT2Config {
"_num_labels": 1,
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.3.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,624 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,625 >> Model config GPT2Config {
"_num_labels": 1,
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.3.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/jechk/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/jechk/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/jechk/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1027] 2021-01-18 14:10:32,129 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/jechk/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1143] 2021-01-18 14:10:54,131 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1151] 2021-01-18 14:10:54,131 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-d5e960aa227f7b5e.arrow
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-1b9b3e2f092a373d.arrow
[INFO|trainer.py:442] 2021-01-18 14:10:55,458 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:359] 2021-01-18 14:10:55,458 >> Using amp fp16 backend
[INFO|integrations.py:323] 2021-01-18 14:10:55,459 >> Keeping the `scheduler` config from ds_config.json intact, ignoring any scheduler-specific cl args
[INFO|integrations.py:368] 2021-01-18 14:10:55,459 >> Keeping the `fp16` config from ds_config.json intact, ignoring any fp16-specific cl args
[2021-01-18 14:10:55,459] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10+7b07e12, git-hash=7b07e12, git-branch=master
[2021-01-18 14:10:55,472] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jechk/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.20410561561584473 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-01-18 14:10:57,968] [INFO] [engine.py:540:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-01-18 14:10:57,968] [INFO] [engine.py:545:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
amsgrad: False
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 5e-05
weight_decay: 0.0
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-01-18 14:10:57,968] [INFO] [engine.py:661:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/jechk/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1086723804473877 seconds
[2021-01-18 14:10:58,077] [INFO] [stage2.py:130:__init__] Reduce bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:131:__init__] Allgather bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 1557611200
[2021-01-18 14:11:03,591] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-01-18 14:11:03,591] [INFO] [engine.py:575:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fe0f0994d30>
[2021-01-18 14:11:03,591] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fdfbc497ee0>
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-01-18 14:11:03,591] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe0f0994a60>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] allreduce_always_fp32 ........ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] amp_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] amp_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] checkpoint_tag_validation_enabled True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] checkpoint_tag_validation_fail False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] disable_allgather ............ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] dump_state ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] dynamic_loss_scale_args ...... {'init_scale': 0.25, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1e-24}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] elasticity_enabled ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe0f0994ac0>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] fp16_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] global_rank .................. 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_accumulation_steps .. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_clipping ............ 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_predivide_factor .... 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] initial_dynamic_scale ........ 0.25
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] loss_scale ................... 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] memory_breakdown ............. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_legacy_fusion ...... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_name ............... adamw
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pld_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pld_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] prescale_gradients ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] scheduler_name ............... WarmupLR
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 500}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] sparse_attention ............. None
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] sparse_gradients_enabled ..... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] steps_per_print .............. 10
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_enabled .......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_output_path ......
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] train_batch_size ............. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] train_micro_batch_size_per_gpu 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] wall_clock_breakdown ......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] world_size ................... 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_allow_untested_optimizer True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_config .................. {
"allgather_bucket_size": 18000000.0,
"allgather_partitions": true,
"contiguous_gradients": true,
"cpu_offload": true,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": true,
"reduce_bucket_size": 18000000.0,
"reduce_scatter": true,
"stage": 2
}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_optimization_stage ...... 2
[2021-01-18 14:11:03,592] [INFO] [config.py:739:print] json = {
"fp16":{
"enabled":true,
"hysteresis":2,
"initial_scale_power":-2,
"loss_scale":0,
"loss_scale_window":100,
"min_loss_scale":1e-24
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"optimizer":{
"params":{
"betas":[
0.9,
0.999
],
"eps":1e-08,
"lr":5e-05,
"weight_decay":0.0
},
"type":"AdamW"
},
"scheduler":{
"params":{
"warmup_max_lr":5e-05,
"warmup_min_lr":1e-06,
"warmup_num_steps":500
},
"type":"WarmupLR"
},
"train_micro_batch_size_per_gpu":1,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"allgather_bucket_size":18000000.0,
"allgather_partitions":true,
"contiguous_gradients":true,
"cpu_offload":true,
"overlap_comm":true,
"reduce_bucket_size":18000000.0,
"reduce_scatter":true,
"stage":2
}
}
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028705596923828125 seconds
[INFO|trainer.py:810] 2021-01-18 14:11:03,643 >> ***** Running training *****
[INFO|trainer.py:811] 2021-01-18 14:11:03,643 >> Num examples = 4917
[INFO|trainer.py:812] 2021-01-18 14:11:03,643 >> Num Epochs = 3
[INFO|trainer.py:813] 2021-01-18 14:11:03,643 >> Instantaneous batch size per device = 1
[INFO|trainer.py:814] 2021-01-18 14:11:03,643 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:815] 2021-01-18 14:11:03,643 >> Gradient Accumulation steps = 1
[INFO|trainer.py:816] 2021-01-18 14:11:03,643 >> Total optimization steps = 14751
2021-01-18 14:11:03.737646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
0%| | 0/14751 [00:00<?, ?it/s][W reducer.cpp:1042] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
The program then exits abruptly. The segfault is reported in dmesg:
[ 9250.120732] python3[10345]: segfault at 7fde4685c850 ip 00007fdfbc2057e0 sp 00007fdf8cd3fe40 error 6
[ 9250.120738] python3[10349]: segfault at 7fde846a9f70 ip 00007fdfbc2057e0 sp 00007fdf8ad3be40 error 6
[ 9250.120743] python3[10344]: segfault at 7fde370c9288 ip 00007fdfbc2057e0 sp 00007fdfbcce3e40 error 6
[ 9250.120745] python3[10348]: segfault at 7fde74f169a8 ip 00007fdfbc2057e0 sp 00007fdf8b53ce40 error 6
[ 9250.120749] python3[10347]: segfault at 7fde657833e0 ip 00007fdfbc2057e0 sp 00007fdf8bd3de40 error 6
[ 9250.120752] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120754] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120755] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120761] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120763] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120764] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120766] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120767] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120772] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120778] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff
I tried training similar models using the DeepSpeed version of Megatron-LM instead of huggingface-transformers, and the same thing happens: training works correctly up to a certain number of parameters, but segfaults once the model is sufficiently large.
Top GitHub Comments
I read the code a bit more carefully and I think I can see why this segfault is only happening on AMD systems. If AVX512 or AVX256 instructions are available, Adam_Optimizer::Step copies the data in blocks, one TILE at a time, and then runs an extra loop to copy the remainder. If __AVX512__ and __AVX256__ are undefined (which appears to be the case on my system), then it just uses that last loop to copy all the data. But that loop tries to store the parameters in one half of _doubled_buffer, which is not big enough to handle models that exceed the size of TILE.
The way this loop works looks like it results in a difference in behavior between Intel and AMD. In the AVX code, launch_param_update is called once every TILE. However, without AVX, it only ends up being called once at the end, regardless of how big the parameter size is.
My hacky solution of changing the value of TILE is not the right answer, but I wonder if it might be possible to save some CPU memory in addition to fixing the segfault by changing the way the buffer is allocated.
There also appears to be an issue with how the availability of AVX instructions is being determined. I have a Zen 2 processor that is supposed to have AVX256, but it doesn't seem to be detected. I'm not sure if that's an issue with DeepSpeed or a configuration problem on my end, though.
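To make the pattern concrete, here is a minimal, self-contained sketch of the indexing problem described above. This is not DeepSpeed's actual code: TILE, PARAM_SIZE, and doubled_buffer are toy stand-ins chosen to make the overflow visible, and the real launch_param_update only appears in a comment.
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative stand-ins only; the real constants and buffers in DeepSpeed's
// cpu_adam sources are vastly larger.
constexpr std::size_t TILE = 8;         // capacity of one half of the double buffer
constexpr std::size_t PARAM_SIZE = 20;  // pretend parameter count, larger than TILE

int main() {
    std::vector<float> doubled_buffer(TILE);  // stands in for one half of _doubled_buffer

    // With AVX compiled in, a SIMD loop handles TILE-sized chunks first and
    // flushes the buffer to the GPU after each chunk, so rounded_size would
    // cover most of the parameters. Without AVX it stays 0 and the scalar
    // "remainder" loop below has to copy everything.
    std::size_t rounded_size = 0;

    for (std::size_t k = rounded_size; k < PARAM_SIZE; ++k) {
        if (k >= doubled_buffer.size()) {
            std::printf("write past end of buffer at k=%zu (capacity %zu)\n",
                        k, doubled_buffer.size());
            return 0;  // the real loop has no such check, so it just segfaults here
        }
        doubled_buffer[k] = 0.0f;  // stands in for the updated fp32 parameter
    }
    // Only after the loop would the single GPU copy (launch_param_update) be
    // issued, so nothing ever flushes the buffer or resets the index early.
    return 0;
}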
Here is a diff:
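Roughly the following (the exact header and the original value of TILE may differ depending on the DeepSpeed checkout):
--- a/csrc/includes/cpu_adam.h
+++ b/csrc/includes/cpu_adam.h
-#define TILE (1024 * 1024 * 1024)
+#define TILE (1557611200)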
1557611200 is the number of parameters in gpt2-xl. In itself, this patch isn’t really a solution because it’s specific to one model. If you’re training another large model, you’d have to change it to a different number.
I’ll see if I can get a proper stack trace when I have the time. The crash was, I believe, occurring in Adam_Optimizer::Step at cpu_adam.cpp:134.