Segfault when training large GPT2 models on single GPU
I'm trying to use DeepSpeed to fine-tune GPT2 models on a single RTX 3090 GPU. Using the scripts included with huggingface-transformers, I have been able to get it working up through the 774M model, and the ZeRO optimizations let me double the batch size. However, the CPU Adam optimizer segfaults when I try to train the 1558M model. I am using Ubuntu 20.04, CUDA 11.2, Nvidia drivers 460.32.03, and current git master versions of PyTorch, Transformers, and DeepSpeed.
Here is the script I used:
export BATCH_SIZE=1
export CUDA_VISIBLE_DEVICES=0
export CUDA_HOME=/usr/local/cuda-11.2
export TOKENIZERS_PARALLELISM=false
export MP_SIZE=1
export NUM_WORKERS=1
export NUM_GPUS_PER_WORKER=1
rm -r test_output
USE_TF=0 deepspeed --num_gpus=1 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size $BATCH_SIZE --per_device_eval_batch_size $BATCH_SIZE --fp16 --deepspeed ds_config.json
pofo-corpus.txt is the Poetry Foundation collection in a single text file (around 18MB). Here is the config file:
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 100,
    "hysteresis": 2,
    "min_loss_scale": 1e-24,
    "initial_scale_power": -2
  },
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1.8e7,
    "reduce_scatter": true,
    "reduce_bucket_size": 1.8e7,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 1e-6,
      "warmup_max_lr": 5e-5,
      "warmup_num_steps": 500
    }
  }
}
I’ve messed around with a bunch of the settings, but none of them seem to affect the issue. Here is the output:
rm: cannot remove 'test_output': No such file or directory
[2021-01-18 14:10:29,800] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-18 14:10:29,815] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --fp16 --deepspeed ds_config.json
[2021-01-18 14:10:30,261] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]}
[2021-01-18 14:10:30,261] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-01-18 14:10:30,261] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-01-18 14:10:30,261] [INFO] [launch.py:100:main] dist_world_size=1
[2021-01-18 14:10:30,261] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-01-18 14:10:31,069] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
Using custom data configuration default
Reusing dataset text (/home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,547 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,547 >> Model config GPT2Config {
"_num_labels": 1,
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.3.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,624 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,625 >> Model config GPT2Config {
"_num_labels": 1,
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.3.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/jechk/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/jechk/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/jechk/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1027] 2021-01-18 14:10:32,129 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/jechk/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1143] 2021-01-18 14:10:54,131 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1151] 2021-01-18 14:10:54,131 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-d5e960aa227f7b5e.arrow
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-1b9b3e2f092a373d.arrow
[INFO|trainer.py:442] 2021-01-18 14:10:55,458 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:359] 2021-01-18 14:10:55,458 >> Using amp fp16 backend
[INFO|integrations.py:323] 2021-01-18 14:10:55,459 >> Keeping the `scheduler` config from ds_config.json intact, ignoring any scheduler-specific cl args
[INFO|integrations.py:368] 2021-01-18 14:10:55,459 >> Keeping the `fp16` config from ds_config.json intact, ignoring any fp16-specific cl args
[2021-01-18 14:10:55,459] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10+7b07e12, git-hash=7b07e12, git-branch=master
[2021-01-18 14:10:55,472] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jechk/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.20410561561584473 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-01-18 14:10:57,968] [INFO] [engine.py:540:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-01-18 14:10:57,968] [INFO] [engine.py:545:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
amsgrad: False
betas: [0.9, 0.999]
bias_correction: True
eps: 1e-08
lr: 5e-05
weight_decay: 0.0
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-01-18 14:10:57,968] [INFO] [engine.py:661:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/jechk/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1086723804473877 seconds
[2021-01-18 14:10:58,077] [INFO] [stage2.py:130:__init__] Reduce bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:131:__init__] Allgather bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 1557611200
[2021-01-18 14:11:03,591] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-01-18 14:11:03,591] [INFO] [engine.py:575:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fe0f0994d30>
[2021-01-18 14:11:03,591] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fdfbc497ee0>
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-01-18 14:11:03,591] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe0f0994a60>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] allreduce_always_fp32 ........ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] amp_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] amp_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] checkpoint_tag_validation_enabled True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] checkpoint_tag_validation_fail False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] disable_allgather ............ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] dump_state ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] dynamic_loss_scale_args ...... {'init_scale': 0.25, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1e-24}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] elasticity_enabled ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe0f0994ac0>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] fp16_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] global_rank .................. 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_accumulation_steps .. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_clipping ............ 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] gradient_predivide_factor .... 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] initial_dynamic_scale ........ 0.25
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] loss_scale ................... 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] memory_breakdown ............. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_legacy_fusion ...... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_name ............... adamw
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pld_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] pld_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] prescale_gradients ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] scheduler_name ............... WarmupLR
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 500}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] sparse_attention ............. None
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] sparse_gradients_enabled ..... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] steps_per_print .............. 10
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_enabled .......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] tensorboard_output_path ......
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] train_batch_size ............. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] train_micro_batch_size_per_gpu 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] wall_clock_breakdown ......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] world_size ................... 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_allow_untested_optimizer True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_config .................. {
"allgather_bucket_size": 18000000.0,
"allgather_partitions": true,
"contiguous_gradients": true,
"cpu_offload": true,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": true,
"reduce_bucket_size": 18000000.0,
"reduce_scatter": true,
"stage": 2
}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print] zero_optimization_stage ...... 2
[2021-01-18 14:11:03,592] [INFO] [config.py:739:print] json = {
"fp16":{
"enabled":true,
"hysteresis":2,
"initial_scale_power":-2,
"loss_scale":0,
"loss_scale_window":100,
"min_loss_scale":1e-24
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"optimizer":{
"params":{
"betas":[
0.9,
0.999
],
"eps":1e-08,
"lr":5e-05,
"weight_decay":0.0
},
"type":"AdamW"
},
"scheduler":{
"params":{
"warmup_max_lr":5e-05,
"warmup_min_lr":1e-06,
"warmup_num_steps":500
},
"type":"WarmupLR"
},
"train_micro_batch_size_per_gpu":1,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"allgather_bucket_size":18000000.0,
"allgather_partitions":true,
"contiguous_gradients":true,
"cpu_offload":true,
"overlap_comm":true,
"reduce_bucket_size":18000000.0,
"reduce_scatter":true,
"stage":2
}
}
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028705596923828125 seconds
[INFO|trainer.py:810] 2021-01-18 14:11:03,643 >> ***** Running training *****
[INFO|trainer.py:811] 2021-01-18 14:11:03,643 >> Num examples = 4917
[INFO|trainer.py:812] 2021-01-18 14:11:03,643 >> Num Epochs = 3
[INFO|trainer.py:813] 2021-01-18 14:11:03,643 >> Instantaneous batch size per device = 1
[INFO|trainer.py:814] 2021-01-18 14:11:03,643 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:815] 2021-01-18 14:11:03,643 >> Gradient Accumulation steps = 1
[INFO|trainer.py:816] 2021-01-18 14:11:03,643 >> Total optimization steps = 14751
2021-01-18 14:11:03.737646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
0%| | 0/14751 [00:00<?, ?it/s][W reducer.cpp:1042] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
The program then exits abruptly. The segfault is reported in dmesg:
[ 9250.120732] python3[10345]: segfault at 7fde4685c850 ip 00007fdfbc2057e0 sp 00007fdf8cd3fe40 error 6
[ 9250.120738] python3[10349]: segfault at 7fde846a9f70 ip 00007fdfbc2057e0 sp 00007fdf8ad3be40 error 6
[ 9250.120743] python3[10344]: segfault at 7fde370c9288 ip 00007fdfbc2057e0 sp 00007fdfbcce3e40 error 6
[ 9250.120745] python3[10348]: segfault at 7fde74f169a8 ip 00007fdfbc2057e0 sp 00007fdf8b53ce40 error 6
[ 9250.120749] python3[10347]: segfault at 7fde657833e0 ip 00007fdfbc2057e0 sp 00007fdf8bd3de40 error 6
[ 9250.120752] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120754] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120755] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120761] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120763] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120764] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120766] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120767] in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120772] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120778] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff
I tried training similar models using the DeepSpeed version of Megatron-LM instead of huggingface-transformers, and the same thing happens: training works correctly up to a certain number of parameters, but segfaults once the model is sufficiently large.
Top GitHub Comments
I read the code a bit more carefully and I think I can see why this segfault is only happening on AMD systems. If AVX512 or AVX256 instructions are available, Adam_Optimizer::Step copies the data in blocks, one TILE at a time, and then runs an extra loop to copy the remainder. If __AVX512__ and __AVX256__ are undefined (which appears to be the case on my system), then it just uses that last loop to copy all the data. But that loop tries to store the parameters in one half of _doubled_buffer, which is not big enough to handle models that exceed the size of TILE.
The way this loop works looks like it results in a difference in behavior between Intel and AMD. In the AVX code, launch_param_update is called once every TILE. However, without AVX, it only ends up being called once at the end, regardless of how big the parameter size is.
My hacky solution of changing the value of TILE is not the right answer, but I wonder if it might be possible to save some CPU memory in addition to fixing the segfault by changing the way the buffer is allocated.
There also appears to be an issue with how the availability of AVX instructions is being determined. I have a Zen 2 processor that is supposed to have AVX256, but it doesn't seem to be detected. I'm not sure if that's an issue with DeepSpeed or a configuration problem on my end, though.
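To make the pattern concrete, here is a minimal, self-contained sketch of the indexing problem described above. This is not DeepSpeed's actual code: TILE, PARAM_SIZE, and doubled_buffer are toy stand-ins chosen to make the overflow visible, and the real launch_param_update only appears in a comment.
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative stand-ins only; the real constants and buffers in DeepSpeed's
// cpu_adam sources are vastly larger.
constexpr std::size_t TILE = 8;         // capacity of one half of the double buffer
constexpr std::size_t PARAM_SIZE = 20;  // pretend parameter count, larger than TILE

int main() {
    std::vector<float> doubled_buffer(TILE);  // stands in for one half of _doubled_buffer

    // With AVX compiled in, a SIMD loop handles TILE-sized chunks first and
    // flushes the buffer to the GPU after each chunk, so rounded_size would
    // cover most of the parameters. Without AVX it stays 0 and the scalar
    // "remainder" loop below has to copy everything.
    std::size_t rounded_size = 0;

    for (std::size_t k = rounded_size; k < PARAM_SIZE; ++k) {
        if (k >= doubled_buffer.size()) {
            std::printf("write past end of buffer at k=%zu (capacity %zu)\n",
                        k, doubled_buffer.size());
            return 0;  // the real loop has no such check, so it just segfaults here
        }
        doubled_buffer[k] = 0.0f;  // stands in for the updated fp32 parameter
    }
    // Only after the loop would the single GPU copy (launch_param_update) be
    // issued, so nothing ever flushes the buffer or resets the index early.
    return 0;
}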
Here is a diff:
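Roughly the following (the exact header and the original value of TILE may differ depending on the DeepSpeed checkout):
--- a/csrc/includes/cpu_adam.h
+++ b/csrc/includes/cpu_adam.h
-#define TILE (1024 * 1024 * 1024)
+#define TILE (1557611200)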
1557611200 is the number of parameters in gpt2-xl. In itself, this patch isn’t really a solution because it’s specific to one model. If you’re training another large model, you’d have to change it to a different number.
I’ll see if I can get a proper stack trace when I have the time. The crash was, I believe, occurring in Adam_Optimizer::Step at cpu_adam.cpp:134.