Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[BUG] RuntimeError: Ninja is required to load C++ extensions

See original GitHub issue

Hi,

I am getting the following error when running pretrain_gpt.sh


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja … [OKAY]

op name … installed … compatible
cpu_adam … [NO] … [OKAY]
cpu_adagrad … [NO] … [OKAY]
fused_adam … [NO] … [OKAY]
fused_lamb … [NO] … [OKAY]
sparse_attn … [NO] … [OKAY]
transformer … [NO] … [OKAY]
stochastic_transformer . [NO] … [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io … [NO] … [NO]
transformer_inference … [NO] … [OKAY]
utils … [NO] … [OKAY]
quantizer … [NO] … [OKAY]

DeepSpeed general environment info:
torch install path … ['/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch']
torch version … 1.8.2+cu111
torch cuda version … 11.1
nvcc version … 11.1
deepspeed install path … ['/qfs/people/shar703/scripts/mega_ai/deepspeed_megatron/DeepSpeed/deepspeed']
deepspeed info … 0.5.9+1d295ff, 1d295ff, master
deepspeed wheel compiled w. … torch 1.8, cuda 11.1
**** Git info for Megatron: git_hash=1ac4a44 git_branch=main ****
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters …
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 … False
  adam_beta1 … 0.9
  adam_beta2 … 0.999
  adam_eps … 1e-08
  adlr_autoresume … False
  adlr_autoresume_interval … 1000
  apply_query_key_layer_scaling … True
  apply_residual_connection_post_layernorm … False
  attention_dropout … 0.1
  attention_softmax_in_fp32 … False
  bert_binary_head … True
  bert_load … None
  bf16 … False
  bias_dropout_fusion … True
  bias_gelu_fusion … True
  biencoder_projection_dim … 0
  biencoder_shared_query_context_model … False
  block_data_path … None
  checkpoint_activations … True
  checkpoint_in_cpu … False
  checkpoint_num_layers … 1
  clip_grad … 1.0
  consumed_train_samples … 0
  consumed_train_tokens … 0
  consumed_valid_samples … 0
  contigious_checkpointing … False
  cpu_optimizer … False
  cpu_torch_adam … False
  curriculum_learning … False
  data_impl … infer
  data_parallel_size … 1
  data_path … ['cord19/chemistry_cord19_abstract_document']
  dataloader_type … single
  DDP_impl … local
  decoder_seq_length … None
  deepscale … False
  deepscale_config … None
  deepspeed … False
  deepspeed_activation_checkpointing … False
  deepspeed_config … None
  deepspeed_mpi … False
  distribute_checkpointed_activations … False
  distributed_backend … nccl
  embedding_path … None
  encoder_seq_length … 1024
  eod_mask_loss … False
  eval_interval … 100
  eval_iters … 10
  evidence_data_path … None
  exit_duration_in_mins … None
  exit_interval … None
  ffn_hidden_size … 4096
  finetune … False
  fp16 … True
  fp16_lm_cross_entropy … False
  fp32_residual_connection … False
  global_batch_size … 8
  hidden_dropout … 0.1
  hidden_size … 1024
  hysteresis … 2
  ict_head_size … None
  ict_load … None
  img_dim … 224
  indexer_batch_size … 128
  indexer_log_interval … 1000
  init_method_std … 0.02
  init_method_xavier_uniform … False
  initial_loss_scale … 4294967296
  kv_channels … 64
  layernorm_epsilon … 1e-05
  lazy_mpu_init … None
  load … checkpoints/gpt2_345m
  local_rank … None
  log_batch_size_to_tensorboard … False
  log_interval … 10
  log_learning_rate_to_tensorboard … True
  log_loss_scale_to_tensorboard … True
  log_num_zeros_in_grad … False
  log_params_norm … False
  log_timers_to_tensorboard … False
  log_validation_ppl_to_tensorboard … False
  loss_scale … None
  loss_scale_window … 1000
  lr … 0.00015
  lr_decay_iters … 320000
  lr_decay_samples … None
  lr_decay_style … cosine
  lr_decay_tokens … None
  lr_warmup_fraction … 0.01
  lr_warmup_iters … 0
  lr_warmup_samples … 0
  make_vocab_size_divisible_by … 128
  mask_prob … 0.15
  masked_softmax_fusion … True
  max_position_embeddings … 1024
  memory_centric_tiled_linear … False
  merge_file … …/deepspeed_megatron/gpt_files/gpt2-merges.txt
  micro_batch_size … 4
  min_loss_scale … 1.0
  min_lr … 0.0
  mmap_warmup … False
  no_load_optim … None
  no_load_rng … None
  no_save_optim … None
  no_save_rng … None
  num_attention_heads … 16
  num_channels … 3
  num_classes … 1000
  num_layers … 24
  num_layers_per_virtual_pipeline_stage … None
  num_workers … 2
  onnx_safe … None
  openai_gelu … False
  optimizer … adam
  override_lr_scheduler … False
  params_dtype … torch.float16
  partition_activations … False
  patch_dim … 16
  pipeline_model_parallel_size … 1
  profile_backward … False
  query_in_block_prob … 0.1
  rampup_batch_size … None
  rank … 0
  remote_device … none
  reset_attention_mask … False
  reset_position_ids … False
  retriever_report_topk_accuracies … []
  retriever_score_scaling … False
  retriever_seq_length … 256
  sample_rate … 1.0
  save … checkpoints/gpt2_345m
  save_interval … 500
  scatter_gather_tensors_in_pipeline … True
  scattered_embeddings … False
  seed … 1234
  seq_length … 1024
  sgd_momentum … 0.9
  short_seq_prob … 0.1
  split … 969, 30, 1
  split_transformers … False
  synchronize_each_layer … False
  tensor_model_parallel_size … 1
  tensorboard_dir … None
  tensorboard_log_interval … 1
  tensorboard_queue_size … 1000
  tile_factor … 1
  titles_data_path … None
  tokenizer_type … GPT2BPETokenizer
  train_iters … 500000
  train_samples … None
  train_tokens … None
  use_checkpoint_lr_scheduler … False
  use_contiguous_buffers_in_ddp … False
  use_cpu_initialization … None
  use_one_sent_docs … False
  use_pin_memory … False
  virtual_pipeline_model_parallel_size … None
  vocab_extra_ids … 0
  vocab_file … …/deepspeed_megatron/gpt_files/gpt2-vocab.json
  weight_decay … 0.01
  world_size … 1
  zero_allgather_bucket_size … 0.0
  zero_contigious_gradients … False
  zero_reduce_bucket_size … 0.0
  zero_reduce_scatter … False
  zero_stage … 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2

building GPT2BPETokenizer tokenizer …
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed …
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 …
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder …
make: Entering directory '/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'

done with dataset index builder. Compilation time: 0.051 seconds
compiling and loading fused kernels …
Traceback (most recent call last):
  File "/people/shar703/anaconda3/envs/deepspeed/bin/ninja", line 33, in <module>
    sys.exit(load_entry_point('ninja', 'console_scripts', 'ninja')())
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 51, in ninja
    raise SystemExit(_program('ninja', sys.argv[1:]))
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 47, in _program
    return subprocess.call([os.path.join(BIN_DIR, name)] + args)
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/data/bin/ninja'
Traceback (most recent call last):
  File "pretrain_gpt.py", line 231, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 96, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 137, in _compile_dependencies
    fused_kernels.load(args)
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 71, in load
    scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 47, in _cpp_extention_load_helper
    return cpp_extension.load(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
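
Reading the two tracebacks together points at the likely root cause: before torch raises "Ninja is required to load C++ extensions", the ninja console script itself dies with a PermissionError when it tries to run the bundled binary under .../ninja/data/bin/ninja, so that binary is most likely missing its executable bit (or sits on a filesystem mounted noexec, which is common on HPC home directories). A minimal diagnostic sketch for confirming this in the same environment; the attributes it relies on (ninja.BIN_DIR and torch.utils.cpp_extension.is_ninja_available) are the ones visible in the traceback and in the cpp_extension excerpt further down this page:

    # Minimal diagnostic sketch; run it inside the same conda env that raised the error.
    import os
    import shutil

    import ninja                               # the pip/conda "ninja" package from the traceback
    from torch.utils import cpp_extension

    # Does any ninja executable resolve on PATH at all?
    print("ninja on PATH:", shutil.which("ninja"))

    # The console script delegates to this bundled binary (the one in the PermissionError).
    bundled = os.path.join(ninja.BIN_DIR, "ninja")
    print("bundled binary:", bundled)
    print("is executable:", os.access(bundled, os.X_OK))

    # Mirrors the check that verify_ninja_availability() performs before raising the RuntimeError.
    print("torch sees ninja:", cpp_extension.is_ninja_available())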

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

2 reactions
chinoll commented, May 18, 2022

@chinoll I have a similar problem. Where exactly do you add the …/bin/ninja path in the torch/utils/cpp_extension.py file?

[screenshot attached in the original comment]

1 reaction
chinoll commented, May 16, 2022

A temporary solution is to manually add the path of ninja to the PATH environment variable in the torch/utils/cpp_extension.py file
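
A minimal sketch of that workaround, assuming the pip-installed ninja package from the traceback is present but its bundled binary is neither executable nor on PATH. Rather than editing torch/utils/cpp_extension.py in place, the same effect can be achieved by running something like this once (or putting the PATH line at the top of the training script, before Megatron compiles its fused kernels):

    # Hedged sketch of the workaround described above -- not an official fix.
    import os
    import shutil
    import stat

    import ninja  # pip/conda "ninja" package; BIN_DIR is its .../ninja/data/bin directory

    ninja_exe = os.path.join(ninja.BIN_DIR, "ninja")

    # The PermissionError in the issue suggests the bundled binary lacks the executable bit.
    mode = os.stat(ninja_exe).st_mode
    os.chmod(ninja_exe, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

    # Prepend the directory to PATH so torch.utils.cpp_extension can find a runnable ninja.
    os.environ["PATH"] = ninja.BIN_DIR + os.pathsep + os.environ.get("PATH", "")

    print(shutil.which("ninja"))  # should now resolve to the bundled binary

Note that the PATH change only affects the current process, so it has to run in the same process before the fused kernels are built; alternatively, reinstalling ninja (for example via conda install ninja or the system package manager) typically puts a working binary on PATH without patching any torch files.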

Read more comments on GitHub >

Top Results From Across the Web

Ninja is required to load C++ extensions - Stack Overflow
I download a program and want to run it. It's from github. I enter the experiments/segmentation/ folder, and input "python test.py --dataset ...
Read more >
RuntimeError: Ninja is required to load C++ extension #167
Hi, author. I have followed the instructions on your page. I got your code by git clone, and run "python setup.py install" with no...
Read more >
"RuntimeError: Ninja is required to load C++ extension" BUT ...
As the title suggest, I get this error when the Streamlit app calls a function which uses a library dependent on PyTorch.
Read more >
Custom C++ and CUDA Extensions - PyTorch
BuildExtension performs a number of required configuration steps and checks and also manages mixed compilation in the case of mixed C++/CUDA extensions. And ......
Read more >
Source code for torch.utils.cpp_extension - MMCV
if not is_ninja_available(): raise RuntimeError("Ninja is required to load C++ extensions") def _prepare_ldflags(extra_ldflags, with_cuda, verbose, ...
Read more >

