
I used distributed training following the instructions here: https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training.

However, I got a local-rank argument error:

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--tokenizer {nltk,space,moses}] [--bpe {bert,bytes,hf_byte_bpe,characters,fastbpe,sentencepiece,subword_nmt,gpt2,byte_bpe}] [--criterion {wav2vec,cross_entropy,nat_loss,sentence_prediction,composite_loss,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,adaptive_loss,sentence_ranking,ctc,masked_lm,vocab_parallel_cross_entropy}] [--optimizer {adagrad,adamax,adadelta,sgd,adafactor,lamb,nag,adam}] [--lr-scheduler {fixed,inverse_sqrt,tri_stage,triangular,cosine,reduce_lr_on_plateau,polynomial_decay}] [--scoring {sacrebleu,bleu,wer,chrf}] [--task TASK] [--num-workers NUM_WORKERS] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--local-rank LOCAL_RANK] [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--broadcast-buffers] [--distributed-wrapper {DDP,SlowMo}] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--checkpoint-activations] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--pooler-dropout D] [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--spectral-norm-classification-head] [-s SRC] [-t TARGET] [--load-alignments] [--left-pad-source BOOL] [--left-pad-target BOOL] [--max-source-positions N] [--max-target-positions N] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] [--num-batch-buckets N] [--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK] [--eval-bleu-detok-args JSON] [--eval-tokenized-bleu] [--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]] [--eval-bleu-args JSON] [--eval-bleu-print-samples] [--label-smoothing D] [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--force-anneal N] [--warmup-updates N] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS] [--unk UNK] data

fairseq-train: error: unrecognized arguments: --local_rank=3

It seems that fairseq expects --local-rank, but in practice the launcher passed --local_rank.

Is there a solution to this?

Thanks.

What’s your environment?

fairseq (master: Nov 4, 2020)

  • PyTorch version: 1.6
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source): pip install
  • Python version: 3.6

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

5 reactions
magiczixiao commented, Nov 26, 2021

I encountered the same problem when using fairseq-hydra-train to pretrain a wav2vec 2.0 model:

fairseq-hydra-train: error: unrecognized arguments: --local_rank=0

Here is the command:

python -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=2 --node_rank=0 --master_addr="192.168.24.42" \
    --master_port=12345 \
    ./fairseq-hydra-train task.data=my_data_set \
    --config-dir ./fairseq-main/examples/wav2vec/config/pretraining \
    --config-name my_config

Could you give some advice on how I can use fairseq-hydra-train to train on multiple nodes? Extremely grateful.
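One possible way around the unrecognized --local_rank here is a sketch, not an official fairseq recipe: keep the torch.distributed.launch command above, but point it at a small wrapper that translates the injected flag into a Hydra override before handing control to fairseq. The file name hydra_launch.py is made up for illustration; fairseq_cli.hydra_train.cli_main is the function behind the fairseq-hydra-train console script, and distributed_training.device_id is assumed to be the matching config field for the per-node GPU index.

# hydra_launch.py -- hypothetical wrapper around fairseq-hydra-train (sketch).
# torch.distributed.launch appends --local_rank=N to the command line, which the
# Hydra-based entry point does not recognize; rewrite it into a Hydra override.
import re
import sys

from fairseq_cli.hydra_train import cli_main  # entry point behind fairseq-hydra-train

if __name__ == "__main__":
    sys.argv = [
        re.sub(r"^--local_rank=(\d+)$", r"distributed_training.device_id=\1", arg)
        for arg in sys.argv
    ]
    cli_main()

It would be launched the same way as the command above, with ./fairseq-hydra-train replaced by hydra_launch.py.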

1 reaction
myleott commented, Nov 6, 2020

Ah yeah, python -m torch.distributed.launch will only populate --local_rank (with an underscore): https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py#L268

@alexeib, can we add an alias?
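Until such an alias exists, one workaround sketch is to launch a tiny wrapper that rewrites the flag before fairseq parses it. This assumes only that fairseq_cli.train.cli_main is importable (it is the function behind the fairseq-train console script); the file name launch_fairseq_train.py is made up for illustration.

# launch_fairseq_train.py -- hypothetical wrapper, a minimal sketch.
# torch.distributed.launch appends --local_rank=N; fairseq-train only defines
# --local-rank, so rewrite the flag before handing argv to fairseq.
import sys

from fairseq_cli.train import cli_main  # entry point behind fairseq-train

if __name__ == "__main__":
    sys.argv = [
        arg.replace("--local_rank", "--local-rank", 1) if arg.startswith("--local_rank") else arg
        for arg in sys.argv
    ]
    cli_main()

It would be invoked through the same python -m torch.distributed.launch command as in the docs, pointing at this file instead of fairseq-train, with the usual training arguments after it.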

Read more comments on GitHub >

Top Results From Across the Web

Why did my Local Rank Tracker report fail to complete?
This is the most common reason for the failure of Local Rank Tracker reports. When we try to set the search location in...
Read more >
How to Troubleshoot Local Ranking Failures [Updated for 2018]
Miriam Ellis uncovers the most common causes of lost or low local rankings in the Google SERPs and helps you diagnose just what...
Read more >
Error: unrecognized arguments: --local_rank=1 - distributed
Error: unrecognized arguments: --local_rank=1. I have a single machine with two GPUs. This error occurred when I used this command 'CUDA_VISIBLE_ ...
Read more >
Local rank conflict when training on multi-node multi-gpu ...
When training using deepspeed stage 3 in a multi-node environment with multiple GPUs on each node, an error regarding the LOCAL_RANK environment ......
Read more >
What does local rank mean in distributed deep learning?
After reading some materials from distributed computation I guess that local_rank is like an ID for a machine. And 0 may mean this...
Read more >
