model_parallel=2 with 2 GPUs on FAIR cluster: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_index_select)
🐛 Bug
Running model parallel with 2 GPUs on the FAIR cluster raises the following exception with the 1.3B_gptz model.
UPDATE: With model_parallel=2 and 8 GPUs this works, but it should not fail with 2 GPUs.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_index_select)
There is a warning in the log that might give a clue about the problem; the full log is at the bottom of the issue.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
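For context, here is how I read the numbers in that warning (an assumption on my part, not verified against the FSDP wrapping code): with the usual fairseq/metaseq convention, the data-parallel world size is the global world size divided by the model-parallel size, so 2 GPUs with model_parallel=2 leaves a data-parallel world of 1, while a reduce_scatter process group of size 2 apparently still gets passed to FSDP.

# Hedged sanity check of the sizes mentioned in the warning. Assumes the
# standard convention data_parallel_size = distributed_world_size // model_parallel_size.
distributed_world_size = 2   # --distributed-world-size 2
model_parallel_size = 2      # model_parallel_size=2 in the model config
data_parallel_size = distributed_world_size // model_parallel_size
print(data_parallel_size)    # 1 -> matches "the world size 1" in the warning;
                             # the reported reduce_scatter group size is 2, hence the mismatch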
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Login to FAIR cluster
- The environment is set up with the following steps:
  - apex commit: e1aa1fc1316a84e66869666270941265ec9cde77
  - fairscale commit: 1bc96fa8c69def6d990e42bfbd75f86146ce29bd
  - megatron: branch fairseq_v2
  - metaseq: git checkout tbmihaylov/gshard-eval-script (rebased from main, with the model config below added)
- Model: a fresh copy of the 1.3B_gptz model from Azure, registered as:
UNIDIR_LM_ROBERTA_DATA = {
# ...
"1.3B_gptz_model_parallel": gptz_sharded_config(
"/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
model_parallel_size=2
),
# ...
}
- Slurm allocation:
srun --gpus=2 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash
- Command:
export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 2
- See the error in the log (at the end of this issue).
Expected behavior
The evaluation should not fail in this configuration (model_parallel=2 with 2 GPUs).
Environment
- See the reproduction steps above.
Additional context
Full error log:
(metaseq_20220328) tbmihaylov@learnfair1844:~/metaseq-internal$ python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME} --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp --distributed-world-size 2 | tee debug.log
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=10791, distributed_rank=0, distributed_world_size=2, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '10791', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 2, 'distributed_world_size': 2}
fairseq_cfg.common.model_parallel_size:2
distributed_training.distributed_port=10791
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
[the same warning is repeated many more times in the log]
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 
'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', 
bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, 
required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, 
all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function 
load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 
'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
Loading extension module fused_mix_prec_layer_norm_cuda...
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[ 0.0014, -0.0082, -0.0032, ..., -0.0111, 0.0054, 0.0015],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0050, 0.0010, 0.0044, ..., 0.0003, -0.0001, -0.0035],
...,
[ 0.0159, 0.0042, 0.0066, ..., 0.0044, 0.0008, -0.0086],
[-0.0008, 0.0032, -0.0032, ..., -0.0060, 0.0036, 0.0086],
[-0.0092, -0.0037, -0.0013, ..., 0.0073, 0.0092, -0.0132]],
requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[-7.6732e-03, -5.4649e-03, -4.2956e-03, ..., 7.5325e-03,
7.7163e-03, 1.0300e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-2.3755e-03, 2.4894e-03, 1.4279e-05, ..., -8.2043e-03,
-1.8271e-02, 3.9899e-03],
...,
[-9.6320e-03, -8.2788e-03, -4.1433e-03, ..., -6.7774e-03,
6.1964e-03, -5.3095e-03],
[-4.4763e-03, 1.4532e-02, -6.0640e-04, ..., 1.5341e-03,
-1.8106e-03, -5.6959e-04],
[ 3.7042e-03, 5.2186e-03, -1.1615e-02, ..., -1.0039e-02,
-8.7586e-04, 7.5653e-03]], requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0059, 0.0019, -0.0075, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0023, -0.0028, 0.0170, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0030, -0.0005, 0.0028, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0077, -0.0097, 0.0007, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0011, 0.0143, -0.0066, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, -0.0069, 0.0071, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0018, 0.0052, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0046, -0.0019, -0.0044, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0047, 0.0105, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0014, 0.0070, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0068, 0.0033, -0.0046, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, 0.0013, 0.0011, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-2.7278e-03, 7.8808e-03, 6.6479e-05, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00], requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0012, 0.0047, -0.0049, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0065, 0.0002, 0.0080, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0017, 0.0030, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, 0.0132, -0.0027, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0027, 0.0103, -0.0090, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0067, -0.0047, 0.0028, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0075, 0.0114, -0.0037, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0069, 0.0069, 0.0075, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0037, 0.0070, 0.0135, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0019, 0.0082, -0.0061, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0134, 0.0073, 0.0100, ..., 0.0000, 0.0000, 0.0000],
requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1., ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
Loaded model
model_loading_time=41.0 seconds
model_loading_time_cuda=41.6 seconds
Inferring max tokens for model...
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 893, in <module>
cli_main()
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 56, in cli_main
run_evaluations_from_model_name(**vars(args))
File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
distributed_utils.call_main(
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
torch.multiprocessing.spawn(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
main(cfg, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 261, in _load_lm_and_run_func
max_tokens = get_or_infer_max_tokens(model, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 378, in get_or_infer_max_tokens
return infer_max_tokens_before_oom(model)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 416, in infer_max_tokens_before_oom
while not is_max_tokens_oom(candidate_max_tokens):
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 409, in is_max_tokens_oom
raise e
File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 405, in is_max_tokens_oom
model.score(input_texts, batch_size=local_bsz, batch_by_size=False)
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 198, in score
for hypos in self.generate(
File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 253, in generate
translations = self.task.inference_step(
File "/private/home/tbmihaylov/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py", line 387, in inference_step
return generator.generate(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/sequence_scorer.py", line 63, in generate
decoder_out = model(**net_input)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
outputs = self.module(*args, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
return self.module(*inputs, **kwinputs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 373, in forward
return self.decoder(src_tokens, **kwargs)
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 643, in forward
x, extra = self.extract_features(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 668, in extract_features
return self.extract_features_scriptable(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 706, in extract_features_scriptable
x, tok, pos = self.forward_embedding(
File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 575, in forward_embedding
positions = self.embed_positions(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/private/home/tbmihaylov/metaseq/fairseq/modules/learned_positional_embedding.py", line 53, in forward
return F.embedding(
File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)
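For what it's worth, the final frame makes this failure mode easy to reproduce in isolation: F.embedding (the call inside LearnedPositionalEmbedding.forward) raises this error whenever the index tensor and the weight table sit on different devices. This is only a minimal sketch of one plausible variant (indices left on the CPU, weight on cuda:1) with made-up tensors, not the metaseq code path, and it assumes a machine with at least two GPUs:

import torch
import torch.nn.functional as F

weight = torch.randn(100, 16, device="cuda:1")   # embedding table placed on GPU 1
positions = torch.arange(2, 10).unsqueeze(0)     # index tensor accidentally left on CPU

# Raises a RuntimeError like the one above: Expected all tensors to be on the
# same device, but found at least two devices, cpu and cuda:1!
F.embedding(positions, weight, padding_idx=1)

So the open question is which side ends up on the CPU on rank 1 in this eval path: the positions derived from the batch, or the sharded/gathered embedding weight.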
The model seems to work when model_parallel is set to 2 and 8 GPUs are used:
- Model setting: model_parallel=2
- Allocation: 8 GPUs
- Command:
- Log:
Glad you are unblocked 😃 but we should still not fail with TP=2 and world size=2. Can you update the title of the issue to reflect this? I (or someone else) can follow up.
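In case it helps whoever follows up, a blunt debugging/workaround sketch (my own guess, not tested against this exact code path; the helper name is made up): explicitly move every tensor in the batch onto the per-rank device right before scoring, which at least makes it obvious whether the inputs or the sharded weights are the tensors stuck on the CPU.

# Hypothetical sketch only: force the batch onto the model's device before
# calling score()/generate(), so any remaining mismatch must come from the weights.
import torch

def move_batch_to_model_device(net_input, model):
    device = next(model.parameters()).device
    return {
        k: v.to(device) if torch.is_tensor(v) else v
        for k, v in net_input.items()
    }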