Adafactor does not work with ResNets (or with MAML)
Environment info
- `transformers` version: 4.10.3
- Platform: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.7
- PyTorch version (GPU?): 1.9.1+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes (NVIDIA GeForce …)
- Using distributed or parallel set-up in script?: no
Information
Model I am using (Bert, XLNet …): none; the base model is a ResNet-12 (see below)
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I am running the MAML meta-learning algorithm (implemented with higher) with a ResNet. My script fails with the error pasted below (a minimal standalone reproduction is also sketched after the steps). Is Adafactor not supposed to work with ResNets or other non-Transformer models?
Steps to reproduce the behavior:
- run this code: https://github.com/brando90/higher/blob/master/examples/maml-omniglot.py (it already has adafactor)
- if that works, uncomment the resnet12 line and ping me, please
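To make the failure easy to reproduce without MAML or higher, here is a minimal standalone sketch (assuming transformers 4.10.3 as in the environment above; the single Conv2d layer is a hypothetical stand-in for any 4-D conv weight in the resnet12):

```python
import torch
from transformers.optimization import Adafactor

# One conv layer stands in for the resnet12: its weight is a 4-D tensor
# of shape [64, 3, 3, 3], which is what trips Adafactor's factored update.
conv = torch.nn.Conv2d(3, 64, kernel_size=3)
opt = Adafactor(conv.parameters(), lr=None, relative_step=True,
                scale_parameter=True, warmup_init=True)

x = torch.randn(2, 3, 8, 8)
conv(x).sum().backward()
opt.step()  # expected: RuntimeError: mat1 must be a matrix, got 4-D tensor
```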
Expected behavior
I expect training to go smoothly, but instead I get:
--------------------- META-TRAIN ------------------------
Starting training!
Traceback (most recent call last):
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 441, in <module>
main_resume_from_checkpoint(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 403, in main_resume_from_checkpoint
run_training(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 413, in run_training
meta_train_fixed_iterations(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 233, in meta_train_fixed_iterations
args.outer_opt.step()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/transformers/optimization.py", line 577, in step
update = self._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/transformers/optimization.py", line 508, in _approx_sq_grad
return torch.mm(r_factor.unsqueeze(-1), c_factor.unsqueeze(0))
RuntimeError: mat1 must be a matrix, got 4-D tensor
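For context, the failing helper reconstructs Adafactor's factored second-moment estimate with torch.mm(r_factor.unsqueeze(-1), c_factor.unsqueeze(0)), and torch.mm only accepts 2-D matrices, so any parameter with more than two dimensions (such as an [out, in, kH, kW] conv kernel) breaks. Below is a sketch of a shape-agnostic reconstruction using broadcasting, the approach taken by some other Adafactor implementations (e.g. fairseq's); note this is not the transformers code from the traceback:

```python
import torch

def approx_sq_grad_broadcast(exp_avg_sq_row, exp_avg_sq_col):
    # Rebuild the rank-1 factored estimate of the squared gradient.
    # For a gradient of shape [..., R, C], the row statistic has shape
    # [..., R] and the column statistic [..., C]; broadcasting over the
    # leading dims replaces torch.mm, so 4-D conv kernels work too.
    r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1, keepdim=True)).rsqrt().unsqueeze(-1)
    c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
    return torch.mul(r_factor, c_factor)  # shape [..., R, C]
```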
Full error output:
('PID', '25721')
('always_use_deterministic_algorithms', False)
('args_hardcoded_in_script', False)
('base_model_mode', 'resnet12_rsf')
('best_val_loss', inf)
('condor_jobid', -1)
('copy_initial_weights', False)
('current_logs_path', '/home/miranda9/data/logs/logs_Nov05_15-44-03_jobid_668')
('current_time', 'Nov30_08-42-53')
('data_path', 'miniimagenet')
('debug', False)
('debug_test', False)
('device', device(type='cuda'))
('epoch_num', -1)
('eval_iters', 2)
('experiment_name', 'debug')
('fo', False)
('force_log', True)
('githash', '9af491c')
('githash_long', '9af491ccd13fa88f4d07287f54305488ba4967fc')
('githash_short', '9af491c')
('gpu_name', 'NVIDIA GeForce GTX TITAN X')
('grad_clip_mode', None)
('grad_clip_rate', None)
('hostname', 'vision-02.cs.illinois.edu')
('inner_debug_eval', False)
('inner_debug_train', False)
('inner_lr', 0.1)
('it', 0)
('jobid', 10340)
('k_eval', 15)
('k_shots', 5)
('log_root', PosixPath('/home/miranda9/data/logs/logs_Nov30_08-42-53_jobid_10340'))
('log_to_wandb', True)
('log_train_freq', 200)
('log_val_freq', 200)
('logger', <uutils.logger.Logger object at 0x2b832f5eff70>)
('logging', True)
('mail_user', 'brando.science@gmail.com')
('master_port', '37126')
('meta_batch_size_eval', 2)
('meta_batch_size_train', 2)
('meta_learner', 'maml_fixed_inner_lr')
('metrics_as_dist', False)
('my_stdout_filepath', '/home/miranda9/data/logs/logs_Nov05_15-44-03_jobid_668/my_stdout.log')
('n_classes', 5)
('nb_inner_train_steps', 4)
('nccl', 2708)
('num_epochs', -1)
('num_its', 3)
('num_workers', 4)
('outer_debug', False)
('outer_lr', 0.001)
('path_to_checkpoint', PosixPath('/home/miranda9/data_folder_fall2020_spring2021/logs/nov_all_mini_imagenet_expts/logs_Nov05_15-44-03_jobid_668'))
('pin_memory', False)
('pw_path', '/home/miranda9/pw_app.config.json')
('rank', -1)
('run_name', 'debug (Adafactor) : args.jobid=10340')
('save_ckpt', True)
('seed', None)
('serial', False)
('show_layerwise_sims', False)
('sim_compute_parallel', False)
('slurm_array_task_id', -1)
('slurm_jobid', 10340)
('split', 'train')
('tb', True)
('track_higher_grads', True)
('train_iters', 500000)
('trainin_with_epochs', False)
('training_mode', 'iterations')
('wandb_entity', 'brando')
('wandb_group', 'experiment_debug')
('wandb_project', 'sl_vs_ml_iclr_workshop_paper')
------- Main Resume from Checkpoint --------
args.base_model=ResNet(
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(64, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(160, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(160, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(320, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(320, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(avgpool): AdaptiveAvgPool2d(output_size=1)
(dropout): Dropout(p=0.0, inplace=False)
(classifier): Linear(in_features=640, out_features=5, bias=True)
)
args.outer_opt=Adafactor (
Parameter Group 0
beta1: None
clip_threshold: 1.0
decay_rate: -0.8
eps: (1e-30, 0.001)
lr: None
relative_step: True
scale_parameter: True
warmup_init: True
weight_decay: 0.0
)
args.meta_learner=MAMLMetaLearner(
(base_model): ResNet(
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(64, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(160, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(160, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(320, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): LeakyReLU(negative_slope=0.1)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(downsample): Sequential(
(0): Conv2d(320, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(640, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(DropBlock): DropBlock()
)
)
(avgpool): AdaptiveAvgPool2d(output_size=1)
(dropout): Dropout(p=0.0, inplace=False)
(classifier): Linear(in_features=640, out_features=5, bias=True)
)
)
args.scheduler=None
--------------------- META-TRAIN ------------------------
Starting training!
Traceback (most recent call last):
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 441, in <module>
main_resume_from_checkpoint(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 403, in main_resume_from_checkpoint
run_training(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 413, in run_training
meta_train_fixed_iterations(args)
File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 233, in meta_train_fixed_iterations
args.outer_opt.step()
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/transformers/optimization.py", line 577, in step
update = self._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/transformers/optimization.py", line 508, in _approx_sq_grad
return torch.mm(r_factor.unsqueeze(-1), c_factor.unsqueeze(0))
RuntimeError: mat1 must be a matrix, got 4-D tensor
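For reference, the Adafactor configuration printed in the log above corresponds roughly to the following construction (a sketch, not the literal line from my script; args.base_model is the resnet12 shown in the log):

```python
from transformers.optimization import Adafactor

# Matches the printed hyperparameters: lr=None together with
# relative_step=True lets Adafactor derive its own step size;
# the remaining printed values are the library defaults.
outer_opt = Adafactor(
    args.base_model.parameters(),
    lr=None,
    relative_step=True,
    scale_parameter=True,
    warmup_init=True,
    weight_decay=0.0,
)
```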
Hi @brando90,
transformers is meant as a library of model architectures more than a library of optimizers, and we’re actively moving away from maintaining optimizers. We’d rather you rely on a library that actively maintains them, as the support should be both broader (not tested only on transformers, as it is here) and more complete (not limited to the two optimizers we support here). Some that come to mind are pytorch-optimizer or Fairseq.
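As a hedged sketch of that suggestion: fairseq ships an Adafactor whose factored update handles N-D parameters, and its constructor accepts arguments mirroring the ones used in this report (assuming fairseq is installed; the import path below is fairseq's, not transformers'):

```python
from fairseq.optim.adafactor import Adafactor  # pip install fairseq

# `model` is the MAML base model (e.g. the resnet12 from the report).
outer_opt = Adafactor(
    model.parameters(),
    lr=None,
    relative_step=True,
    scale_parameter=True,
    warmup_init=True,
)
```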
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.