Error finetuning from pretrained checkpoint

Hi all, I’m running into an error when trying to fine-tune from one of the pretrained checkpoints.

Code

!mkdir "$output"
!wget -q -O "$output/checkpoint.pth" https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth

!python -m torch.distributed.launch \
  --nproc_per_node=1 ./dino/main_dino.py \
  --arch deit_small \
  --data_path "$input" \
  --output_dir "$output"

Error

| distributed init (rank 0): env://
git:
  sha: 8aa93fdc90eae4b183c4e3c005174a9f634ecfbf, status: clean, branch: main

arch: deit_small
batch_size_per_gpu: 64
...
...
Student and Teacher are built: they are both deit_small network.
Loss, optimizer and schedulers ready.
Found checkpoint at ./drive/MyDrive/DINO/checkpoint.pth
=> failed to load student from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load teacher from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load optimizer from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load fp16_scaler from checkpoint './drive/MyDrive/DINO/checkpoint.pth'
=> failed to load dino_loss from checkpoint './drive/MyDrive/DINO/checkpoint.pth'

Any suggestions would be very much appreciated.

Top GitHub Comments

yadamonk commented, May 13, 2021

Hi @ymathildecaron31

Thank you so much for your wonderful work and all the time you’re putting into helping others build on it.

yadamonk commented, May 8, 2021

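It looks like the checkpoints were trained with a slightly different version of the released code, so some of their keys have different names. Luckily, the affected keys are easy to rename. Note that the script below starts from the full checkpoint file (dino_deitsmall16_pretrain_full_checkpoint.pth), not the dino_deitsmall16_pretrain.pth file used in the original post.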
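
One way to confirm a naming mismatch like this is to compare the key names stored in the checkpoint with the ones the current model expects. The sketch below does that on plain lists of names so it runs without PyTorch. The function name `diff_keys` and the sample keys are assumptions made up for illustration, not copied from the real checkpoint. In practice the two lists would come from something like `checkpoint['student'].keys()` and the model's `state_dict().keys()`.

```python
def diff_keys(checkpoint_keys, model_keys):
    """Return (missing, unexpected) key names, each as a sorted list."""
    checkpoint_keys = set(checkpoint_keys)
    model_keys = set(model_keys)
    # Expected by the model but absent from the checkpoint.
    missing = sorted(model_keys - checkpoint_keys)
    # Present in the checkpoint but unknown to the model.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative key names only (assumed for this example).
checkpoint_keys = [
    "backbone.cls_token",
    "head.projection_head.0.weight",
    "head.prototypes.weight",
]
model_keys = [
    "backbone.cls_token",
    "head.mlp.0.weight",
    "head.last_layer.weight",
]

missing, unexpected = diff_keys(checkpoint_keys, model_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

If every name in `unexpected` has an obvious counterpart in `missing`, renaming the keys is likely enough.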

!wget -q -O "checkpoint.pth" https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain_full_checkpoint.pth

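import gc
import torch

# Load the full checkpoint on the CPU so the conversion needs no GPU.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")

# Student: rename the old head keys and add the 'module.' prefix
# (the prefix a model gets when it is wrapped in DistributedDataParallel).
student = {}
for key, value in checkpoint['student'].items():
    if "projection_head" in key:
        student['module.' + key.replace("projection_head", "mlp")] = value
    elif "prototypes" in key:
        student['module.' + key.replace("prototypes", "last_layer")] = value
    else:
        student['module.' + key] = value

# Teacher: same renaming, but without the 'module.' prefix.
teacher = {}
for key, value in checkpoint['teacher'].items():
    if "projection_head" in key:
        teacher[key.replace("projection_head", "mlp")] = value
    elif "prototypes" in key:
        teacher[key.replace("prototypes", "last_layer")] = value
    else:
        teacher[key] = value

# Overwrite the downloaded file with the renamed weights, keeping the epoch
# and optimizer state. No other entries of the original checkpoint are copied.
torch.save({
    'student': student,
    'teacher': teacher,
    'epoch': checkpoint['epoch'],
    'optimizer': checkpoint['optimizer'],
}, "checkpoint.pth")

# Free the memory held by the loaded tensors.
# The trailing semicolon only hides the return value in a notebook cell.
del checkpoint, student, teacher
gc.collect();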

Training now starts at a much lower loss, and I see the following messages.

Found checkpoint at ./checkpoint.pth
=> loaded student from checkpoint './checkpoint.pth' with msg <All keys matched successfully>
=> loaded teacher from checkpoint './checkpoint.pth' with msg <All keys matched successfully>
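
The two loops in the script above differ only in the `module.` prefix, so the renaming could be factored into one helper. This is a minimal sketch of that idea, not part of the DINO code. `RENAMES` and `rename_keys` are hypothetical names, and the sample keys and integer values are placeholders for real parameter names and tensors. It works on plain dictionaries, so it can be tried without PyTorch.

```python
# Old name -> new name, as used in the renaming script above.
RENAMES = {"projection_head": "mlp", "prototypes": "last_layer"}


def rename_keys(state_dict, prefix=""):
    """Return a copy of state_dict with old key names replaced and prefix prepended."""
    renamed = {}
    for key, value in state_dict.items():
        for old, new in RENAMES.items():
            if old in key:
                key = key.replace(old, new)
                break  # like the if/elif in the script: apply at most one rename
        renamed[prefix + key] = value
    return renamed


# Illustrative keys with placeholder values (real values would be tensors).
old = {
    "head.projection_head.0.weight": 1,
    "head.prototypes.weight": 2,
    "backbone.cls_token": 3,
}

student = rename_keys(old, prefix="module.")  # student keys get the prefix
teacher = rename_keys(old)                    # teacher keys do not
print(sorted(student))
print(sorted(teacher))
```

With a real checkpoint, the calls would presumably be `rename_keys(checkpoint['student'], prefix="module.")` and `rename_keys(checkpoint['teacher'])`, and the results would be saved as in the script above.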