
Resuming training size mismatch

See original GitHub issue

I'm getting size mismatches across the entire checkpoint. Errors like this:

        size mismatch for transformer.layers.blocks.12.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.12.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.13.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.13.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.13.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.13.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.14.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.14.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.14.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.14.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.15.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.15.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.15.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.15.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for to_logits.1.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([50688, 1024]).
Killing subprocess 676

whenever I resume from a checkpoint. Were the checkpoint keys changed recently?
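The error means the shapes stored in the checkpoint's state_dict (here, `torch.Size([1])` placeholders) no longer match the parameters of the freshly built model, so a strict `load_state_dict` refuses every affected key. A minimal, torch-free sketch of that diagnosis — comparing shape dictionaries rather than real tensors, with all names illustrative — looks like this:

```python
# Hypothetical sketch: report which state_dict keys conflict between a
# checkpoint and the current model. Shapes are plain tuples here so the
# idea is visible without loading real weights; in practice you would
# compare `{k: v.shape for k, v in torch.load(path).items()}` against
# the shapes of `model.state_dict()`.

def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {key: (checkpoint_shape, model_shape)} for every conflicting key."""
    mismatches = {}
    for key, model_shape in model_shapes.items():
        ckpt_shape = ckpt_shapes.get(key)
        if ckpt_shape is not None and ckpt_shape != model_shape:
            mismatches[key] = (ckpt_shape, model_shape)
    return mismatches

# Shapes mimicking the log above: the checkpoint holds (1,) placeholders
# while the live model expects full weight matrices.
ckpt = {
    "transformer.layers.blocks.12.g.net.fn.fn.net.0.weight": (1,),
    "to_logits.1.weight": (1,),
}
model = {
    "transformer.layers.blocks.12.g.net.fn.fn.net.0.weight": (8192, 1024),
    "to_logits.1.weight": (50688, 1024),
}

for key, (got, want) in find_shape_mismatches(ckpt, model).items():
    print(f"size mismatch for {key}: checkpoint {got}, model {want}")
```

Note that every checkpoint shape in the log is `(1,)`, which points at partitioned/offloaded parameters being saved rather than full weights, not at a renamed-key problem.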

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
janEbert commented, Apr 22, 2021

Yeah, we can’t avoid that now if we want to support offloading and partitioning. I’ll fix it.
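The "offloading and partitioning" remark explains the `(1,)` shapes: under ZeRO-style partitioning, each rank's checkpoint holds only its flat shard of a parameter, so no single file contains the full weight that a plain `torch.load` + `load_state_dict` expects. A toy, stdlib-only illustration of that shard-and-reassemble idea (illustrative logic, not DeepSpeed's actual code):

```python
# Hypothetical sketch of why resumed shapes can look like torch.Size([1]):
# with parameter partitioning, each rank saves only its flat shard, and the
# full weight exists again only after all shards are reassembled.

def partition(flat, num_ranks):
    """Split a flat parameter into per-rank shards (last shard may be shorter)."""
    shard_size = -(-len(flat) // num_ranks)  # ceiling division
    return [flat[i * shard_size:(i + 1) * shard_size] for i in range(num_ranks)]

def reassemble(shards):
    """Concatenate per-rank shards back into the full flat parameter."""
    return [x for shard in shards for x in shard]

full = list(range(10))        # stand-in for a flattened weight tensor
shards = partition(full, 4)   # what each rank's checkpoint would contain
assert reassemble(shards) == full
```

In practice the usual remedies are to resume through the same DeepSpeed engine that saved the checkpoint (`model_engine.load_checkpoint(...)`) or to consolidate shards into a regular fp32 state_dict with the `zero_to_fp32.py` utility that DeepSpeed drops into checkpoint directories — verify both names against your DeepSpeed version.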

1 reaction
janEbert commented, Apr 28, 2021

The possible issue I’m seeing is that DeepSpeed does not handle multiple ZeRO-enabled models at once for whatever reason (e.g. both wanting to take all GPU memory, unhandled shared global state, …). I’m not sure, though; I’d need to look into it. If, and only if, that’s the case, splitting the models wouldn’t help either. I haven’t figured out how to handle that case quite yet. 😃

Read more comments on GitHub >

Top Results From Across the Web

Loaded keras model fails to continue training, dimensions ...
I want to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not...
Read more >
Training my pretrained model in different dataset and I got an ...
RuntimeError: Error(s) in loading state_dict for Generator: size mismatch for crop_encoder.bn1.embed.weight: copying a param with shape torch.
Read more >
Issue with loading model - fast.ai Course Forums
Size ([616, 512]). size mismatch for 1.8.bias: copying a param with shape torch ... I have spent a long time training a model...
Read more >
Size mismatch Simulink after model restart - MATLAB Answers
previously, the Simulink model compiled without problems, but after a restart it gives the error: "Size mismatch (size [1x1] ~= size[5x1])" ...
Read more >
How to Solve HP Paper Mismatch Error for Windows
The “Paper Mismatch” or “Paper Size Mismatch” error can occur when the paper, envelopes or other media loaded in the printer tray or...
Read more >
