Problem loading pretrained stories model
I am having the same issue as #209 (now closed). When I run generate.py on the model files in stories_checkpoint.tar.bz2, I get:
| [wp_source] dictionary: 19032 types
| [wp_target] dictionary: 112832 types
| data-bin/writingPrompts test 15138 examples
| loading model(s) from models/fusion_checkpoint.pt
| loading pretrained model
RuntimeError: Error(s) in loading state_dict for FConvModelSelfAtt:
While copying the parameter named "encoder.encoder.embed_tokens.weight", whose dimensions in the model are torch.Size([19032, 256]) and whose dimensions in the checkpoint are torch.Size([19025, 256]).
While copying the parameter named "decoder.embed_tokens.weight", whose dimensions in the model are torch.Size([112832, 256]) and whose dimensions in the checkpoint are torch.Size([104960, 256]).
While copying the parameter named "decoder.fc3.weight", whose dimensions in the model are torch.Size([112832, 256]) and whose dimensions in the checkpoint are torch.Size([104960, 256]).
While copying the parameter named "decoder.fc3.bias", whose dimensions in the model are torch.Size([112832]) and whose dimensions in the checkpoint are torch.Size([104960]).
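For context, the generate invocation follows the stories example. A minimal sketch, using the paths from the log above (the remaining flags are assumptions and may not match the exact command that was run):

```bash
# Sketch of the generate call (paths taken from the log above; other flags assumed)
python generate.py data-bin/writingPrompts \
  --path models/fusion_checkpoint.pt \
  --batch-size 32 --beam 1 --nbest 1
```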
I can see that a mismatch between the vocabulary sizes in my binarized data and those in the checkpoint is causing the problem. However, I binarized the writingPrompts dataset by running preprocess.py exactly as specified in the example. Here is the output of that script:
| [wp_source] Dictionary: 19031 types
| [wp_source] examples/stories/writingPrompts/train.wp_source: 272600 sents, 8008372 tokens, 1.36% replaced by <unk>
| [wp_source] Dictionary: 19031 types
| [wp_source] examples/stories/writingPrompts/valid.wp_source: 15620 sents, 469336 tokens, 2.1% replaced by <unk>
| [wp_source] Dictionary: 19031 types
| [wp_source] examples/stories/writingPrompts/test.wp_source: 15138 sents, 440659 tokens, 2.24% replaced by <unk>
| [wp_target] Dictionary: 112831 types
| [wp_target] examples/stories/writingPrompts/train.wp_target: 272600 sents, 184176859 tokens, 0.771% replaced by <unk>
| [wp_target] Dictionary: 112831 types
| [wp_target] examples/stories/writingPrompts/valid.wp_target: 15620 sents, 10496165 tokens, 0.888% replaced by <unk>
| [wp_target] Dictionary: 112831 types
| [wp_target] examples/stories/writingPrompts/test.wp_target: 15138 sents, 10244721 tokens, 0.889% replaced by <unk>
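For reference, the binarization command in the stories example looks roughly like the sketch below (the threshold values are assumed from the example README and may differ across fairseq versions):

```bash
# Binarize the writingPrompts data as in the stories example (flags assumed from the README)
TEXT=examples/stories/writingPrompts
python preprocess.py --source-lang wp_source --target-lang wp_target \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/writingPrompts \
  --thresholdtgt 10 --thresholdsrc 10
```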
Top GitHub Comments
To control how the dictionary is padded, there's a preprocessing option --padding-factor that prior to that commit effectively defaulted to 1 but now defaults to 8. If you explicitly pass that option as 1, you should get the same vocab size.
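Illustratively, re-running the binarization with padding disabled might look like the sketch below; apart from --padding-factor 1, the flags are assumed from the stories example:

```bash
# Re-binarize with dictionary padding disabled so the dictionary sizes are not
# rounded up to a multiple of 8 (other flags assumed from the stories example)
python preprocess.py --source-lang wp_source --target-lang wp_target \
  --trainpref examples/stories/writingPrompts/train \
  --validpref examples/stories/writingPrompts/valid \
  --testpref examples/stories/writingPrompts/test \
  --destdir data-bin/writingPrompts \
  --thresholdtgt 10 --thresholdsrc 10 \
  --padding-factor 1
```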
Ok, I did this and re-ran preprocess.py. The resulting vocab size for the target data matched the pretrained model, but my source vocab still had a few extra tokens. I eventually rolled back to a previous commit (745d5fbd7f640e1fd04f17981c4816659ad64c04) and re-ran preprocess.py in order to get the same source vocab as the pretrained model, so there seems to be some recent change that affected the vocab sizes. Once I did this I could run generate.py with the current commit.
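A hypothetical reconstruction of that workaround: binarize the data at the older commit so the source dictionary matches the checkpoint, then switch back to the current commit for generation. The branch name and the preprocess/generate flags below are assumptions, not the exact commands from the thread:

```bash
# Build the dictionaries with the older commit, then return to the current checkout
git checkout 745d5fbd7f640e1fd04f17981c4816659ad64c04
python preprocess.py --source-lang wp_source --target-lang wp_target \
  --trainpref examples/stories/writingPrompts/train \
  --validpref examples/stories/writingPrompts/valid \
  --testpref examples/stories/writingPrompts/test \
  --destdir data-bin/writingPrompts --thresholdtgt 10 --thresholdsrc 10
git checkout master
python generate.py data-bin/writingPrompts --path models/fusion_checkpoint.pt \
  --batch-size 32 --beam 1 --nbest 1
```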