
Trying to pretrain with longer max sequence length

See original GitHub issue

I’ve tried to extend your electra openwebtext example by running your preprocessing script with the appropriate arguments for a max sequence length of 768, as follows:

# Build 768-token features into a separate output directory,
# using 8 dataset-building worker processes.
export PYTHONPATH=.
python pretraining/openwebtext/preprocess.py \
  --max-seq-length 768 \
  --trg-dir data/openwebtext_features_768 \
  --n-dataset-building-processes 8

I then created custom JSON files for the electra generator and discriminator, with the following change relative to small_generator.json and small_discriminator.json:

"embedding_size": 768,

When I try to pretrain with the new tokenization and the changed specification, I get what looks like a dimension mismatch during training. I have been careful to point training at the modified data folder containing the new, longer feature tensors.
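One common way a mismatch like this shows up with longer sequences is a positional embedding table that is still sized for 512 positions. A minimal, generic PyTorch illustration of that failure mode (not code from this repo; the embedding width of 128 is arbitrary):

import torch
import torch.nn as nn

# Positional embedding table with room for only 512 positions.
pos_emb = nn.Embedding(512, 128)

# A 768-token example asks for position ids 0..767.
positions = torch.arange(768)

# Positions 512..767 have no row in the table, so this lookup
# raises an IndexError.
pos_emb(positions)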

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
lucidrains commented, Sep 21, 2020

@DarrenAbramson I think I may have spotted the error. The "max_position_embeddings": 512 setting needs to be increased to 768 as well, or you will hit an out-of-bounds error when trying to look up a positional encoding beyond the default (512).
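Putting the two changes together, the custom generator and discriminator configs described above would presumably carry (all other fields unchanged from small_generator.json and small_discriminator.json):

"embedding_size": 768,
"max_position_embeddings": 768,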

0 reactions
lucidrains commented, Sep 21, 2020

No problem, I’ll see if I can add a line of code to guard against that later tonight. Thanks for trying this out!
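For reference, the guard mentioned here might look roughly like the following (a hypothetical sketch, not the actual commit; the function name and arguments are placeholders):

def check_position_embeddings(max_seq_length, max_position_embeddings):
    # Hypothetical guard: fail fast instead of hitting an index-out-of-bounds
    # error inside the positional embedding lookup during training.
    if max_seq_length > max_position_embeddings:
        raise ValueError(
            f"max_seq_length ({max_seq_length}) exceeds max_position_embeddings "
            f"({max_position_embeddings}); increase max_position_embeddings in "
            "the generator/discriminator config."
        )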

Read more comments on GitHub >

Top Results From Across the Web

token indices sequence length is longer than the specified ...
When I use Bert, the "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)"...
Read more >
Fine-tuning BERT with sequences longer than 512 tokens
Model's Hub handle a maximum input length of 512. Using sequences longer than 512 seems to require training the models from scratch, which...
Read more >
Token indices sequence length is longer than the specified ...
I am using the transformer.encodeplus() method to pass the text into the model. I have tried various mechanisms to truncate the input ids...
Read more >
Context is Everything: Why Maximum Sequence Length Matters
The ability to process long sequences would make distributed training even more cumbersome and slow, because the computation required for each ...
Read more >
efficient sequence packing without cross-contamination - arXiv
(IPU-M2000, 16 accelerator chips), BERT phase 2 pretraining setup as in ... For Phase 2, we use sequence length 384 since longer range ......
Read more >
