
Trying to pretrain with longer max sequence length

See original GitHub issue

I’ve tried to extend your electra openwebtext example by running your preprocessing script with the appropriate arguments for a max sequence length of 768, as follows:

# Build 768-token features into a separate output directory,
# using 8 dataset-building worker processes.
export PYTHONPATH=.
python pretraining/openwebtext/preprocess.py \
  --max-seq-length 768 \
  --trg-dir data/openwebtext_features_768 \
  --n-dataset-building-processes 8

I then created custom JSON files for the electra generator and discriminator, with the following change relative to small_generator.json and small_discriminator.json:

"embedding_size": 768,

When I try to pretrain with the new tokenization and the changed specification, I get what looks like a dimension mismatch during training. I have been careful to point training at the modified data folder containing the new, longer feature tensors.
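One common way a mismatch like this shows up with longer sequences is a positional embedding table that is still sized for 512 positions. A minimal, generic PyTorch illustration of that failure mode (not code from this repo; the embedding width of 128 is arbitrary):

import torch
import torch.nn as nn

# Positional embedding table with room for only 512 positions.
pos_emb = nn.Embedding(512, 128)

# A 768-token example asks for position ids 0..767.
positions = torch.arange(768)

# Positions 512..767 have no row in the table, so this lookup
# raises an IndexError.
pos_emb(positions)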

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
lucidrains commented, Sep 21, 2020

@DarrenAbramson I think I may have spotted the error. The "max_position_embeddings": 512 setting needs to be increased to 768 as well, or you will hit an out-of-bounds error when trying to look up a positional encoding beyond the default (512).
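Putting the two changes together, the custom generator and discriminator configs described above would presumably carry (all other fields unchanged from small_generator.json and small_discriminator.json):

"embedding_size": 768,
"max_position_embeddings": 768,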

0 reactions
lucidrains commented, Sep 21, 2020

No problem, I’ll see if I can add a line of code to guard against that later tonight. Thanks for trying this out!
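For reference, the guard mentioned here might look roughly like the following (a hypothetical sketch, not the actual commit; the function name and arguments are placeholders):

def check_position_embeddings(max_seq_length, max_position_embeddings):
    # Hypothetical guard: fail fast instead of hitting an index-out-of-bounds
    # error inside the positional embedding lookup during training.
    if max_seq_length > max_position_embeddings:
        raise ValueError(
            f"max_seq_length ({max_seq_length}) exceeds max_position_embeddings "
            f"({max_position_embeddings}); increase max_position_embeddings in "
            "the generator/discriminator config."
        )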

Read more comments on GitHub >

Top Results From Across the Web

token indices sequence length is longer than the specified ...
When I use Bert, the "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)"...
Read more >
Fine-tuning BERT with sequences longer than 512 tokens
Model's Hub handle a maximum input length of 512. Using sequences longer than 512 seems to require training the models from scratch, which...
Read more >
Token indices sequence length is longer than the specified ...
I am using the transformer.encodeplus() method to pass the text into the model. I have tried various mechanisms to truncate the input ids...
Read more >
Context is Everything: Why Maximum Sequence Length Matters
The ability to process long sequences would make distributed training even more cumbersome and slow, because the computation required for each ...
Read more >
efficient sequence packing without cross-contamination - arXiv
(IPU-M2000, 16 accelerator chips), BERT phase 2 pretraining setup as in ... For Phase 2, we use sequence length 384 since longer range ......
Read more >
