Trying to pretrain with longer max sequence length
I've tried to extend your ELECTRA OpenWebText example by running your preprocessing script with the appropriate arguments for a max sequence length of 768, as follows:
export PYTHONPATH=.
python pretraining/openwebtext/preprocess.py \
--max-seq-length 768 \
--trg-dir data/openwebtext_features_768 \
--n-dataset-building-processes 8
I then created custom JSON files for the ELECTRA generator and discriminator with the following change relative to small_generator.json and small_discriminator.json:
"embedding_size": 768,
When I try to run pretraining with the new tokenization and the changed configuration, I get what looks like a dimension mismatch during training. I have been careful to pass the modified data folder containing the new, longer feature tensors.
Top GitHub Comments
@DarrenAbramson I think I may have spotted the error. The
"max_position_embeddings": 512
needs to be increased to 768 as well, or you will hit an out-of-bounds error when trying to look up a positional encoding beyond the default (512).

No problem, I'll see if I can add a line of code to guard against that later tonight. Thanks for trying this out!
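For illustration, here is a minimal sketch of both the failure mode and the kind of guard mentioned above. The file name, config keys, and the use of torch.nn.Embedding for the positional table are assumptions made for the example, not code taken from the repository:

```python
import json

import torch
import torch.nn as nn

# The positional-embedding table has max_position_embeddings rows, so looking
# up position 512 or above in a 512-row table fails. This reproduces the
# out-of-bounds behaviour described above, not the repository's exact error.
position_embeddings = nn.Embedding(num_embeddings=512, embedding_dim=128)
position_ids = torch.arange(768)  # position ids for a 768-token sequence
try:
    position_embeddings(position_ids)
except IndexError as exc:
    print(f"position lookup failed as expected: {exc}")

# A guard along the lines mentioned above: refuse to start pretraining when
# the config cannot cover the preprocessed sequence length.
# "small_generator_768.json" is a hypothetical name for the custom config file.
max_seq_length = 768
with open("small_generator_768.json") as f:
    config = json.load(f)

assert config["max_position_embeddings"] >= max_seq_length, (
    f"max_position_embeddings ({config['max_position_embeddings']}) must be "
    f">= the preprocessing max sequence length ({max_seq_length})"
)
```

Checking the config against the preprocessing arguments up front turns a confusing mid-training dimension mismatch into an immediate, readable error.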