Provide comprehensive guide & best-practices for run_language_modeling.py
🚀 Feature request
Provide a comprehensive guide for running the scripts included in the repository, especially run_language_modeling.py, its parameters, and model configurations.
Motivation
- The current version has `argparse`-powered help, in which a lot of parameters are either mysterious or have variable runtime behaviour (e.g. `tokenizer_name` is sometimes a path, and the value the user provides is expected to mean different things for different models, e.g. for RoBERTa and BERT). Also, the help for `tokenizer_name` claims that "If both are None, initialize a new tokenizer.", which does not work at all, e.g. when you use a RoBERTa model. It should handle training the new tokenizer on the provided `train_data` right away (see the tokenizer sketch after this list).
- There are a bunch of parameters that are critical to running the script at all (!) which are not even mentioned here https://huggingface.co/blog/how-to-train or even in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb. For example, for RoBERTa, without `"max_position_embeddings": 514` in the config, the script crashes with `CUDA error: device-side assert triggered`. I had to dig through GitHub to find some unresolved issues around this case and try out a few solutions before the script finally executed (https://github.com/huggingface/transformers/issues/2877). A minimal config sketch is included after this list.
- Models with LM heads will train even though the head output size is different from the vocab size of the tokenizer; the script should warn the user or (better) raise an exception in such scenarios (see the vocab-size check sketched after this list).
- Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?
- Using multi-GPU on a single machine together with the `--evaluate_during_training` parameter crashes the script. Why? It might be worth an explanation; it's probably also a bug (https://github.com/huggingface/transformers/issues/1801).
- Those are just off the top of my head. I will update this issue once I come up with more, or maybe someone else will add something to this thread.
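To make the `tokenizer_name` point concrete, here is a minimal sketch of what "initialize a new tokenizer" could look like for RoBERTa, using the standalone `tokenizers` library; the corpus path, vocab size, and output directory (`./my-roberta`) are placeholder values, and the exact save method depends on the `tokenizers` version:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on the raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],  # placeholder: one or more plain-text files
    vocab_size=52_000,    # placeholder vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which run_language_modeling.py can pick up
# when --tokenizer_name points at this directory.
tokenizer.save_model("./my-roberta")
```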
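Similarly, a sketch of the RoBERTa config that avoids the device-side assert; all sizes except `max_position_embeddings` are placeholders (RoBERTa needs 512 positions plus 2 extra for its padding/offset scheme):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,            # must match the tokenizer's vocab size
    max_position_embeddings=514,  # 512 + 2; without this RoBERTa hits the CUDA device-side assert
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Writes config.json; point the script at this directory via --config_name.
config.save_pretrained("./my-roberta")
```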
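And a sketch of the kind of guard proposed above for the LM-head / vocab-size mismatch; the model and tokenizer below are just an example of a combination that would not match:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./my-roberta")  # 52k-token tokenizer from above
model = RobertaForMaskedLM.from_pretrained("roberta-base")        # 50,265-token LM head

# Fail loudly instead of silently training a model whose LM head
# does not match the tokenizer's vocabulary.
if model.config.vocab_size != tokenizer.vocab_size:
    raise ValueError(
        f"Model vocab size ({model.config.vocab_size}) != "
        f"tokenizer vocab size ({tokenizer.vocab_size})"
    )
```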
Given the number of issues currently open, I suspect that I'm not the only one who struggles with the example script. The biggest problem here is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.
Moreover, by improving the docs and providing a best-practices guide, you can give many people an even better toolkit for their research and business.
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 22
- Comments: 6 (2 by maintainers)
I’ve covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.