Provide comprehensive guide & best-practices for run_language_modeling.py
🚀 Feature request
Provide a comprehensive guide for running the scripts included in the repository, especially run_language_modeling.py, its parameters, and model configurations.
Motivation
- The current version has `argparse`-powered help, in which a lot of parameters are either mysterious or have variable runtime behaviour (e.g. `tokenizer_name` is sometimes a path, and the value the user provides is expected to mean different things for different models, e.g. for RoBERTa and BERT). Also, the help for `tokenizer_name` claims that "If both are None, initialize a new tokenizer.", which does not work at all, e.g. when you use a RoBERTa model. It should handle training the new tokenizer on the provided `train_data` right away (see the tokenizer sketch after this list).
- There are a bunch of parameters that are critical to running the script at all (!) which are not even mentioned here https://huggingface.co/blog/how-to-train or even in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb. For example, for RoBERTa, without `"max_position_embeddings": 514` in the config, the script crashes with `CUDA error: device-side assert triggered`. I had to dig through GitHub to find some unresolved issues around this case and try out a few solutions before the script finally executed (https://github.com/huggingface/transformers/issues/2877). A minimal config sketch is included after this list.
- Models with LM heads will train even though the head output size is different from the vocab size of the tokenizer; the script should warn the user or (better) raise an exception in such scenarios (see the vocab-size check sketched after this list).
- Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?
- Using multi-GPU on a single machine together with the `--evaluate_during_training` parameter crashes the script. Why? It might be worth an explanation; it's probably also a bug (https://github.com/huggingface/transformers/issues/1801).
- Those are just off the top of my head. I will update this issue once I come up with more, or maybe someone else will add something to this thread.
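To make the `tokenizer_name` point concrete, here is a minimal sketch of what "initialize a new tokenizer" could look like for RoBERTa, using the standalone `tokenizers` library; the corpus path, vocab size, and output directory (`./my-roberta`) are placeholder values, and the exact save method depends on the `tokenizers` version:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on the raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],  # placeholder: one or more plain-text files
    vocab_size=52_000,    # placeholder vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which run_language_modeling.py can pick up
# when --tokenizer_name points at this directory.
tokenizer.save_model("./my-roberta")
```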
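Similarly, a sketch of the RoBERTa config that avoids the device-side assert; all sizes except `max_position_embeddings` are placeholders (RoBERTa needs 512 positions plus 2 extra for its padding/offset scheme):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,            # must match the tokenizer's vocab size
    max_position_embeddings=514,  # 512 + 2; without this RoBERTa hits the CUDA device-side assert
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Writes config.json; point the script at this directory via --config_name.
config.save_pretrained("./my-roberta")
```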
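And a sketch of the kind of guard proposed above for the LM-head / vocab-size mismatch; the model and tokenizer below are just an example of a combination that would not match:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./my-roberta")  # 52k-token tokenizer from above
model = RobertaForMaskedLM.from_pretrained("roberta-base")        # 50,265-token LM head

# Fail loudly instead of silently training a model whose LM head
# does not match the tokenizer's vocabulary.
if model.config.vocab_size != tokenizer.vocab_size:
    raise ValueError(
        f"Model vocab size ({model.config.vocab_size}) != "
        f"tokenizer vocab size ({tokenizer.vocab_size})"
    )
```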
Given the number of issues currently open, I suspect that I'm not the only one who struggles with the example script. The biggest problem here is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.
Moreover, by improving the docs and providing a best-practices guide, you can give many people an even better toolkit for their research and business.
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 22
- Comments: 6 (2 by maintainers)
I’ve covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.