
Provide comprehensive guide & best-practices for run_language_modeling.py


🚀 Feature request

Provide a comprehensive guide for running the scripts included in the repository, especially run_language_modeling.py, its parameters, and model configurations.

Motivation

  1. The current version has argparse-powered help, in which a lot of parameters are either mysterious or have variable runtime behaviour (e.g. tokenizer_name is sometimes a path, and the value the user provides is expected to point to different data for different models, e.g. for RoBERTa vs. BERT). Also, for tokenizer_name the help claims "If both are None, initialize a new tokenizer.", which does not work at all, e.g. when you use a RoBERTa model. The script should handle training a new tokenizer on the provided train_data right away (a tokenizer-training sketch follows this list).

  2. There are a bunch of parameters that are critical to running the script at all (!) but are not mentioned in https://huggingface.co/blog/how-to-train or in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb. For example, for RoBERTa, without "max_position_embeddings": 514 in the config, the script crashes with:

    CUDA error: device-side assert triggered
    

    I had to dig through GitHub to find some unresolved issues around this case and try out a few solutions before the script finally executed (https://github.com/huggingface/transformers/issues/2877). A config sketch that avoids this follows this list.

  3. Models with LM heads will train even when the head's output size differs from the tokenizer's vocab size - the script should warn the user or (better) raise an exception in such scenarios (see the vocab-size check after this list).

  4. Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?

  5. Using multi-GPU on a single machine together with --evaluate_during_training crashes the script. Why? It might be worth an explanation; it's probably also a bug (https://github.com/huggingface/transformers/issues/1801).

  6. These are just off the top of my head - I will update this issue once I come up with more, or maybe someone else will add something to this thread.
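
For point 1, here is a minimal sketch of how one might train a new tokenizer before pointing the script at it, assuming a RoBERTa-style byte-level BPE and the tokenizers library. The file paths, vocabulary size, and output directory are placeholders, not recommendations.

```python
# Sketch only: train a byte-level BPE tokenizer for a RoBERTa-style model
# before running run_language_modeling.py. Paths and hyperparameters below
# are placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(
    files=["train.txt"],  # hypothetical plain-text training file
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt; the directory can then be passed to the
# script via --tokenizer_name. (The exact save call depends on the tokenizers
# version: older releases use tokenizer.save("./my_tokenizer") instead.)
tokenizer.save_model("./my_tokenizer")
```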
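For point 2, a sketch of a config that avoids the device-side assert, assuming a small RoBERTa trained from scratch; all sizes are illustrative. The key detail is that RoBERTa offsets its position ids by the padding index, so max_position_embeddings has to be the maximum sequence length plus 2 (512 + 2 = 514).

```python
# Sketch only: build and save a RoBERTa config with the critical
# max_position_embeddings value. All other sizes are illustrative.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,            # must match the tokenizer's vocabulary size
    max_position_embeddings=514,  # 512 tokens + 2 (RoBERTa offsets position ids)
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)

# Produces config.json in the directory, which can be passed via --config_name.
config.save_pretrained("./my_model_config")
```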
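And for point 3, this is the kind of sanity check the script could perform, or that you can run yourself before burning GPU hours; the paths are hypothetical and the check only mirrors what I would expect the script to do.

```python
# Sketch only: fail fast if the LM head / embedding size does not match the
# tokenizer, instead of silently training a mismatched model.
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("./my_tokenizer")  # hypothetical path
model = RobertaForMaskedLM.from_pretrained("./my_model")        # hypothetical path

if model.config.vocab_size != len(tokenizer):
    # If extra tokens were added on purpose, resizing is an alternative:
    # model.resize_token_embeddings(len(tokenizer))
    raise ValueError(
        f"Model vocab size ({model.config.vocab_size}) does not match "
        f"tokenizer size ({len(tokenizer)}); fix the config before training."
    )
```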

Given the number of issues currently open, I suspect I'm not the only one who struggles with the example script. The biggest problem is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.

Moreover, by improving the docs and providing a best-practices guide, you can give many people an even better toolkit for their research and business.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 22
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

11 reactions
marrrcin commented, Mar 17, 2020

1 reaction
stale[bot] commented, Aug 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

