Understanding the language modeling data format
Hello everyone. I'm familiar with Fairseq for translation, but so far I haven't used it for language modeling. In translation, each line is considered the basic unit, and the source lines must be aligned with the target ones. I don't quite understand what the basic units are in language modeling.
I was following some examples, like the one for training RoBERTa from scratch with your own data (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md). It says "Data should be preprocessed following the language modeling format", but I haven't found a specification for this format.
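For reference, the preprocessing step in that README looks roughly like the following (the paths and the GPT-2 BPE dictionary are the README's example values; treat them as placeholders for your own data):

```sh
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
```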
I have downloaded the wikitext dataset and the training set file starts with:
```
= Valkyria Chronicles III =

Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .

= = Gameplay = =
```
I infer that the format is something like: `\n = < Document heading > = \n < Text (potentially including newlines, but not double newlines?) > \n\n = = < Document subheading > = = \n …`
Is this what is meant by the language modeling format? Are headings ignored? Also, in `TokenBlockDataset`, I understand that text is treated as a 1D stream of tokens. If that is the case, and I have a set of different documents concatenated following the said format, will text from different documents be mixed? There is the argument `--sample-break-mode` with options `{none,complete,complete_doc,eos}`, and the docstring for the optional `document_sep_len` says: "document separator size (required for 'complete_doc' break mode). Typically 1 if the sentences have eos and 0 otherwise."
I don't understand the exact meaning of `document_sep_len`.
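To make my confusion concrete, here is a toy sketch of how I imagine the 1D stream could be split back into documents (purely illustrative, not fairseq's implementation; `EOS = 2` is a made-up index):

```python
# Toy illustration (not fairseq code) of splitting a flat token stream
# into documents. Assumption: every sentence ends with an eos token, and
# an empty line between documents contributes one extra eos -- hence
# document_sep_len = 1 in that setting.

EOS = 2  # hypothetical eos index

def split_documents(stream, document_sep_len=1):
    """Split a 1D token stream into documents.

    A document boundary is `document_sep_len` consecutive EOS tokens
    immediately following a sentence-final EOS.
    """
    docs, current = [], []
    i = 0
    while i < len(stream):
        tok = stream[i]
        current.append(tok)
        is_boundary = (
            tok == EOS
            and i + document_sep_len < len(stream)
            and all(stream[i + 1 + k] == EOS for k in range(document_sep_len))
        )
        if is_boundary:
            docs.append(current)
            current = []
            i += document_sep_len  # skip the separator eos token(s)
        i += 1
    if current:
        docs.append(current)
    return docs

# Two documents; the lone EOS in the middle is the empty-line separator.
stream = [5, 6, EOS, 7, EOS, EOS, 8, 9, EOS]
print(split_documents(stream))  # [[5, 6, 2, 7, 2], [8, 9, 2]]
```

If that reading is right, `document_sep_len = 0` would correspond to documents butted directly against each other with no separator token, but I may be misreading the docstring.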
To sum up, and please excuse the long message: suppose I have a set of documents and I don't want them to be mixed within the same samples (i.e., text from one document should not be used to predict text in another). The way to go should be:
- Concatenate all the documents following the aforementioned format, writing each document's name as `\n = < Document heading > = \n`.
- Set `--sample-break-mode` to `complete_doc` and `document_sep_len` to ? (a minimal training invocation is sketched after this list).
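Something like the sketch below is what I have in mind. The data path, architecture, and the remaining hyperparameters are placeholders I've assumed; only `--sample-break-mode complete_doc` is the flag in question (as far as I can tell, `document_sep_len` is a `TokenBlockDataset` constructor argument rather than a CLI flag):

```sh
# Hypothetical sketch: train a causal LM without mixing documents in a sample.
fairseq-train data-bin/my_corpus \
    --task language_modeling \
    --arch transformer_lm \
    --sample-break-mode complete_doc \
    --tokens-per-sample 512 \
    --max-tokens 2048
```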
Finally, if I want some documents to be mixed (i.e., to use information from each other as context), would it be a good idea to put each of these documents, after its respective title, as a sub-heading within the same heading?
Many thanks in advance.
Top GitHub Comments
I think a lot of your questions about headings and subheadings are really about how wikitext-103, the dataset, presents them, which is simply as sentences like any other. Thus, we treat headings and subheadings the same as paragraph text: as a stream of tokens.
If you click on the link, it should take you to the instructions for preprocessing data for the language modeling task [1].
If you'd like to ensure that text from different documents does not get merged into the same block, then you should use `--sample-break-mode complete_doc`. Documents in your input dataset should be separated by an empty line.
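Concretely, a raw input file laid out like this (hypothetical contents, not from the thread) would be treated as two separate documents, because of the blank line between them:

```
First document , sentence one .
First document , sentence two .

Second document , its only sentence .
```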