Understanding the language modeling data format
Hello everyone. I'm familiar with Fairseq for translation, but so far I haven't used it for language modeling. In translation, each line is considered the basic unit, and the source lines must be aligned with the target ones. I don't quite understand what the basic units are in language modeling.
I was following some examples, like the one for training RoBERTa from scratch with your own data (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md). It says "Data should be preprocessed following the language modeling format", but I haven't found a specification for this format.
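For reference, the preprocessing step in that README looks roughly like the following (the paths and the GPT-2 BPE dictionary are the README's example values; treat them as placeholders for your own data):

```sh
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
```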
I have downloaded the wikitext dataset and the training set file starts with:
```
= Valkyria Chronicles III =

Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .

= = Gameplay = =
```
I infer that the format is something like: `\n = < Document heading > = \n < Text (potentially including newlines, but not double newlines?) > \n\n = = < Document subheading > = = \n …`
Is this what is meant by the language modeling format? Are headings ignored? Also, in `TokenBlockDataset`, I understand that text is treated as a 1D stream of tokens. If that is the case, and I have a set of different documents concatenated following the said format, will text from different documents be mixed? There is the argument `--sample-break-mode` with options `{none,complete,complete_doc,eos}`, and the docstring for the optional `document_sep_len` says: "document separator size (required for 'complete_doc' break mode). Typically 1 if the sentences have eos and 0 otherwise."
I don't understand the exact meaning of `document_sep_len`.
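To make my confusion concrete, here is a toy sketch of how I imagine the 1D stream could be split back into documents (purely illustrative, not fairseq's implementation; `EOS = 2` is a made-up index):

```python
# Toy illustration (not fairseq code) of splitting a flat token stream
# into documents. Assumption: every sentence ends with an eos token, and
# an empty line between documents contributes one extra eos -- hence
# document_sep_len = 1 in that setting.

EOS = 2  # hypothetical eos index

def split_documents(stream, document_sep_len=1):
    """Split a 1D token stream into documents.

    A document boundary is `document_sep_len` consecutive EOS tokens
    immediately following a sentence-final EOS.
    """
    docs, current = [], []
    i = 0
    while i < len(stream):
        tok = stream[i]
        current.append(tok)
        is_boundary = (
            tok == EOS
            and i + document_sep_len < len(stream)
            and all(stream[i + 1 + k] == EOS for k in range(document_sep_len))
        )
        if is_boundary:
            docs.append(current)
            current = []
            i += document_sep_len  # skip the separator eos token(s)
        i += 1
    if current:
        docs.append(current)
    return docs

# Two documents; the lone EOS in the middle is the empty-line separator.
stream = [5, 6, EOS, 7, EOS, EOS, 8, 9, EOS]
print(split_documents(stream))  # [[5, 6, 2, 7, 2], [8, 9, 2]]
```

If that reading is right, `document_sep_len = 0` would correspond to documents butted directly against each other with no separator token, but I may be misreading the docstring.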
To sum up, and please excuse the long message: suppose I have a set of documents and I don't want them to be mixed within the same samples (i.e., text from one document should not be used to predict text in another). The way to go should be:
- Concatenate all the documents following the aforementioned format, writing each document's name as `\n = < Document heading > = \n`.
- Set `--sample-break-mode` to `complete_doc` and `document_sep_len` to ? (a minimal training invocation is sketched after this list).
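Something like the sketch below is what I have in mind. The data path, architecture, and the remaining hyperparameters are placeholders I've assumed; only `--sample-break-mode complete_doc` is the flag in question (as far as I can tell, `document_sep_len` is a `TokenBlockDataset` constructor argument rather than a CLI flag):

```sh
# Hypothetical sketch: train a causal LM without mixing documents in a sample.
fairseq-train data-bin/my_corpus \
    --task language_modeling \
    --arch transformer_lm \
    --sample-break-mode complete_doc \
    --tokens-per-sample 512 \
    --max-tokens 2048
```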
Finally, if I want some documents to be mixed (i.e., to use information from each other as context), would it be a good idea to put each of these documents, after its respective title, as a sub-heading within the same heading?
Many thanks in advance.
Top GitHub Comments
I think a lot of your questions about headings and subheadings are really about how wikitext-103, the dataset, presents them, which is simply as sentences like any other. Thus, we treat headings and subheadings the same as paragraph text: as a stream of tokens.
If you click on the link, it should take you to the instructions for preprocessing data for the language modeling task [1].
If you'd like to ensure that text from different documents does not get merged into the same block, then you should use `--sample-break-mode complete_doc`. Documents in your input dataset should be separated by an empty line.
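Concretely, a raw input file laid out like this (hypothetical contents, not from the thread) would be treated as two separate documents, because of the blank line between them:

```
First document , sentence one .
First document , sentence two .

Second document , its only sentence .
```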