How to train a single model over multiple datasets

❓ Questions and Help

How can I build a model using multiple datasets that are sampled during training? By this I mean that the model randomly samples data from DatasetA and DatasetB during training. The model should also sample one dataset more often than the other. Can you point me to how to get that done in fairseq?

fairseq Version: 1.0.0a0+b5a039c
PyTorch Version: 1.10.0
OS (e.g., Linux): linux
How you installed fairseq (pip, source): git source
Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq; cd fairseq; pip install --editable ./
Python version: 3.8.10
CUDA/cuDNN version: cudacore/.11.0.2
GPU models and configuration: NVidia V100SXM2
Any other relevant information:

Top GitHub Comments

gmryu commented, May 26, 2022

As far as I know, stock fairseq does not currently provide such a feature, and certainly not from the command line.

– As for the implementation: if your data is not that huge (how large is too large depends on how much GPU memory the data consumes in total), one way is to write a custom dataset class. Read fairseq/data/language_pair_dataset.py and use it as the starting point for a new .py file of your own. You will then also have to write a custom task class that uses your dataset class. To import your custom code into fairseq at runtime, use --user-dir (you will need to look up how to use it).

Append every dataset into one file. Note which line starts each new dataset, so that you have a region for each dataset (for example, lines 1-10000 are datasetA, lines 10001-15000 are datasetB, …). fairseq-preprocess will preprocess all of the data, so that part is fine. Then, for training, you have to write your own def collate (the function that determines how the dataset creates a batch for training) to make sure each batch contains the correct ratio of data from the different regions. You will also want to add two new command-line arguments to your custom task: one for the line indexes that separate the regions and one for the ratio among regions (or skip the new arguments and hard-code these values in your .py file).
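The region-and-ratio idea above can be sketched in plain Python. This is only the index-selection logic, not fairseq code: the names REGIONS, RATIOS, and sample_batch_indices are hypothetical, the region boundaries reuse the example numbers from the text (converted to 0-based indices), and how this would be wired into a custom dataset or task is left open.

```python
import random

# Hypothetical regions in the concatenated data, as (start, end) with end exclusive.
# Dataset A covers examples 0..9999 and dataset B covers examples 10000..14999.
REGIONS = {"A": (0, 10000), "B": (10000, 15000)}

# Hypothetical sampling weights: draw from A three times as often as from B.
RATIOS = {"A": 3.0, "B": 1.0}


def sample_batch_indices(regions, ratios, batch_size, rng):
    """Pick batch_size example indices, choosing a region for each one by weight."""
    names = list(regions)
    weights = [ratios[name] for name in names]
    # Weighted choice of a region for every slot in the batch.
    chosen = rng.choices(names, weights=weights, k=batch_size)
    indices = []
    for name in chosen:
        start, end = regions[name]
        # Uniform choice of one example inside the chosen region.
        indices.append(rng.randrange(start, end))
    return indices


rng = random.Random(0)
batch = sample_batch_indices(REGIONS, RATIOS, batch_size=8, rng=rng)
print(batch)
```

Sampling a region per example means the ratio holds on average rather than exactly in every batch; if an exact per-batch split is needed, a fixed number of slots per region could be used instead.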

– If your data is huge, mix the data beforehand and split it into multiple folders. That is the only option. In other words, you sample one epoch's worth of data at a time from datasetA and datasetB and write it to data_1_folder, data_2_folder, data_3_folder… yourself, outside of fairseq (you may use C++ or another language to speed this up). The data in each folder is a mixture of A and B in the correct ratio. Then you run fairseq-preprocess on all of them.
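A minimal sketch of this offline mixing step, assuming each example is a single line of text and that sampling with replacement is acceptable. The function name write_mixed_shards, the file name train.txt, and the parameters are hypothetical; parallel data would need source and target lines written in lockstep, and truly huge data would need streaming rather than in-memory lists.

```python
import random
import tempfile
from pathlib import Path


def write_mixed_shards(lines_a, lines_b, prob_a, shard_size, num_shards, out_root, seed=0):
    """Write num_shards folders, each holding shard_size lines mixed from A and B.

    prob_a is the probability that a given line is drawn from dataset A.
    """
    rng = random.Random(seed)
    out_root = Path(out_root)
    paths = []
    for shard in range(1, num_shards + 1):
        # Folder names follow the data_1_folder, data_2_folder, ... pattern.
        folder = out_root / f"data_{shard}_folder"
        folder.mkdir(parents=True, exist_ok=True)
        out_path = folder / "train.txt"  # hypothetical file name
        with out_path.open("w", encoding="utf-8") as f:
            for _ in range(shard_size):
                source = lines_a if rng.random() < prob_a else lines_b
                f.write(rng.choice(source) + "\n")
        paths.append(out_path)
    return paths


# Tiny demonstration with toy data, written to a temporary directory.
demo_root = tempfile.mkdtemp()
shards = write_mixed_shards(
    ["a1", "a2"], ["b1"], prob_a=0.7, shard_size=100, num_shards=3, out_root=demo_root
)
print([str(p) for p in shards])
```

Each resulting folder would still have to go through fairseq-preprocess separately, as described above.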

In training, you can provide multiple data folders, and fairseq will train on them in a round-robin fashion. Use a command like this: fairseq-train /path/data_1_folder:/path/data_2_folder:/path/data_3_folder --train-subset train --valid-subset valid .... When you do this, the model gets its first epoch from data_1_folder, its second epoch from data_2_folder, and so on. When the folders run out, the next epoch starts over from data_1_folder. Every folder must have its own train subset, but only the first folder must have the valid subset (the rest do not need it). The folder order is not shuffled by current fairseq, nor is it switched midway.

(valid is also a mixture of A and B)
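The round-robin behaviour described above can be illustrated with a few lines of plain Python. This is not fairseq's own code, only a sketch of the epoch-to-folder mapping as described in this comment; folder_for_epoch is a hypothetical helper, and epochs are assumed to be numbered from 1.

```python
def folder_for_epoch(data_arg, epoch):
    """Return the folder used for a 1-based epoch, cycling through a colon-separated list."""
    paths = data_arg.split(":")
    return paths[(epoch - 1) % len(paths)]


data_arg = "/path/data_1_folder:/path/data_2_folder:/path/data_3_folder"
schedule = [folder_for_epoch(data_arg, epoch) for epoch in range(1, 6)]
for epoch, folder in enumerate(schedule, start=1):
    print(epoch, folder)
```

With three folders, epoch 4 wraps around to data_1_folder again, so the number of distinct mixtures the model sees is bounded by the number of folders prepared in advance.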

– The first approach requires a lot of fairseq extension, but once you have implemented it, almost everything can be done from the fairseq command line. The second approach requires no fairseq extension, but a lot of code has to be run outside fairseq in advance.

You may also consider a third way: write a dataset that can draw on multiple data folders from the start. This is also a very good approach, but it requires a deeper understanding of fairseq/fairseq_cli/train.py and of how it handles arguments and creates datasets.

martianmartina commented, Nov 19, 2022

@gmryu Hi, thank you so much for your reply. I think it is a great workaround!