How to train a single model over multiple datasets

❓ Questions and Help

How can I build a model using multiple datasets sampled during training? By this I mean the model randomly samples data from DatasetA and DatasetB during training. The model should also sample one dataset more than the other. Can you point me to how to get that done in fairseq?

  • fairseq Version: 1.0.0a0+b5a039c
  • PyTorch Version: 1.10.0
  • OS (e.g., Linux): linux
  • How you installed fairseq (pip, source): git source
  • Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq && cd fairseq && pip install --editable ./
  • Python version: 3.8.10
  • CUDA/cuDNN version: cudacore/.11.0.2
  • GPU models and configuration: NVidia V100SXM2
  • Any other relevant information:

Top GitHub Comments

6 reactions
gmryu commented, May 26, 2022

I believe stock fairseq does not currently provide such a feature, certainly not from the command line.

– As for the implementation: if your data is not that large (what counts as large depends on how much GPU memory the data consumes in total), one way is to write a custom dataset class. Read fairseq/data/language_pair_dataset.py and use it as a template for a new .py file of your own. You will then also need to write a custom task class that uses your dataset class. To import your custom code into fairseq at runtime, use --user-dir (you will need to look up how to use it).

Append every dataset into one file. Take note of which line starts each new dataset, so that you end up with a line range per dataset (for example, lines 1-10000 are datasetA, lines 10001-15000 are datasetB, …). fairseq-preprocess will preprocess all of the data, so that part is fine. Then, for training, you have to write your own def collate (the method a dataset uses to build a batch) to make sure each batch contains the correct ratio of data from the different regions. You will also want to add two new command-line arguments to your custom task: one for the separating line indexes and one for the ratio among regions (or skip the new arguments and hard-code the values in the .py file).
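As a rough illustration of the region idea above, here is a minimal plain-Python sketch that turns the starting lines into index ranges and draws a batch of example indices according to a ratio. The names build_regions and sample_batch_indices are hypothetical and are not part of fairseq. The sketch samples with replacement, so the ratio holds on average rather than exactly in every batch, and hooking it into a custom fairseq dataset or task is not shown.

```python
import random


def build_regions(start_lines, total_lines):
    """Turn 1-based starting lines into 0-based half-open (start, end) ranges."""
    starts = [s - 1 for s in start_lines]
    ends = starts[1:] + [total_lines]
    return list(zip(starts, ends))


def sample_batch_indices(regions, ratios, batch_size, rng):
    """Pick batch_size example indices; each pick first chooses a region
    with probability proportional to its ratio, then a random line in it."""
    chosen = rng.choices(range(len(regions)), weights=ratios, k=batch_size)
    return [rng.randrange(*regions[r]) for r in chosen]


# Lines 1-10000 are datasetA, lines 10001-15000 are datasetB.
regions = build_regions([1, 10001], total_lines=15000)
rng = random.Random(0)
# Sample datasetA three times as often as datasetB.
batch = sample_batch_indices(regions, ratios=[3, 1], batch_size=8, rng=rng)
print(regions)
print(batch)
```

Sampling the region first and the line second keeps the ratio independent of how large each dataset is, which is what lets a small dataset be seen more often than a large one.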

– If your data is huge, mix the data beforehand and split it into multiple folders. That is the only option. You sample epochs from datasetA and datasetB and write them into data_1_folder, data_2_folder, data_3_folder… yourself, outside of fairseq (you may also use C++ or another language to speed this up). The data in each folder is a mixture of A and B in the correct ratio. Then you run fairseq-preprocess on all of them.
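The offline mixing step could look roughly like the sketch below. It is only an illustration under several assumptions: write_mixed_epochs and the file name train.txt are hypothetical, each example is a single line of text, and both datasets are held in memory as lists for brevity (truly huge data would need to be streamed instead). For parallel data, the source and target lines would have to be written in lockstep.

```python
import random
import tempfile
from pathlib import Path


def write_mixed_epochs(lines_a, lines_b, weight_a, weight_b,
                       epoch_size, num_folders, out_root, seed=0):
    """Write num_folders folders, each holding one sampled 'epoch' that
    mixes A and B with probability weight_a : weight_b per line."""
    rng = random.Random(seed)
    p_a = weight_a / (weight_a + weight_b)
    folders = []
    for k in range(1, num_folders + 1):
        folder = Path(out_root) / f"data_{k}_folder"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "train.txt", "w", encoding="utf-8") as f:
            for _ in range(epoch_size):
                source = lines_a if rng.random() < p_a else lines_b
                f.write(rng.choice(source) + "\n")
        folders.append(folder)
    return folders


# Toy usage in a temporary directory: A is sampled three times as often as B.
with tempfile.TemporaryDirectory() as root:
    folders = write_mixed_epochs(["a1", "a2"], ["b1"], 3, 1,
                                 epoch_size=1000, num_folders=3, out_root=root)
    first = (folders[0] / "train.txt").read_text(encoding="utf-8").splitlines()
print([f.name for f in folders], len(first))
```

Each folder's raw text would then still need to go through fairseq-preprocess before training.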

In training, you can give fairseq multiple data folders, and it will train on them in a round-robin fashion. Use a command like this: fairseq-train /path/data_1_folder:/path/data_2_folder:/path/data_3_folder --train-subset train --valid-subset valid .... When you do this, the model gets its first epoch from data_1_folder, its second epoch from data_2_folder, and so on. When the folders run out, the next epoch starts over from data_1_folder. Every folder must have its own train subset, but only the first folder must have the valid subset (the rest do not need one). Current fairseq neither shuffles this folder order nor switches folders midway.

(The valid subset is also a mixture of A and B.)
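To make the round-robin order concrete, the following plain-Python sketch reproduces the schedule described above for the colon-separated path argument. It illustrates the described behavior and is not fairseq's own code; folder_for_epoch is a hypothetical helper.

```python
def folder_for_epoch(data_arg, epoch):
    """Return the data folder used for a given 1-based epoch number."""
    paths = data_arg.split(":")
    # Wrap around to the first folder once every folder has been used.
    return paths[(epoch - 1) % len(paths)]


data_arg = "/path/data_1_folder:/path/data_2_folder:/path/data_3_folder"
schedule = [folder_for_epoch(data_arg, epoch) for epoch in range(1, 6)]
for epoch, folder in enumerate(schedule, start=1):
    print(epoch, folder)
```

With three folders, epoch 4 wraps around to data_1_folder and epoch 5 uses data_2_folder.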

– The first approach requires a lot of fairseq extension work, but once it is implemented, most things can be done from the fairseq command line. The second approach requires no fairseq extension, but a lot of code has to be run outside fairseq in advance.

You may also consider a third way: write a dataset that can draw on multiple data folders from the start. This is also a very good approach, but it requires a deeper understanding of fairseq/fairseq_cli/train.py and how it handles arguments and creates datasets.

0 reactions
martianmartina commented, Nov 19, 2022

@gmryu Hi, thank you so much for your reply. I think it is a great workaround!
