One command to run+aggregate distributed evaluation results
Current Situation
In https://github.com/huggingface/transformers/pull/7105, I wrote a three-command combo to run distributed eval. The three commands are:
```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --fp16 --bs 16
python aggregate_distributed_results.py tmp_gen tmp_gen2
rm -rf tmp_gen
```
- The first command splits up the data and runs `generate` on a chunk for each GPU, saving results to `rank_{rank}.json`.
- The second command combines the JSON results, either just calculating metrics or re-saving the generations to disk as `{save_dir}.pred_target`, `{save_dir}.source`, and (optionally) `{save_dir}.target`.
- The third command deletes the `rank_{rank}.json` files.
- Saving these independent files to disk in the second command is useful for pseudo-labeling, where we train a small model on the predictions of a big model. We have to do more bookkeeping than in `run_eval.py` because I haven't yet figured out how to reorder the predictions to match the original data, so I just read the original data (roughly; there might be truncation issues) and write it back to disk (the aggregation step is sketched below).
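For concreteness, here is a rough sketch of that aggregation step, assuming each `rank_{rank}.json` holds a list of records with `pred`, `source`, and optionally `label` fields. The field names and layout are assumptions, not the exact code in the PR.

```python
# Rough sketch of aggregate_distributed_results.py, NOT the exact PR code.
# Assumes each rank_{rank}.json contains a list of dicts with "pred",
# "source", and optionally "label" keys (field names are illustrative).
import json
from pathlib import Path


def aggregate(json_dir: str, save_prefix: str) -> None:
    records = []
    for path in sorted(Path(json_dir).glob("rank_*.json")):
        records.extend(json.loads(path.read_text()))

    # Re-save the generations (plus the data they came from) so a second
    # model can be trained on the predictions as pseudo-labels.
    preds = [r["pred"] for r in records]
    sources = [r["source"] for r in records]
    Path(f"{save_prefix}.pred_target").write_text("\n".join(preds) + "\n")
    Path(f"{save_prefix}.source").write_text("\n".join(sources) + "\n")
    if all("label" in r for r in records):
        labels = [r["label"] for r in records]
        Path(f"{save_prefix}.target").write_text("\n".join(labels) + "\n")
    # Metric computation (e.g. ROUGE between preds and labels) and writing
    # metrics.json would go here; omitted to keep the sketch short.


if __name__ == "__main__":
    aggregate("tmp_gen", "tmp_gen2")
```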
Goal: one command that uses multiple GPUs, saves the aggregated files, and computes metrics to `metrics.json`. If this command cannot guarantee the ordering of the generations, it must save the necessary data to disk.
The design choices in my dirty first attempt do not need to be kept, as long as we can compute metrics and train a second model with the predictions as the new labels.
There are many ways to accomplish this; here are a few ideas (not mutually exclusive).
Ideally, this would be one command with the order "figured out" somehow, possibly by returning ids from `Seq2SeqDataset`, so that `python run_eval.py (existing_args) --gpus 2` would just work, with no need to save source/labels since they would be in the correct order. To me this sounds hard to implement; I tried briefly and gave up. The goal now is one command, but it doesn't need to be the `run_eval.py` command.
Figuring out ordering by having `Seq2SeqDataset` return ids
- Ideally, this would be one command with the order "figured out" somehow, possibly by returning ids from `Seq2SeqDataset`. Then you don't need to save labels/source documents to disk; you can just reorder the predictions (rough sketch below).
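A minimal sketch of that idea, using hypothetical names (`DatasetWithIds`, `reorder`) rather than the actual `examples/seq2seq` API: each example carries its original index through the dataloader, and the gathered predictions are sorted back into dataset order.

```python
# Hypothetical sketch: attach the original index ("id") to every example so
# predictions gathered from multiple GPUs can be put back in dataset order.
# DatasetWithIds / reorder are illustrative, not the real examples/seq2seq API.
from torch.utils.data import Dataset


class DatasetWithIds(Dataset):
    def __init__(self, base_dataset):
        self.base = base_dataset  # e.g. a Seq2SeqDataset instance

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        item = dict(self.base[index])  # assumes the base dataset returns a dict
        item["id"] = index             # original position in the data file
        return item


def reorder(ids, preds):
    """Sort gathered predictions back into the original dataset order."""
    return [pred for _, pred in sorted(zip(ids, preds))]
```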
Launch N processes in the code rather than from the command line
If we call `torch.multiprocessing.spawn` ourselves, as in https://github.com/facebookresearch/ParlAI/blob/00efcbebb49524918692638ab580cadeebe70cf8/parlai/scripts/multiprocessing_eval.py#L49, we can wait for the results, join them, and do the reordering in one command.
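A minimal sketch of that approach, assuming a hypothetical `generate_on_shard(rank, world_size)` helper that runs generation on one GPU's shard and returns `(ids, preds)`; the port and other details are illustrative.

```python
# Sketch of spawning the workers in-process instead of via
# torch.distributed.launch. generate_on_shard() is a hypothetical stand-in
# for the per-GPU generation loop; it returns (ids, preds) for its shard.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size, return_dict):
    dist.init_process_group(
        "nccl", init_method="tcp://localhost:12355", rank=rank, world_size=world_size
    )
    torch.cuda.set_device(rank)
    return_dict[rank] = generate_on_shard(rank, world_size)  # hypothetical helper
    dist.destroy_process_group()


def run_multigpu_eval(world_size: int):
    manager = mp.Manager()
    return_dict = manager.dict()
    # Blocks until every worker finishes, so we can join + reorder here,
    # in the same command, with no rank_*.json files on disk.
    mp.spawn(_worker, args=(world_size, return_dict), nprocs=world_size, join=True)
    ids, preds = [], []
    for rank in range(world_size):
        shard_ids, shard_preds = return_dict[rank]
        ids.extend(shard_ids)
        preds.extend(shard_preds)
    return [p for _, p in sorted(zip(ids, preds))]  # back in dataset order
```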
Also, metrics diverge a bit from the single-GPU run, hopefully just because `DistributedSortishSampler` adds extra examples here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L258. That issue is out of scope for this PR, just a note. I may separately PR a kwarg to the sampler to not add extra examples, in a non-conflicting way.
Also, the code is still 4/10 clean; feel free to rename variables / improve readability as you see fit.
Cleaned up a bit here. It is super fast!
This is the current workflow, though 😦