One command to run+aggregate distributed evaluation results
Current Situation
In https://github.com/huggingface/transformers/pull/7105, I wrote a three-command combo to run distributed eval. The three commands are:
```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --fp16 --bs 16
python aggregate_distributed_results.py tmp_gen tmp_gen2
rm -rf tmp_gen
```
- The first command splits up the data and runs `generate` on a chunk for each GPU, saving results to `rank_{rank}.json`.
- The second command combines the JSON results, either just calculating metrics or re-saving the generations to disk as `{save_dir}.pred_target`, `{save_dir}.source`, and (optionally) `{save_dir}.target`.
- The third command deletes the `rank_{rank}.json` files.
- Saving these independent files to disk in the second command is useful for pseudo-labeling, where we train a small model on the predictions of a big model. We have to do more bookkeeping than in `run_eval.py` because I haven't yet figured out how to reorder the predictions to match the original data, so I just read the original data (roughly; there might be truncation issues) and write it back to disk (the aggregation step is sketched below).
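For concreteness, here is a rough sketch of that aggregation step, assuming each `rank_{rank}.json` holds a list of records with `pred`, `source`, and optionally `label` fields. The field names and layout are assumptions, not the exact code in the PR.

```python
# Rough sketch of aggregate_distributed_results.py, NOT the exact PR code.
# Assumes each rank_{rank}.json contains a list of dicts with "pred",
# "source", and optionally "label" keys (field names are illustrative).
import json
from pathlib import Path


def aggregate(json_dir: str, save_prefix: str) -> None:
    records = []
    for path in sorted(Path(json_dir).glob("rank_*.json")):
        records.extend(json.loads(path.read_text()))

    # Re-save the generations (plus the data they came from) so a second
    # model can be trained on the predictions as pseudo-labels.
    preds = [r["pred"] for r in records]
    sources = [r["source"] for r in records]
    Path(f"{save_prefix}.pred_target").write_text("\n".join(preds) + "\n")
    Path(f"{save_prefix}.source").write_text("\n".join(sources) + "\n")
    if all("label" in r for r in records):
        labels = [r["label"] for r in records]
        Path(f"{save_prefix}.target").write_text("\n".join(labels) + "\n")
    # Metric computation (e.g. ROUGE between preds and labels) and writing
    # metrics.json would go here; omitted to keep the sketch short.


if __name__ == "__main__":
    aggregate("tmp_gen", "tmp_gen2")
```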
Goal: one command that uses multiple GPUs, saves the aggregated files, and computes metrics to `metrics.json`. If this command cannot guarantee the ordering of the generations, it must save the necessary data to disk.
The design choices in my dirty first attempt do not need to be kept, as long as we can compute metrics and train a second model with the predictions as the new labels.
There are many ways to accomplish this; here are a few ideas (not mutually exclusive).
Ideally, this would be one command with the order "figured out" somehow, possibly by returning ids from `Seq2SeqDataset`, so that `python run_eval.py (existing_args) --gpus 2` would just work, with no need to save source/labels since they would be in the correct order. To me this sounds hard to implement; I tried briefly and gave up. The goal now is one command, but it doesn't need to be the `run_eval.py` command.
Figuring out ordering by having `Seq2SeqDataset` return ids
- Ideally, this would be one command with the order "figured out" somehow, possibly by returning ids from `Seq2SeqDataset`. Then you don't need to save labels/source documents to disk; you can just reorder the predictions (rough sketch below).
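A minimal sketch of that idea, using hypothetical names (`DatasetWithIds`, `reorder`) rather than the actual `examples/seq2seq` API: each example carries its original index through the dataloader, and the gathered predictions are sorted back into dataset order.

```python
# Hypothetical sketch: attach the original index ("id") to every example so
# predictions gathered from multiple GPUs can be put back in dataset order.
# DatasetWithIds / reorder are illustrative, not the real examples/seq2seq API.
from torch.utils.data import Dataset


class DatasetWithIds(Dataset):
    def __init__(self, base_dataset):
        self.base = base_dataset  # e.g. a Seq2SeqDataset instance

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        item = dict(self.base[index])  # assumes the base dataset returns a dict
        item["id"] = index             # original position in the data file
        return item


def reorder(ids, preds):
    """Sort gathered predictions back into the original dataset order."""
    return [pred for _, pred in sorted(zip(ids, preds))]
```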
Launch N processes in the code rather than from the command line
If we call `torch.multiprocessing.spawn` ourselves, as in https://github.com/facebookresearch/ParlAI/blob/00efcbebb49524918692638ab580cadeebe70cf8/parlai/scripts/multiprocessing_eval.py#L49, we can wait for the results, join them, and do the reordering in one command.
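A minimal sketch of that approach, assuming a hypothetical `generate_on_shard(rank, world_size)` helper that runs generation on one GPU's shard and returns `(ids, preds)`; the port and other details are illustrative.

```python
# Sketch of spawning the workers in-process instead of via
# torch.distributed.launch. generate_on_shard() is a hypothetical stand-in
# for the per-GPU generation loop; it returns (ids, preds) for its shard.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size, return_dict):
    dist.init_process_group(
        "nccl", init_method="tcp://localhost:12355", rank=rank, world_size=world_size
    )
    torch.cuda.set_device(rank)
    return_dict[rank] = generate_on_shard(rank, world_size)  # hypothetical helper
    dist.destroy_process_group()


def run_multigpu_eval(world_size: int):
    manager = mp.Manager()
    return_dict = manager.dict()
    # Blocks until every worker finishes, so we can join + reorder here,
    # in the same command, with no rank_*.json files on disk.
    mp.spawn(_worker, args=(world_size, return_dict), nprocs=world_size, join=True)
    ids, preds = [], []
    for rank in range(world_size):
        shard_ids, shard_preds = return_dict[rank]
        ids.extend(shard_ids)
        preds.extend(shard_preds)
    return [p for _, p in sorted(zip(ids, preds))]  # back in dataset order
```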
Also, metrics diverge a bit from the single-GPU run, hopefully just because `DistributedSortishSampler` adds extra examples here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L258. That issue is out of scope for this PR, just a note. I may separately PR a kwarg to the sampler to not add extra examples, in a non-conflicting way.
Also, the code is still 4/10 clean; feel free to rename variables / improve readability as you see fit.
Cleaned up a bit here. It is super fast!
This is the current workflow, though 😦