
One command to run+aggregate distributed evaluation results


Current Situation

In https://github.com/huggingface/transformers/pull/7105, I wrote a three-command combo to run distributed eval. The three commands are:

python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --fp16 --bs 16
python aggregate_distributed_results.py tmp_gen  tmp_gen2
rm -rf tmp_gen
  • The first command splits up the data and runs generate on a chunk for each GPU, saving results to rank_{rank}.json.
  • The second command combines the JSON results, either just calculating metrics or re-saving the generations to disk as {save_dir}.pred_target, {save_dir}.source, and (optionally) {save_dir}.target (see the sketch after this list).
  • The third command deletes the rank_{rank}.json files.
  • Saving these independent files to disk in the second command is useful for pseudolabeling, where we train a small model on the predictions of a big model. We have to do more book-keeping than in run_eval.py because I haven’t yet determined how to reorder predictions to match the original data, so I just save the original data (roughly; there might be truncation issues) and then write it back to disk.
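
For concreteness, here is a rough sketch of what the second command does conceptually. The record field names (pred, source, label) and helper names below are illustrative assumptions, not the actual schema of aggregate_distributed_results.py:

import json
from pathlib import Path

def gather_rank_files(save_dir):
    # Collect every rank_{rank}.json written by the launch step.
    records = []
    for path in sorted(Path(save_dir).glob("rank_*.json")):
        records.extend(json.loads(path.read_text()))
    return records

def resave(records, save_path):
    # Re-save predictions plus the (rough) original data, since the
    # cross-rank ordering is not guaranteed to match the source file.
    for field, suffix in [("pred", "pred_target"), ("source", "source"), ("label", "target")]:
        with open(f"{save_path}.{suffix}", "w") as f:
            f.writelines(r[field] + "\n" for r in records)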

Goal: 1 command that uses multiple GPUs, saves aggregated files, and computes metrics, writing them to metrics.json. If this command cannot guarantee the ordering of the generations, it must save the necessary data to disk.

The design choices in my dirty first attempt do not need to be continued, as long as we can compute metrics and train a second model with the predictions as the new labels.

There are many ways to accomplish this; here are a few ideas (not mutually exclusive).

Ideally, this would be 1 command with the order “figured out” somehow, so that

python run_eval.py (existing_args) --gpus 2

would just work, with no need to save source/labels since they are in the correct order. To me this sounds hard to implement; I tried briefly and gave up. The goal now is 1 command, but it doesn’t need to be the run_eval.py command.

Figuring out ordering by having Seq2SeqDataset return ids

  • If Seq2SeqDataset returned an id with each example, the order could be “figured out” after the fact: there would be no need to save labels/source documents to disk, you could just reorder the predictions by id (sketched below).
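
A minimal sketch of that idea, assuming the existing Seq2SeqDataset from examples/seq2seq/utils.py can be subclassed; the "id" key and the reorder helper are hypothetical, not part of the current code:

from utils import Seq2SeqDataset  # assumed: the existing dataset in examples/seq2seq

class Seq2SeqDatasetWithIds(Seq2SeqDataset):
    def __getitem__(self, index):
        # Attach the original position so it survives shuffling/sharding.
        example = super().__getitem__(index)
        example["id"] = index
        return example

def reorder_by_id(id_pred_pairs):
    # id_pred_pairs: (id, prediction) tuples gathered from all ranks.
    return [pred for _, pred in sorted(id_pred_pairs)]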

Launch n processes in the code rather than from the command line

If we call torch.multiprocessing.spawn ourselves, as in https://github.com/facebookresearch/ParlAI/blob/00efcbebb49524918692638ab580cadeebe70cf8/parlai/scripts/multiprocessing_eval.py#L49, we can wait for the results, join them, and do the reordering in one command.
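
A hedged sketch of that flow; the worker body and argument layout are placeholders, but torch.multiprocessing.spawn itself works as shown:

import torch.multiprocessing as mp

def eval_worker(rank, world_size, args):
    # Each worker would init its process group, run generate on its
    # shard, and write or return its chunk of predictions (omitted).
    ...

def main(args, world_size=2):
    # join=True blocks until every worker finishes, so the parent
    # process can aggregate and reorder immediately afterwards,
    # all within a single command.
    mp.spawn(eval_worker, args=(world_size, args), nprocs=world_size, join=True)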

Top GitHub Comments

sshleifer commented, Sep 14, 2020

Also, metrics diverge a bit from 1 GPU, hopefully because DistributedSortishSampler adds extra examples here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L258

That issue is out of scope for this PR, just a note. I may separately PR a kwarg to the sampler to not add extra examples, in a non-conflicting way.
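
For context, this is presumably the same padding that torch's stock DistributedSampler does: the index list is extended so it divides evenly across ranks, and the duplicated examples can skew aggregate metrics slightly. A sketch of that padding logic (not the actual DistributedSortishSampler code):

import math

def pad_indices(indices, num_replicas):
    # Repeat the first few indices so every rank gets an equal-sized shard.
    total_size = math.ceil(len(indices) / num_replicas) * num_replicas
    return indices + indices[: total_size - len(indices)]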

Also, the code is still 4/10 clean; feel free to rename variables/improve readability as you see fit.

sshleifer commented, Sep 14, 2020

Cleaned up a bit here. It is super fast!

This is the current workflow though 😦

python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py --model_name sshleifer/distilbart-xsum-12-3 --save_dir tmp_gen --input_path xsum --type_path test  --max_source_length 1024 --length_penalty 0.6
python aggregate_distributed_results.py tmp_gen tmp_gen --just_metrics
mv tmp_gen/metrics.json test_rouge.json
rm -rf tmp_gen