
Could `join` replace `gather_for_metrics` and perform it automatically?

See original GitHub issue

Hi,

Great job with accelerate!

One persistent headache that I keep experiencing is repeated samples during distributed evaluation. Although there is the gather_for_metrics functionality, it doesn’t always work, depending on the output of the model; for example, Faster R-CNN in Torchvision outputs a list of dictionaries containing tensors.
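For concreteness, here is a small illustration (values are random stand-ins) of the kind of output in question: in eval mode, torchvision’s Faster R-CNN returns one dictionary of variable-length tensors per image, so a batch of predictions is a list of dicts whose tensor sizes differ per image and per process.

```python
# Illustrative only: the shape of torchvision Faster R-CNN predictions in eval mode.
# The point is the nesting and the variable-length tensors, which make
# padding/truncating for a distributed gather awkward.
import torch

predictions = [
    {
        "boxes": torch.rand(3, 4),             # (num_detections, 4)
        "labels": torch.randint(0, 91, (3,)),   # COCO-style class ids
        "scores": torch.rand(3),
    },
    {
        "boxes": torch.rand(5, 4),              # a different number of detections
        "labels": torch.randint(0, 91, (5,)),
        "scores": torch.rand(5),
    },
]
```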

I was wondering if it would be possible to remove this duplication behaviour entirely, using something like the join context manager, which enables training on uneven inputs.
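For reference, a minimal sketch of the PyTorch Join context manager being referred to here, assuming a DDP-wrapped model (DDP implements the Joinable protocol). Ranks that run out of batches early shadow the collective calls of the ranks still working:

```python
import torch
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(model: DDP, optimizer, dataloader):
    # Each rank may see a different number of batches; Join lets the ranks
    # that finish first keep answering the collectives of the others.
    with Join([model]):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            loss.backward()
            optimizer.step()
```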

If there is appetite for this, I would be happy to help you explore options.

Thanks

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Sep 29, 2022

Accelerate was built when there was no join context manager, and the fact that each dataloader returns the same number of samples in all processes is pretty ingrained in the library. We could make this evolve in the future with a major release, but that would mean a lot of changes (and breaking functionality).

For now I’d rather investigate why your use case is not supported as is (all methods should support a list of dictionaries) rather than rewrite the whole library.

1 reaction
sgugger commented, Oct 3, 2022

@Chris-hughes10 This is mostly used in the whole evaluation part of Accelerate (so gather and gather_for_metrics); for training we don’t really care (plus the dataloaders very often have drop_last=True during training, so there is no problem there).
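For context, a rough sketch of the evaluation pattern being discussed: gather_for_metrics collects predictions from all processes and drops the samples that were duplicated to pad the last batch. The batch layout (the "input" key) is a hypothetical placeholder, not anything prescribed by Accelerate:

```python
import torch
from accelerate import Accelerator

def distributed_eval(model, eval_dataloader):
    accelerator = Accelerator()
    model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
    model.eval()
    all_preds = []
    for batch in eval_dataloader:
        with torch.no_grad():
            logits = model(batch["input"])  # hypothetical batch layout
        # Gathers across processes and removes the duplicated padding samples.
        all_preds.append(accelerator.gather_for_metrics(logits))
    return torch.cat(all_preds)
```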

I’m open to starting to explore a different way to go, probably with a new flag in the accelerator. The first thing would be to have our batch sampler handle a non-fixed batch size (which would also be useful for training), then add this flag so that we don’t cycle through the dataset but return different lengths on different processes, and finally add a wrapper around join.

Does that sound reasonable? If so, I can start a project summarizing the steps and we can share the work.
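To make the last step of that plan concrete, here is a purely hypothetical sketch of what a wrapper around join could look like. Nothing like this existed in Accelerate at the time, and the helper name is made up for illustration; it simply falls back to a no-op when not running distributed:

```python
from contextlib import contextmanager

import torch.distributed as dist
from torch.distributed.algorithms.join import Join

@contextmanager
def join_uneven_inputs(joinables):
    """Hypothetical wrapper: use Join when distributed, otherwise do nothing."""
    if dist.is_available() and dist.is_initialized():
        with Join(list(joinables)):
            yield
    else:
        yield
```

A flag on the accelerator, as proposed above, could then control whether dataloaders return different lengths on different processes and whether the loop runs under such a context manager.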

Read more comments on GitHub

Top Results From Across the Web

  • Quick tour - Hugging Face: You can perform regular evaluation in your training script, ... the gather_for_metrics() method to automatically remove the duplicated data while gathering.
  • Can we replace right join with left join - YouTube: Why is right join required when we have left join | use of right join | why use ... SQL Server Performance Tuning...
  • Releases · mirrors / huggingface / pytorch-pretrained-bert - GitCode: Table Transformer is a model that can perform table extraction and table structure recognition from unstructured documents based on the DETR architecture.
  • A simple way to train and use PyTorch models with multi-GPU ...: This will generate a config file that will be used automatically to ... around Join to enable training with uneven inputs when using...
  • A simple way to train and use NLP models with multi-GPU ...: As you can see on this example, by adding 5-lines to any standard ... This will generate a config file that will be...
