
Could `join` replace `gather_for_metrics` and perform it automatically?

See original GitHub issue

Hi,

Great job with accelerate!

One persistent headache that I keep experiencing is repeated samples during distributed evaluation. Although there is the gather_for_metrics functionality, it doesn’t always work, depending on the output of the model; for example, Faster R-CNN in Torchvision outputs a list of dictionaries containing tensors.
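For concreteness, here is a small illustration (values are random stand-ins) of the kind of output in question: in eval mode, torchvision’s Faster R-CNN returns one dictionary of variable-length tensors per image, so a batch of predictions is a list of dicts whose tensor sizes differ per image and per process.

```python
# Illustrative only: the shape of torchvision Faster R-CNN predictions in eval mode.
# The point is the nesting and the variable-length tensors, which make
# padding/truncating for a distributed gather awkward.
import torch

predictions = [
    {
        "boxes": torch.rand(3, 4),             # (num_detections, 4)
        "labels": torch.randint(0, 91, (3,)),   # COCO-style class ids
        "scores": torch.rand(3),
    },
    {
        "boxes": torch.rand(5, 4),              # a different number of detections
        "labels": torch.randint(0, 91, (5,)),
        "scores": torch.rand(5),
    },
]
```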

I was wondering if it would be possible to remove this duplication behaviour entirely, using something like the join context manager, which enables training on uneven inputs.
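For reference, a minimal sketch of the PyTorch Join context manager being referred to here, assuming a DDP-wrapped model (DDP implements the Joinable protocol). Ranks that run out of batches early shadow the collective calls of the ranks still working:

```python
import torch
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(model: DDP, optimizer, dataloader):
    # Each rank may see a different number of batches; Join lets the ranks
    # that finish first keep answering the collectives of the others.
    with Join([model]):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            loss.backward()
            optimizer.step()
```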

If there is appetite for this, I would be happy to help you explore options.

Thanks

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Sep 29, 2022

Accelerate was built when there was no join context manager, and the fact that each dataloader returns the same number of samples in all processes is pretty ingrained in the library. We could make this evolve in the future with a major release, but that would mean a lot of changes (and breaking functionality).

For now I’d rather investigate why your use case is not supported as is (all methods should support a list of dictionaries) rather than rewrite the whole library.

1 reaction
sgugger commented, Oct 3, 2022

@Chris-hughes10 This is mostly used in the whole evaluation part of Accelerate (so gather and gather_for_metrics); for training we don’t really care (plus the dataloaders very often have drop_last=True during training, so there is no problem there).
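For context, a rough sketch of the evaluation pattern being discussed: gather_for_metrics collects predictions from all processes and drops the samples that were duplicated to pad the last batch. The batch layout (the "input" key) is a hypothetical placeholder, not anything prescribed by Accelerate:

```python
import torch
from accelerate import Accelerator

def distributed_eval(model, eval_dataloader):
    accelerator = Accelerator()
    model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
    model.eval()
    all_preds = []
    for batch in eval_dataloader:
        with torch.no_grad():
            logits = model(batch["input"])  # hypothetical batch layout
        # Gathers across processes and removes the duplicated padding samples.
        all_preds.append(accelerator.gather_for_metrics(logits))
    return torch.cat(all_preds)
```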

I’m open to starting to explore a different way to go, probably with a new flag in the accelerator. The first thing would be to have our batch sampler handle a non-fixed batch size (which would also be useful for training), then add this flag so that we don’t cycle through the dataset but return different lengths on different processes, and finally add a wrapper around join.

Does that sound reasonable? If so, I can start a project summarizing the steps and we can share the work.
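To make the last step of that plan concrete, here is a purely hypothetical sketch of what a wrapper around join could look like. Nothing like this existed in Accelerate at the time, and the helper name is made up for illustration; it simply falls back to a no-op when not running distributed:

```python
from contextlib import contextmanager

import torch.distributed as dist
from torch.distributed.algorithms.join import Join

@contextmanager
def join_uneven_inputs(joinables):
    """Hypothetical wrapper: use Join when distributed, otherwise do nothing."""
    if dist.is_available() and dist.is_initialized():
        with Join(list(joinables)):
            yield
    else:
        yield
```

A flag on the accelerator, as proposed above, could then control whether dataloaders return different lengths on different processes and whether the loop runs under such a context manager.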

Read more comments on GitHub

Top Results From Across the Web

  • Quick tour - Hugging Face: You can perform regular evaluation in your training script, ... the gather_for_metrics() method to automatically remove the duplicated data while gathering.
  • Can we replace right join with left join - YouTube: Why is right join required when we have left join | use of right join | why use ... SQL Server Performance Tuning...
  • Releases · mirrors / huggingface / pytorch-pretrained-bert - GitCode: Table Transformer is a model that can perform table extraction and table structure recognition from unstructured documents based on the DETR architecture.
  • A simple way to train and use PyTorch models with multi-GPU ...: This will generate a config file that will be used automatically to ... around Join to enable training with uneven inputs when using...
  • A simple way to train and use NLP models with multi-GPU ...: As you can see on this example, by adding 5-lines to any standard ... This will generate a config file that will be...
