
Explanation of the RepeatFactorTrainingSampler.repeat_factors_from_category_frequency(`repeat_thresh`)

See original GitHub issue

📚 RepeatFactorTrainingSampler.repeat_factors_from_category_frequency

From the docs for repeat_factors_from_category_frequency:

repeat_thresh (float) – frequency threshold below which data is repeated. If the frequency is half of repeat_thresh, the image will be repeated twice.

In the source code I find these lines:

# 2. For each category c, compute the category-level repeat factor:
#    r(c) = max(1, sqrt(t / f(c)))

Now if f(c) = frequency = 0.5 and t = repeat_thresh = 1, then r(c) = sqrt(1 / 0.5) ≈ 1.41.

Can someone explain the docstring "If the frequency is half of repeat_thresh, the image will be repeated twice." to me? Based on the example above I would expect every image in c to be repeated 1.41 times, not 2.0 as the doc suggests.
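For what it's worth, the arithmetic in the question checks out. A minimal sketch of the quoted formula (repeat_factor is my own helper for illustration, not detectron2's API):

```python
import math

# Category-level repeat factor formula quoted from the source above:
#   r(c) = max(1, sqrt(t / f(c)))
def repeat_factor(repeat_thresh, freq):
    return max(1.0, math.sqrt(repeat_thresh / freq))

# f(c) = 0.5 and t = 1.0, i.e. the frequency is half of repeat_thresh:
print(repeat_factor(1.0, 0.5))  # ≈ 1.414, not 2.0
# The raw ratio t / f(c) is exactly 2.0; the docstring's "repeated twice"
# appears to describe that ratio before the sqrt is applied.
```

So the docstring seems to match the ratio t / f(c) rather than the actual repeat factor sqrt(t / f(c)).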

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
chawins commented, Apr 22, 2022

Thanks for the answer! In my case, I created a custom RepeatFactorTrainingSampler just without the sqrt().
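A variant "without the sqrt()" might look like the following sketch. This is illustrative code mimicking the shape of detectron2's repeat_factors_from_category_frequency, not its actual implementation; the function name and structure are my own, and dataset_dicts follows the detectron2 convention of dicts with an "annotations" list of {"category_id": ...} entries:

```python
import torch

# Hypothetical repeat-factor computation with the sqrt removed,
# so r(c) = max(1, t / f(c)) instead of max(1, sqrt(t / f(c))).
def repeat_factors_no_sqrt(dataset_dicts, repeat_thresh):
    # 1. Category frequency f(c): fraction of images containing category c.
    category_freq = {}
    for d in dataset_dicts:
        for cat_id in {ann["category_id"] for ann in d["annotations"]}:
            category_freq[cat_id] = category_freq.get(cat_id, 0) + 1
    num_images = len(dataset_dicts)
    for k in category_freq:
        category_freq[k] /= num_images
    # 2. Category-level repeat factor WITHOUT the sqrt.
    category_rep = {c: max(1.0, repeat_thresh / f) for c, f in category_freq.items()}
    # 3. Image-level repeat factor: max over the categories present in each image.
    return torch.tensor(
        [max(category_rep[ann["category_id"]] for ann in d["annotations"])
         for d in dataset_dicts],
        dtype=torch.float32,
    )
```

With this variant, an image whose rarest category has f(c) = repeat_thresh / 2 gets a repeat factor of exactly 2.0, matching the docstring literally.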

1 reaction
marijnl commented, Apr 22, 2022

I still don't really understand the docstring, but this could be due to my limited understanding of the sampling method. The implementation references the LVIS paper, Appendix B.2, which gives a more in-depth description of Mask R-CNN with data resampling.

If you want, you can make the RepeatFactorTrainingSampler balanced by calculating the repeat_factors yourself. Maybe that's of some help:

import torch
from detectron2.data.samplers.distributed_sampler import RepeatFactorTrainingSampler

num_iterations = 10000
num_images = 10000
class_distribution = [0.8, 0.2]
images_classes = (torch.rand(num_images) > class_distribution[0]) * 1
# don't use in production! this likely ignores images due to stochastic rounding. Repeat factors should be >= 1
image_level_repeat_factors = torch.tensor([1 - class_distribution[class_index] for class_index in images_classes])

sampler = RepeatFactorTrainingSampler(repeat_factors=image_level_repeat_factors)
sampled = []
for i, sample in enumerate(sampler):
    sampled.append(images_classes[sample])
    if len(sampled) == num_iterations:
        break

print("original distribution", torch.histc(images_classes * 1.0, 2) / num_images)  # tensor([0.7970, 0.2030])
print("sampled distribution", torch.histc(torch.tensor(sampled) * 1.0, 2) / num_iterations)  # tensor([0.4986, 0.5015])
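The comment in the snippet warns that repeat factors below 1 can silently drop images through stochastic rounding. One simple workaround (my own suggestion, not from the thread) is to rescale the factors so the smallest one is exactly 1 — relative proportions are preserved, and every image is kept at least once per epoch:

```python
import torch

# Rescale repeat factors so min == 1: no image is dropped by
# stochastic rounding, and the class balance stays the same.
raw_factors = torch.tensor([0.2, 0.2, 0.8, 0.8])
scaled = raw_factors / raw_factors.min()
print(scaled)  # tensor([1., 1., 4., 4.])
```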
