
Explanation of the RepeatFactorTrainingSampler.repeat_factors_from_category_frequency(`repeat_thresh`)

See original GitHub issue

📚 RepeatFactorTrainingSampler.repeat_factors_from_category_frequency

From the docs for repeat_factors_from_category_frequency:

repeat_thresh (float) – frequency threshold below which data is repeated. If the frequency is half of repeat_thresh, the image will be repeated twice.

In the source code I find these lines:

# 2. For each category c, compute the category-level repeat factor:
#    r(c) = max(1, sqrt(t / f(c)))

Now if f(c) = frequency = 0.5 and t = repeat_thresh = 1, then r(c) = sqrt(1 / 0.5) ≈ 1.41.

Can someone explain the docstring "If the frequency is half of repeat_thresh, the image will be repeated twice." to me? Based on the example above I would expect every image in c to be repeated 1.41 times, not 2.0 as the doc suggests.
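For what it's worth, the arithmetic in the question checks out. A minimal sketch of the quoted formula (repeat_factor is my own helper for illustration, not detectron2's API):

```python
import math

# Category-level repeat factor formula quoted from the source above:
#   r(c) = max(1, sqrt(t / f(c)))
def repeat_factor(repeat_thresh, freq):
    return max(1.0, math.sqrt(repeat_thresh / freq))

# f(c) = 0.5 and t = 1.0, i.e. the frequency is half of repeat_thresh:
print(repeat_factor(1.0, 0.5))  # ≈ 1.414, not 2.0
# The raw ratio t / f(c) is exactly 2.0; the docstring's "repeated twice"
# appears to describe that ratio before the sqrt is applied.
```

So the docstring seems to match the ratio t / f(c) rather than the actual repeat factor sqrt(t / f(c)).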

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
chawins commented, Apr 22, 2022

Thanks for the answer! In my case, I created a custom RepeatFactorTrainingSampler just without the sqrt().
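A variant "without the sqrt()" might look like the following sketch. This is illustrative code mimicking the shape of detectron2's repeat_factors_from_category_frequency, not its actual implementation; the function name and structure are my own, and dataset_dicts follows the detectron2 convention of dicts with an "annotations" list of {"category_id": ...} entries:

```python
import torch

# Hypothetical repeat-factor computation with the sqrt removed,
# so r(c) = max(1, t / f(c)) instead of max(1, sqrt(t / f(c))).
def repeat_factors_no_sqrt(dataset_dicts, repeat_thresh):
    # 1. Category frequency f(c): fraction of images containing category c.
    category_freq = {}
    for d in dataset_dicts:
        for cat_id in {ann["category_id"] for ann in d["annotations"]}:
            category_freq[cat_id] = category_freq.get(cat_id, 0) + 1
    num_images = len(dataset_dicts)
    for k in category_freq:
        category_freq[k] /= num_images
    # 2. Category-level repeat factor WITHOUT the sqrt.
    category_rep = {c: max(1.0, repeat_thresh / f) for c, f in category_freq.items()}
    # 3. Image-level repeat factor: max over the categories present in each image.
    return torch.tensor(
        [max(category_rep[ann["category_id"]] for ann in d["annotations"])
         for d in dataset_dicts],
        dtype=torch.float32,
    )
```

With this variant, an image whose rarest category has f(c) = repeat_thresh / 2 gets a repeat factor of exactly 2.0, matching the docstring literally.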

1 reaction
marijnl commented, Apr 22, 2022

I still don't really understand the docstring, but this could be due to my limited understanding of the sampling method. The implementation references the LVIS paper, Appendix B.2, which gives a more in-depth description of Mask R-CNN with data resampling.

If you want, you can make the RepeatFactorTrainingSampler balanced by calculating the repeat_factors yourself. Maybe that's of some help:

import torch
from detectron2.data.samplers.distributed_sampler import RepeatFactorTrainingSampler

num_iterations = 10000
num_images = 10000
class_distribution = [0.8, 0.2]
images_classes = (torch.rand(num_images) > class_distribution[0]) * 1
# don't use in production! this likely ignores images due to stochastic rounding. Repeat factors should be >= 1
image_level_repeat_factors = torch.tensor([1 - class_distribution[class_index] for class_index in images_classes])

sampler = RepeatFactorTrainingSampler(repeat_factors=image_level_repeat_factors)
sampled = []
for i, sample in enumerate(sampler):
    sampled.append(images_classes[sample])
    if len(sampled) == num_iterations:
        break

print("original distribution", torch.histc(images_classes * 1.0, 2) / num_images)  # tensor([0.7970, 0.2030])
print("sampled distribution", torch.histc(torch.tensor(sampled) * 1.0, 2) / num_iterations)  # tensor([0.4986, 0.5015])
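The comment in the snippet warns that repeat factors below 1 can silently drop images through stochastic rounding. One simple workaround (my own suggestion, not from the thread) is to rescale the factors so the smallest one is exactly 1 — relative proportions are preserved, and every image is kept at least once per epoch:

```python
import torch

# Rescale repeat factors so min == 1: no image is dropped by
# stochastic rounding, and the class balance stays the same.
raw_factors = torch.tensor([0.2, 0.2, 0.8, 0.8])
scaled = raw_factors / raw_factors.min()
print(scaled)  # tensor([1., 1., 4., 4.])
```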
