Add parameter for Deformable Convolution offset group scalar value

🚀 Feature

Currently, the scalar used to calculate the number of deformable groups is hardcoded to 2. I would like a parameter to be added that allows this number to be set to any value, for compatibility with repositories such as EDVR, which use 3.

I have already added it myself and was going to submit a PR before reading that I should submit an issue first.

Motivation

I am currently trying to replace the MMDetection Deformable Convolution v2 with the Torchvision one in the EDVR repository. However, EDVR calculates the out_nc size for its offsets using this formula: self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1]. The usual formula, which the current Torchvision implementation expects, is self.deformable_groups * 2 * self.kernel_size[0] * self.kernel_size[1]. As you can see, EDVR uses a 3 in this calculation instead of a 2. I’m not entirely sure why, but it doesn’t work unless it uses 3.

This causes an issue with the Torchvision implementation, which requires that scalar to be 2 in order to calculate the number of offset groups (called deformable groups in the formula above).

EDVR Formula

Torchvision Formula
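The mismatch described above can be sketched with plain arithmetic. This is an illustration only: the helper names are hypothetical, the group count of 8 and the 3x3 kernel are made-up values, and the inference-by-division step is an assumption based on the description that the scalar must be 2 to calculate the number of offset groups.

```python
def expected_offset_channels(offset_groups: int, kernel_h: int, kernel_w: int) -> int:
    """Channels an offset tensor needs: 2 values (h and w) per kernel tap per group."""
    return offset_groups * 2 * kernel_h * kernel_w


def inferred_offset_groups(offset_channels: int, kernel_h: int, kernel_w: int) -> int:
    """Recover the group count from the channel dimension, assuming a scalar of 2."""
    return offset_channels // (2 * kernel_h * kernel_w)


# Illustrative values, not taken from the issue.
deformable_groups, kh, kw = 8, 3, 3

# EDVR-style output size (scalar 3) versus the size the scalar-2 formula expects.
edvr_channels = deformable_groups * 3 * kh * kw
torchvision_channels = expected_offset_channels(deformable_groups, kh, kw)

print(edvr_channels, torchvision_channels)                   # 216 144
print(inferred_offset_groups(torchvision_channels, kh, kw))  # 8
print(inferred_offset_groups(edvr_channels, kh, kw))         # 12
```

With the scalar-3 tensor passed directly as offsets, the inferred group count (12) no longer matches the intended one (8), which is the kind of inconsistency the issue reports.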

Pitch

I would like a parameter to be added that lets me change this value.

Alternatives

Another alternative could be to allow the number of offset groups to be passed in instead of being auto-calculated, as that is what the MMDetection version does.

Additional context

None.

Issue Analytics

State:
Created 3 years ago
Reactions: 1
Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions

NicolasHug commented, Jun 1, 2021

I agree with @fmassa that the 2 in torchvision’s implementation refers to the h and w dimensions.

From section 3.2 of the DeformConv v2 paper https://arxiv.org/abs/1811.11168:

The output is of 3K channels, where the first 2K channels correspond to the learned offsets ∆p_k, and the remaining K channels are further fed to a sigmoid layer to obtain the modulation scalars ∆m_k.

where K is self.kernel_size[0] * self.kernel_size[1]. So the difference between 2 and 3 seems to come from the modulation scalars.

I could be wrong, as I’m not very familiar with either the paper or the implementation, but I believe those modulation scalars actually correspond to the mask parameter.
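The split described in the quoted passage can be sketched as follows. This is a minimal illustration in plain Python, with a list of per-channel values standing in for a real tensor; the function name and the groups argument are hypothetical, and the channel ordering (offsets first, modulation values last) is taken from the quote, so the ordering used by any particular repository should be checked before relying on it.

```python
import math


def split_offsets_and_mask(channels, kernel_h, kernel_w, groups=1):
    """Split 3*groups*K channel values into 2*groups*K offsets and groups*K mask values."""
    k = kernel_h * kernel_w
    expected = 3 * groups * k
    if len(channels) != expected:
        raise ValueError(f"expected {expected} channels, got {len(channels)}")
    n_offset = 2 * groups * k
    # The first 2K channels (per group) are the learned offsets.
    offsets = list(channels[:n_offset])
    # The remaining K channels go through a sigmoid to become modulation scalars.
    mask = [1.0 / (1.0 + math.exp(-x)) for x in channels[n_offset:]]
    return offsets, mask


# 3x3 kernel, one group: 27 raw channel values (all zero for illustration).
raw = [0.0] * 27
offsets, mask = split_offsets_and_mask(raw, 3, 3)
print(len(offsets), len(mask), mask[0])  # 18 9 0.5
```

After such a split, the offsets have the channel count that the scalar-2 formula expects, and the sigmoid output would be what gets passed as the mask.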

I’ll close the issue, please feel free to re-open if there are still some doubts.

1 reaction

fmassa commented, Jan 28, 2021

@JoeyBallentine OK, so from my understanding this was a user error: the shapes of the offsets and masks were not correct, so we couldn’t properly infer the number of offset groups.

BTW, I would not recommend calling torch.ops.torchvision.deform_conv2d directly, as it is an implementation detail and can change at any time without notice. So it might be preferable to fix the code upstream rather than rely on internal implementations.