Questions on DAM creation
Hi! Thank you for releasing such a wonderful work. How the DAM is generated was a bit unclear to me when reading the paper. Assume there are N tokens in total from the encoder (considering one feature level, so N = H x W) and M object queries:
- Regarding “In the case of the dense attention, DAM can be easily obtained by summing up attention maps from every decoder layer”: do you mean the cross-attention map of shape N x M?
- Regarding “produces a single map of the same size as the feature map from the backbone”: how is this achieved? Could you walk through the calculation and the shapes of the tensors?
- Why not directly use the DAM to select the top-k tokens, and why have a separate scoring network?
Thanks! I look forward to your reply.
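For the dense-attention case, one plausible reading of the quoted sentence can be sketched as follows. All shapes and the `cross_attn` tensor here are illustrative assumptions, not code or data from the paper or repo:

```python
import torch

# Illustrative shapes: L decoder layers, M object queries, N = H * W encoder
# tokens (one feature level). `cross_attn` stands in for the softmaxed
# cross-attention maps; the random values are an assumption for the sketch.
H, W, M, L = 4, 5, 3, 6
N = H * W
cross_attn = torch.rand(L, M, N).softmax(dim=-1)  # each query row sums to 1

# Dense-attention DAM: sum over decoder layers and queries, leaving one
# scalar per encoder token, then reshape to the backbone feature-map size.
dam = cross_attn.sum(dim=(0, 1)).reshape(H, W)
```

Since every query's attention row sums to 1, the resulting map has total mass L * M, with one scalar per encoder token, i.e. "a single map of the same size as the feature map from the backbone."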
Issue Analytics
- Created: 2 years ago
- Comments: 6
Top GitHub Comments
Yeah, I get this now. The goal is to obtain the attention_weights of reference points for each query, instead of the attention_weights * value of reference points.

@JWoong-Shin Thanks for your quick response.
Yes, for some query q, it will obtain the value by A * G((x, y), (x1, y1)) * v(x1, y1) + ..., where A is attention_weights, G is a bilinear interpolation kernel, and v is the value at the point (x1, y1). In other words, for the query q, it references (x1, y1) with weight A * G((x, y), (x1, y1)). (I think we have a different understanding here.)
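The bilinear kernel G in the formula above can be sketched numerically; the helper name below is hypothetical, not from the repo:

```python
# A sketch of the bilinear kernel G: a fractional sampling location (x, y)
# spreads weight over the 4 surrounding integer grid points, and the
# weights always sum to 1 (assumes non-negative coordinates).
def bilinear_weights(x, y):
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    return {
        (x0,     y0):     (1 - fx) * (1 - fy),
        (x0 + 1, y0):     fx * (1 - fy),
        (x0,     y0 + 1): (1 - fx) * fy,
        (x0 + 1, y0 + 1): fx * fy,
    }

# For a sampling point at (2.25, 3.5), each neighboring grid point (x1, y1)
# receives weight G, which then scales A * v(x1, y1) in the weighted sum.
w = bilinear_weights(2.25, 3.5)
```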
Therefore, from the perspective of the grid point (x1, y1), the DAM value is accumulated by A * G((x, y), (x1, y1)) for the query q, and summing over every query creates the DAM. (The sum is not conducted inside the attn_map_to_flat_grid method. The method obtains the interpolated attention weights in the grid shape, and then the DAM is obtained by summing over decoder queries and decoder layers: https://github.com/kakaobrain/sparse-detr/blob/1ea7a062ca6d1dd57768d65b14352cfd1a65ab52/models/deformable_detr.py#L408-L409)