Inconsistencies with scoring and inference
There are a few inconsistencies related to inference that have come up in evaluating scoring (#538). However, these are corner cases that could arise in other settings, as well. The problem is:
- Length normalization is only applied to completed hypotheses on the beam. However, if the beam hits the maximum sequence length before generating </s>, the hypothesis is still returned, with its unnormalized score.
- This is a general problem for scoring: when retrieving a score from the Sockeye inference CLI (via --output-type translation_with_score), there is no way to know whether </s> was generated and therefore whether length normalization was applied (since </s> is stripped before returning to the user).
- Sockeye's scores are therefore underspecified.
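To make the mismatch concrete, here is a minimal sketch (illustrative Python only, not Sockeye's actual code; the helper name length_normalized and the example numbers are hypothetical) of how a finished and a truncated hypothesis end up with scores on different scales:

```python
# Illustrative example only (not Sockeye code): a finished and a truncated
# hypothesis end up with scores on different scales, because length
# normalization is applied to one but not the other.

def length_normalized(neg_logprob: float, length: int, alpha: float = 1.0) -> float:
    """Divide the accumulated negative log-probability by length ** alpha."""
    return neg_logprob / (length ** alpha)

# Hypothesis A generated </s> within the length budget: its score is normalized.
score_finished = length_normalized(neg_logprob=12.0, length=10)  # 1.2

# Hypothesis B hit the maximum output length without </s>: its accumulated,
# unnormalized negative log-probability is returned as-is.
score_truncated = 14.0

# Both values would be printed to the user, but nothing indicates which
# convention produced which number.
print(score_finished, score_truncated)
```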
This is a problem for evaluating scoring. Scoring takes raw text and will therefore always append </s>, just as is done in training. I am running into this problem because sometimes the outputs haven't actually finished but have just hit the maximum output length, and their scores are therefore unnormalized. This could be a problem more generally.
I am not sure of the correct solution, but I propose this:
- In inference, --maximum-output-length should refer to the hypothesis excluding </s>. The reasoning is that this flag is a user-facing feature, and users do not see the </s> since it is stripped off.
- Beam search stops at the max output length. If there are unfinished hypotheses, however, the decoder should take one more step and force the selection of </s> for all unfinished hypotheses. It will then apply length normalization, too.
- Length normalization should be computed including </s>, since it is generated by the decoder.
This way, the user can be guaranteed that every hypothesis actually finished, and that all obtained scores are comparable.
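A sketch of what the proposed behaviour could look like, in illustrative Python rather than Sockeye's actual beam-search code; the function finalize_hypothesis and its argument layout are hypothetical:

```python
# A minimal sketch of the proposal above (illustrative only, not Sockeye code).

def finalize_hypothesis(tokens, neg_logprob, eos_neg_logprob, alpha=1.0, eos="</s>"):
    """Force-finish a hypothesis that hit the maximum output length.

    tokens:          tokens generated so far, without </s>
    neg_logprob:     accumulated -log P over those tokens
    eos_neg_logprob: -log P(</s>) from one extra, forced decoder step
    alpha:           length-normalization exponent
    """
    finished = list(tokens)
    if not finished or finished[-1] != eos:
        # Take one more step and force </s>, charging its cost to the score.
        finished.append(eos)
        neg_logprob += eos_neg_logprob
    # Normalize over the full length including </s>, so every returned
    # hypothesis is finished and all scores are computed the same way.
    normalized_score = neg_logprob / (len(finished) ** alpha)
    return finished, normalized_score
```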
Issue Analytics
- Created 5 years ago
- Comments: 9 (9 by maintainers)

Yeah, that's a good point. So I guess we could always take the probability of </s> as the last token for truncated hypotheses. This is effectively what we do anyway: we force the model to stop.

Sockeye 2 addresses this issue, see #719.
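As an illustration of that suggestion (hypothetical Python, not the Sockeye 2 implementation; the helper score_forced_eos and its arguments are invented for this sketch), the score of a truncated hypothesis could be completed by reading -log P(</s>) off one more decoder step and normalizing over the extended length:

```python
import numpy as np

# Hypothetical sketch of the idea above: for a hypothesis truncated at the
# maximum output length, take -log P(</s>) from the distribution of one
# additional decoder step and fold it into the score before normalizing,
# as if the model had been forced to stop.

def score_forced_eos(step_logits: np.ndarray, eos_id: int,
                     neg_logprob_so_far: float, length_so_far: int,
                     alpha: float = 1.0) -> float:
    log_probs = step_logits - np.logaddexp.reduce(step_logits)  # log-softmax
    neg_logprob_eos = -log_probs[eos_id]
    total = neg_logprob_so_far + neg_logprob_eos
    return total / ((length_so_far + 1) ** alpha)  # +1 for the forced </s>
```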