Inconsistencies with scoring and inference
There are a few inconsistencies related to inference that have come up in evaluating scoring (#538). However, these are corner cases that could arise in other settings, as well. The problem is:
- Length normalization is only applied to completed hypotheses on the beam. However, if the beam hits the maximum sequence length before generating </s>, the hypothesis is still returned, with its unnormalized score.
- This is a general problem for scoring: when retrieving a score from the Sockeye inference CLI (via --output-type translation_with_score), there is no way to know whether </s> was generated and therefore whether length normalization was applied (since </s> is stripped before returning to the user).
- Sockeye's scores are therefore underspecified.
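To make the mismatch concrete, here is a minimal sketch (illustrative Python only, not Sockeye's actual code; the helper name length_normalized and the example numbers are hypothetical) of how a finished and a truncated hypothesis end up with scores on different scales:

```python
# Illustrative example only (not Sockeye code): a finished and a truncated
# hypothesis end up with scores on different scales, because length
# normalization is applied to one but not the other.

def length_normalized(neg_logprob: float, length: int, alpha: float = 1.0) -> float:
    """Divide the accumulated negative log-probability by length ** alpha."""
    return neg_logprob / (length ** alpha)

# Hypothesis A generated </s> within the length budget: its score is normalized.
score_finished = length_normalized(neg_logprob=12.0, length=10)  # 1.2

# Hypothesis B hit the maximum output length without </s>: its accumulated,
# unnormalized negative log-probability is returned as-is.
score_truncated = 14.0

# Both values would be printed to the user, but nothing indicates which
# convention produced which number.
print(score_finished, score_truncated)
```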
This is a problem for evaluating scoring. Scoring takes raw text and will therefore always append </s>, just as is done in training. I am running into this problem because sometimes the outputs haven't actually finished but have just hit the maximum output length, and their scores are therefore unnormalized. This could be a problem more generally.
I am not sure of the correct solution, but I propose this:
- In inference, --maximum-output-length should refer to the hypothesis excluding </s>. The reasoning is that this flag is a user-facing feature, and users do not see the </s> since it is stripped off.
- Beam search stops at the max output length. If there are unfinished hypotheses, however, the decoder should take one more step and force the selection of </s> for all unfinished hypotheses. It will then apply length normalization, too.
- Length normalization should be computed including </s>, since it is generated by the decoder.
This way, the user can be guaranteed that every hypothesis actually finished, and that all obtained scores are comparable.
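A sketch of what the proposed behaviour could look like, in illustrative Python rather than Sockeye's actual beam-search code; the function finalize_hypothesis and its argument layout are hypothetical:

```python
# A minimal sketch of the proposal above (illustrative only, not Sockeye code).

def finalize_hypothesis(tokens, neg_logprob, eos_neg_logprob, alpha=1.0, eos="</s>"):
    """Force-finish a hypothesis that hit the maximum output length.

    tokens:          tokens generated so far, without </s>
    neg_logprob:     accumulated -log P over those tokens
    eos_neg_logprob: -log P(</s>) from one extra, forced decoder step
    alpha:           length-normalization exponent
    """
    finished = list(tokens)
    if not finished or finished[-1] != eos:
        # Take one more step and force </s>, charging its cost to the score.
        finished.append(eos)
        neg_logprob += eos_neg_logprob
    # Normalize over the full length including </s>, so every returned
    # hypothesis is finished and all scores are computed the same way.
    normalized_score = neg_logprob / (len(finished) ** alpha)
    return finished, normalized_score
```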
Issue Analytics
- Created 5 years ago
- Comments: 9 (9 by maintainers)

Yeah, that's a good point. So I guess we could always take the probability of </s> as the last token for truncated hypotheses. This is effectively what we do anyway: we force the model to stop.

Sockeye 2 addresses this issue, see #719.
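As an illustration of that suggestion (hypothetical Python, not the Sockeye 2 implementation; the helper score_forced_eos and its arguments are invented for this sketch), the score of a truncated hypothesis could be completed by reading -log P(</s>) off one more decoder step and normalizing over the extended length:

```python
import numpy as np

# Hypothetical sketch of the idea above: for a hypothesis truncated at the
# maximum output length, take -log P(</s>) from the distribution of one
# additional decoder step and fold it into the score before normalizing,
# as if the model had been forced to stop.

def score_forced_eos(step_logits: np.ndarray, eos_id: int,
                     neg_logprob_so_far: float, length_so_far: int,
                     alpha: float = 1.0) -> float:
    log_probs = step_logits - np.logaddexp.reduce(step_logits)  # log-softmax
    neg_logprob_eos = -log_probs[eos_id]
    total = neg_logprob_so_far + neg_logprob_eos
    return total / ((length_so_far + 1) ** alpha)  # +1 for the forced </s>
```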