Verify model confidences manually before 3.0 release
We have run model regression tests in the integration phase of the architecture revamp. However, those tests do not check the distribution of confidences that the models output. This distribution can be generated by training and testing on a dataset and examining the plot in intent_histogram.png. The examination should check whether the confidence distributions of correct and wrong predictions look “approximately” the same when trained with 2.8.x and with 3.0.0 (they won’t be exactly the same because of some changes that come with 3.0).
The datasets and configs (found in the training-data repo) on which this should be tested are, at the least, the following (covers an English and a German dataset plus configs that are frequently used by customers):
Dataset: public/Sara
Configs:
- en/cvf_bert_diet_responset2t.yml
- en/cvf_diet_responset2t.yml
- en/cvf_embedding_responseb2b.yml
- en/cvf_bert_embedding_responseb2b.yml
Dataset: private/service_faq
Configs:
- en/cvf_spacy_diet_responset2t.yml
- en/cvf_diet_responset2t.yml
- en/cvf_embedding_responseb2b.yml
- en/cvf_spacy_embedding_responseb2b.yml
Definition of Done:
- Training and evaluation run for each of the above dataset and config combos using 2.8.x and a release candidate of 3.0.0 / the main branch of Rasa OSS (see the sketch after this list).
- Verified that intent_histogram.png looks “approximately similar” in every unique instance of the dataset and config combo.
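A minimal sketch of how one such run could be scripted, assuming the standard rasa train nlu / rasa test nlu CLI commands and placeholder paths for the dataset and configs from the training-data repo:

```python
"""Sketch: train and evaluate one dataset/config combo to produce intent_histogram.png.

Assumes the `rasa train nlu` and `rasa test nlu` CLI commands are available;
dataset and config paths below are placeholders for the training-data repo layout.
"""
import subprocess
from pathlib import Path

DATASET_NLU = Path("datasets/Sara/data/nlu.yml")        # placeholder path
CONFIGS = [
    Path("configs/en/cvf_bert_diet_responset2t.yml"),    # placeholder paths
    Path("configs/en/cvf_diet_responset2t.yml"),
]

for config in CONFIGS:
    out_dir = Path("results") / config.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    # Train an NLU-only model with this config.
    subprocess.run(
        ["rasa", "train", "nlu", "--config", str(config),
         "--nlu", str(DATASET_NLU), "--out", str(out_dir / "models")],
        check=True,
    )
    model = sorted((out_dir / "models").glob("*.tar.gz"))[-1]

    # Evaluate on the same data; this writes intent_histogram.png into --out.
    subprocess.run(
        ["rasa", "test", "nlu", "--model", str(model),
         "--nlu", str(DATASET_NLU), "--out", str(out_dir / "report")],
        check=True,
    )
```

Running the same script once with a 2.8.x installation and once with a 3.0.0 release candidate (e.g. in separate virtualenvs) yields two sets of intent_histogram.png plots that can be compared side by side.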
dakshvar22 commented:
Re-running on 2.8.x multiple times on a CPU will yield exactly the same model confidences, and hence the confidence histograms will be exactly the same. So we should really find out what is causing the difference in model confidences when the same config + dataset + CPU machine is used with 2.8.x and 3.0 installations of Rasa. I should emphasize that the differences don’t appear to be large (e.g., in the private/service_faq dataset, the distribution seems to be shifted for 10 to 11 training examples by a small amount), so it isn’t a high-priority investigation. Nevertheless, it should be done at some point to prevent an unknown regression from causing larger regressions in the future.
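One way to pin down the shift would be a per-example diff of predicted confidences between the two evaluation runs. A rough sketch, assuming both runs were evaluated with rasa test nlu and that each report directory contains an intent_errors.json with an intent_prediction.confidence field per example (file name and schema are assumptions here):

```python
"""Sketch: diff per-example intent confidences between a 2.8.x and a 3.0 run.

Assumes each report directory holds an intent_errors.json written by `rasa test nlu`,
with entries shaped like
{"text": ..., "intent": ..., "intent_prediction": {"name": ..., "confidence": ...}}.
"""
import json
from pathlib import Path


def load_confidences(report_dir: Path) -> dict:
    """Map example text -> predicted confidence for one evaluation run."""
    entries = json.loads((report_dir / "intent_errors.json").read_text())
    return {e["text"]: e["intent_prediction"]["confidence"] for e in entries}


old = load_confidences(Path("results_2.8.x/report"))   # placeholder paths
new = load_confidences(Path("results_3.0.0/report"))

# Only examples mispredicted in both runs appear in both files.
shared = sorted(set(old) & set(new))
deltas = {text: new[text] - old[text] for text in shared}

# Print the examples with the largest confidence shift first.
for text, delta in sorted(deltas.items(), key=lambda kv: -abs(kv[1]))[:15]:
    print(f"{delta:+.4f}  {text}")
```

Such a diff would surface the handful of examples whose confidences shifted between 2.8.x and 3.0, which matches the behaviour described above for the private/service_faq dataset.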
dakshvar22 commented:
@m-vdb @joejuzl There are a few open questions here (apologies for not replying to Kathrin’s question earlier), and I don’t think the issue should be closed. Do you want the conversation to happen somewhere else?