Verify model confidences manually before 3.0 release
We have run model regression tests in the integration phase of the architecture revamp. However, those tests do not check the distribution of confidences that the models output. This distribution can be generated by training and testing on a dataset and examining the plot in intent_histogram.png. The examination should check whether the confidence distributions of correct and wrong predictions look “approximately” the same when trained with 2.8.x and with 3.0.0 (they won’t be exactly the same because of some changes that come with 3.0).
The datasets and configs (found in the training-data repo) on which this should be tested are, at the least, the following (covers an English and a German dataset plus configs that are frequently used by customers):
Dataset: public/Sara
Configs:
- en/cvf_bert_diet_responset2t.yml
- en/cvf_diet_responset2t.yml
- en/cvf_embedding_responseb2b.yml
- en/cvf_bert_embedding_responseb2b.yml
Dataset: private/service_faq
Configs:
- en/cvf_spacy_diet_responset2t.yml
- en/cvf_diet_responset2t.yml
- en/cvf_embedding_responseb2b.yml
- en/cvf_spacy_embedding_responseb2b.yml
Definition of Done:
- Training and evaluation run for each of the above dataset and config combos using 2.8.x and a release candidate of 3.0.0 / the main branch of Rasa OSS (see the sketch after this list).
- Verified that intent_histogram.png looks “approximately similar” in every unique instance of the dataset and config combo.
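A minimal sketch of how one such run could be scripted, assuming the standard rasa train nlu / rasa test nlu CLI commands and placeholder paths for the dataset and configs from the training-data repo:

```python
"""Sketch: train and evaluate one dataset/config combo to produce intent_histogram.png.

Assumes the `rasa train nlu` and `rasa test nlu` CLI commands are available;
dataset and config paths below are placeholders for the training-data repo layout.
"""
import subprocess
from pathlib import Path

DATASET_NLU = Path("datasets/Sara/data/nlu.yml")        # placeholder path
CONFIGS = [
    Path("configs/en/cvf_bert_diet_responset2t.yml"),    # placeholder paths
    Path("configs/en/cvf_diet_responset2t.yml"),
]

for config in CONFIGS:
    out_dir = Path("results") / config.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    # Train an NLU-only model with this config.
    subprocess.run(
        ["rasa", "train", "nlu", "--config", str(config),
         "--nlu", str(DATASET_NLU), "--out", str(out_dir / "models")],
        check=True,
    )
    model = sorted((out_dir / "models").glob("*.tar.gz"))[-1]

    # Evaluate on the same data; this writes intent_histogram.png into --out.
    subprocess.run(
        ["rasa", "test", "nlu", "--model", str(model),
         "--nlu", str(DATASET_NLU), "--out", str(out_dir / "report")],
        check=True,
    )
```

Running the same script once with a 2.8.x installation and once with a 3.0.0 release candidate (e.g. in separate virtualenvs) yields two sets of intent_histogram.png plots that can be compared side by side.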
dakshvar22 commented:
Re-running on 2.8.x multiple times on a CPU will yield exactly the same model confidences, and hence the confidence histograms will be exactly the same. So we should really find out what is causing the difference in model confidences when the same config + dataset + CPU machine is used with 2.8.x and 3.0 installations of Rasa. I should emphasize that the differences don’t appear to be large (e.g., in the private/service_faq dataset, the distribution seems to be shifted for 10 to 11 training examples by a small amount), so it isn’t a high-priority investigation. Nevertheless, it should be done at some point to prevent an unknown regression from causing larger regressions in the future.
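One way to pin down the shift would be a per-example diff of predicted confidences between the two evaluation runs. A rough sketch, assuming both runs were evaluated with rasa test nlu and that each report directory contains an intent_errors.json with an intent_prediction.confidence field per example (file name and schema are assumptions here):

```python
"""Sketch: diff per-example intent confidences between a 2.8.x and a 3.0 run.

Assumes each report directory holds an intent_errors.json written by `rasa test nlu`,
with entries shaped like
{"text": ..., "intent": ..., "intent_prediction": {"name": ..., "confidence": ...}}.
"""
import json
from pathlib import Path


def load_confidences(report_dir: Path) -> dict:
    """Map example text -> predicted confidence for one evaluation run."""
    entries = json.loads((report_dir / "intent_errors.json").read_text())
    return {e["text"]: e["intent_prediction"]["confidence"] for e in entries}


old = load_confidences(Path("results_2.8.x/report"))   # placeholder paths
new = load_confidences(Path("results_3.0.0/report"))

# Only examples mispredicted in both runs appear in both files.
shared = sorted(set(old) & set(new))
deltas = {text: new[text] - old[text] for text in shared}

# Print the examples with the largest confidence shift first.
for text, delta in sorted(deltas.items(), key=lambda kv: -abs(kv[1]))[:15]:
    print(f"{delta:+.4f}  {text}")
```

Such a diff would surface the handful of examples whose confidences shifted between 2.8.x and 3.0, which matches the behaviour described above for the private/service_faq dataset.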
dakshvar22 commented:
@m-vdb @joejuzl There are a few open questions here (apologies for not replying to Kathrin’s question earlier), and I don’t think the issue should be closed. Do you want the conversation to happen somewhere else?