
Add provenance to examples in NLU training data


Description of Problem: Currently, the examples for NLU training data are all added under a single `examples` key, like this:

```yaml
nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
    - Hello, how's it going?
```

This does not indicate which training examples were added by the builder of the assistant and which were added through annotations in the NLU Inbox and therefore come from real conversations. In short, no provenance of examples is recorded in the training data.

If this distinction were made available in the NLU training file, a lot of tooling could be built to make CDD more efficient. For example, a flag in `rasa train nlu` that lets the developer specify the ratio of examples from real vs. non-real conversations to be picked up for downstream model training. Another example: bot builders could then simply eyeball their training data and see how actual user messages differ from the messages they added.
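As a rough illustration of the first idea, the flag could back onto a sampler like the sketch below. The `sample_by_ratio` helper, its argument names, and the example messages are all hypothetical illustrations, not an existing Rasa API:

```python
import random

def sample_by_ratio(builder_examples, user_examples, real_fraction, total, seed=0):
    """Pick up to `total` training examples, aiming for `real_fraction` of them
    to come from real user conversations (hypothetical helper, not part of Rasa)."""
    rng = random.Random(seed)
    # Take as many real examples as the ratio asks for, capped by availability.
    n_real = min(len(user_examples), round(total * real_fraction))
    n_builder = min(len(builder_examples), total - n_real)
    return rng.sample(user_examples, n_real) + rng.sample(builder_examples, n_builder)

builder = ["hey", "hi", "whats up"]
users = ["Hello, how's it going?", "yo, anyone there?"]
picked = sample_by_ratio(builder, users, real_fraction=0.5, total=4)
print(picked)  # two user messages followed by two builder messages
```

Without recorded provenance, no such knob is possible, because training cannot tell the two pools apart.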

Overview of the Solution: The distinction between messages from real conversations and invented examples is already available in Rasa X inside the NLU Inbox. It would be beneficial if this distinction were made clear in the NLU training file as well, with something like:

```yaml
nlu:
- intent: greet
  examples_from_builder: |
    - hey
    - hi
    - whats up
  examples_from_users: |
    - Hello, how's it going?
```

Slack thread of an ongoing discussion on this

Summarizing thoughts from different people below:

@TyDunn raised the concern that this could further motivate people to add more hand-written examples: if we make the distinction clearer and create a dedicated section for builder-written examples, developers may feel the need to fill up that section even more, which can discourage them from adding data from real conversations.

@amn41 suggested that if this distinction is made clearer, we can report the "health" of a developer's training data during training, based on the number of examples coming from real vs. non-real conversations: the more examples come from real conversations, the healthier the training data.
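A minimal sketch of that "health" signal, assuming examples are already labeled by source; the function name and the score definition are illustrative, not an actual Rasa feature:

```python
def training_data_health(n_from_users: int, n_from_builder: int) -> float:
    """Fraction of training examples that come from real conversations (0.0-1.0).
    Illustrative metric only; not an existing Rasa feature."""
    total = n_from_users + n_from_builder
    return n_from_users / total if total else 0.0

score = training_data_health(n_from_users=150, n_from_builder=850)
print(f"training data health: {score:.0%}")  # prints "training data health: 15%"
```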

@philippwolf1 mentioned that this has come up a couple of times with prospects where they want to understand which auto-generated data is simply not representative of the data real conversations have.

My personal take goes a bit further:

  1. This also lets us augment the existing training data with more examples (for example, through paraphrasing) kept in a separate section, so that downstream training can intelligently pick the ratio of data points drawn from each section to make the model more robust.

  2. It also helps users see the benefit of CDD sooner than they can right now. If they start building their assistant with, say, 2000 training examples, it takes some time for the CDD process to match that number of examples in the training data. And even once you have 2000 more examples from real conversations, the downstream models still suffer from the non-real messages, so it is better to have a way to drop them during model training than to use all of them combined. This distinction would enable exactly that. See the Slack thread for a concrete real-world example of this.

  3. If users still feel compelled to write training examples themselves, we may need to do more to demonstrate the benefits of CDD and motivate them further.

Since we have adopted YAML as the training data format, we can simply add more attributes to an object without breaking the currently adopted format, e.g.:

```yaml
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
- intent: greet
  example_source: users
  examples: |
    - Hi bot!
    - Hello sir
    - Top of the morning
- intent: greet
  example_source: augmentation
  examples: |
    - Hola!
```
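As a sketch of the tooling this would enable, here is how examples could be grouped by provenance once such a file is parsed. The list below hand-codes what a YAML loader would return for the proposed format; `example_source` is the key proposed above, not part of any shipped Rasa schema, and the fallback to `"builder"` for unlabeled blocks is an assumption:

```python
from collections import defaultdict

# Hand-coded equivalent of what a YAML loader would return for the
# proposed training file (example_source is a proposed key, not Rasa's schema).
nlu = [
    {"intent": "greet",
     "examples": "- hey\n- hi\n- whats up\n"},
    {"intent": "greet", "example_source": "users",
     "examples": "- Hi bot!\n- Hello sir\n"},
    {"intent": "greet", "example_source": "augmentation",
     "examples": "- Hola!\n"},
]

by_source = defaultdict(list)
for block in nlu:
    # Blocks without a label are assumed to be builder-written.
    source = block.get("example_source", "builder")
    for line in block["examples"].splitlines():
        if line.strip():
            by_source[source].append(line.lstrip("- ").strip())

print(dict(by_source))
# {'builder': ['hey', 'hi', 'whats up'], 'users': ['Hi bot!', 'Hello sir'], 'augmentation': ['Hola!']}
```

A downstream sampler or health check then only needs `by_source`, not the raw file.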

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
TyDunn commented, Mar 17, 2022


@dakshvar22 We did not. Let’s put it into production inbox / 2.2 to start the discussion again

1 reaction
dakshvar22 commented, Mar 17, 2022


Thanks @degiz. The idea is to use this "source of data" to influence training going forward, in future features aligned with the concept of CDD. I see this as a prerequisite for those.

