Flexible handling of non-existent message_attribute for given training sample
Description of Problem:
Currently, the SpacyNLP component provides the following:
provides = ["spacy_doc", "spacy_nlp", "intent_spacy_doc", "response_spacy_doc"]
This makes it necessary to handle non-existent (None-valued) attributes for a given training sample. Currently, this is done by converting None values to empty strings, since spaCy can't create its Doc objects from None values.
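The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual SpacyNLP code: training samples are modeled as plain dicts, and `texts_for_attribute` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def texts_for_attribute(
    samples: List[Dict[str, Optional[str]]], attribute: str
) -> List[str]:
    """Collect one text per sample, replacing missing or None values with ""."""
    # Assumption: each training sample is a plain dict; a missing key and an
    # explicit None are both treated as "attribute not set".
    return [sample.get(attribute) or "" for sample in samples]


samples = [
    {"text": "hello", "response": None},
    {"text": "bye"},
    {"text": "thanks", "response": "welcome"},
]
response_texts = texts_for_attribute(samples, "response")
print(response_texts)  # ['', '', 'welcome']
```

The output keeps one entry per sample, so the order is preserved, but samples without the attribute can no longer be told apart from samples whose attribute is genuinely an empty string.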
Simply filtering out those training samples would break their order and cause follow-on problems, so we need to find a more flexible solution.
Overview of the Solution: I am going to think about a robust solution and update this issue accordingly.
Examples:
If there are no samples for the response attribute, calling pipe currently results in a list of empty Doc objects:
docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]
This happens because an empty string is valid for Doc objects, but it is in fact a problem for libraries such as spacy-pytorch-transformers or other custom components that can't handle these cases properly.
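One possible direction, sketched here under assumptions rather than as the fix that was eventually adopted, is to run the pipeline only on the non-empty texts and put a None placeholder back at every skipped position, so the result stays aligned with the training samples. `pipe_skipping_empty` is a hypothetical helper, and `pipe` stands for any callable that maps an iterable of texts to an iterable of results. The demo uses a stand-in that uppercases text, so it runs without spaCy.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def pipe_skipping_empty(
    texts: List[Optional[str]],
    pipe: Callable[[Iterable[str]], Iterable[T]],
) -> List[Optional[T]]:
    """Apply `pipe` to non-empty texts only, keeping the original positions."""
    # Remember the positions of the texts that actually have content.
    kept = [index for index, text in enumerate(texts) if text]
    results: List[Optional[T]] = [None] * len(texts)
    # Process only the non-empty texts, then write each result back to
    # the index it came from.
    processed = pipe(texts[index] for index in kept)
    for index, result in zip(kept, processed):
        results[index] = result
    return results


def fake_pipe(batch: Iterable[str]) -> Iterable[str]:
    # Stand-in for a real pipeline (assumption for this demo only).
    return (text.upper() for text in batch)


aligned = pipe_skipping_empty(["hi", None, "", "bye"], fake_pipe)
print(aligned)  # ['HI', None, None, 'BYE']
```

With spaCy, the callable could be something like `lambda batch: self.nlp.pipe(batch, batch_size=50)`. Downstream components would then have to check for None entries instead of receiving empty Doc objects.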
The corresponding forum entry for this conversation can be found here. @dakshvar22
Issue Analytics
- Created 4 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Hi @dakshvar22,
All right, I agree with you and I am going to start working on this this afternoon. I will get back to you with a code proposal as soon as it is ready.
Thanks for your help!
Regards, Julian
Closing as this is in a minor release around 1.3.x.