Tokenizer splitting
I have a question/suggestion regarding the tokenizer class and custom tokenizers. I think it would be great to be able to define a custom split besides spaces, and also to include newlines. Here are a couple of examples where this is an issue:
In [19]: tokens = nlp("I like green,blue and purple:)")
In [20]: for t in tokens:
   ....:     print('|'+t.string+'|', t.pos_)
   ....:
|I | PRON
|like | VERB
|green,blue | ADJ
|and | CONJ
|purple| ADJ
|:| PUNCT
|)| PUNCT
and
In [21]: tokens = nlp("I like:\ngreen\nblue\npurple\n:)")
In [22]: for t in tokens:
   ....:     print('|'+t.string+'|', t.pos_)
   ....:
|I | PRON
|like| VERB
|:| PUNCT
|
| ADV
|green| ADJ
|
| ADJ
|blue| ADJ
|
| ADJ
|purple| ADJ
|
| NOUN
|:)| PUNCT
Ideally we would retrain on a dataset that keeps the newlines instead of stripping them, and then label the newlines as such. Also, in “online writing” people often skip the space after punctuation. I am not sure if there are already any pre-tagged datasets where this is the case, but it would help a lot.
So the question is: what would be the easiest way to integrate these into the existing code? So far I’m doing a workaround where I insert a space after a comma if there isn’t one already, but it feels dirty, and I’m not sure if I should be replacing newlines with spaces, because a good amount of information is lost in forgoing that distinction.
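(A minimal sketch of that comma workaround, for reference; pad_commas is illustrative, not from the original post:)

    import re

    def pad_commas(text):
        # Insert a space after any comma that is directly followed by a
        # non-space character, so "green,blue" becomes "green, blue"
        # before the text reaches the tokenizer.
        return re.sub(r',(?=\S)', ', ', text)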
Thanks for the library by the way.
Top GitHub Comments
The next round of changes makes this much easier, but there are a number of ways you could achieve this currently.
The key method is tokenizer.tokens_from_list, which lets you specify the tokenization. The problem with this is that the pipeline is inherently coupled, because of the statistical models. (NLP pipelines always are. “Swappable components” is a lie, unless you retrain everything.)
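For instance, a minimal sketch assuming the spaCy 1.x API of the era, where tokens_from_list builds a Doc directly from a pre-split list of strings:

    # Sketch, assuming the spaCy 1.x API (spacy.en.English, tokens_from_list).
    from spacy.en import English

    nlp = English()

    # tokens_from_list bypasses nlp.tokenizer and builds a Doc
    # from whatever list of strings you supply.
    doc = nlp.tokenizer.tokens_from_list('I like green , blue and purple :)'.split())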
Changing things from how the models were trained degrades performance.
If you have the training data, you can tokenize however you like. spaCy can take raw text and compute a Levenshtein alignment to the tokenization in a treebank. Misalignments between its tokenizer and the gold-standard tokens are treated as ambiguous examples (multiple answers could be correct). See bin/parser/train.py for the details of this.
If you’re just changing a few things, it probably won’t hurt the model much. For now I’d try something like this:
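(The original snippet isn’t preserved in this archive; the following is a sketch of the idea it describes, assuming the spaCy 1.x API — split_words here is a hypothetical custom splitter, not part of spaCy:)

    # Sketch, assuming the spaCy 1.x API; split_words is hypothetical.
    import re

    from spacy.en import English

    nlp = English()

    def split_words(text):
        # Hypothetical splitter: keep runs of word characters together,
        # break off runs of punctuation, and keep newlines as their own
        # tokens, so "green,blue" and ":)" come apart and "\n" survives.
        return re.findall(r'\w+|\n|[^\w\s]+', text)

    def custom_pipeline(text):
        # Step in before the pipeline with our own list of strings,
        # then run the statistical components on the resulting Doc.
        doc = nlp.tokenizer.tokens_from_list(split_words(text))
        nlp.tagger(doc)
        nlp.parser(doc)
        return doc

    doc = custom_pipeline("I like green,blue and purple:)")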
This wraps the tokenizer with a little function that steps in and provides your list of strings.