Tokenizer splitting
I have a question/suggestion regarding the tokenizer class and custom tokenizers. I think it would be great to be able to define a custom split besides spaces, and also to include newlines. Here are a couple of examples where this is an issue:
In [19]: tokens = nlp("I like green,blue and purple:)")
In [20]: for t in tokens:
   ....:     print('|'+t.string+'|', t.pos_)
   ....:
|I | PRON
|like | VERB
|green,blue | ADJ
|and | CONJ
|purple| ADJ
|:| PUNCT
|)| PUNCT
and
In [21]: tokens = nlp("I like:\ngreen\nblue\npurple\n:)")
In [22]: for t in tokens:
   ....:     print('|'+t.string+'|', t.pos_)
   ....:
|I | PRON
|like| VERB
|:| PUNCT
|
| ADV
|green| ADJ
|
| ADJ
|blue| ADJ
|
| ADJ
|purple| ADJ
|
| NOUN
|:)| PUNCT
Ideally we would retrain on a dataset that keeps the newlines instead of stripping them, and then label the newlines as such. Also, in “online writing” people often skip the space after punctuation. I am not sure if there are already any pre-tagged datasets where this is the case, but it would help a lot.
So the question is: what would be the easiest way to integrate these into the existing code? So far I’m doing a workaround where I insert a space after a comma if there isn’t one already, but it feels dirty, and I’m not sure if I should be replacing newlines with spaces, because a good amount of information is lost in forgoing that distinction.
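(A minimal sketch of that comma workaround, for reference; pad_commas is illustrative, not from the original post:)

    import re

    def pad_commas(text):
        # Insert a space after any comma that is directly followed by a
        # non-space character, so "green,blue" becomes "green, blue"
        # before the text reaches the tokenizer.
        return re.sub(r',(?=\S)', ', ', text)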
Thanks for the library by the way.
Top GitHub Comments
The next round of changes makes this much easier, but there are a number of ways you could achieve this currently.
The key method is tokenizer.tokens_from_list, which lets you specify the tokenization. The problem with this is that the pipeline is inherently coupled, because of the statistical models. (NLP pipelines always are. “Swappable components” is a lie, unless you retrain everything.)
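For instance, a minimal sketch assuming the spaCy 1.x API of the era, where tokens_from_list builds a Doc directly from a pre-split list of strings:

    # Sketch, assuming the spaCy 1.x API (spacy.en.English, tokens_from_list).
    from spacy.en import English

    nlp = English()

    # tokens_from_list bypasses nlp.tokenizer and builds a Doc
    # from whatever list of strings you supply.
    doc = nlp.tokenizer.tokens_from_list('I like green , blue and purple :)'.split())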
Changing things from how the models were trained degrades performance.
If you have the training data, you can tokenize however you like. spaCy can take raw text and compute a Levenshtein alignment to the tokenization in a treebank. Misalignments between its tokenizer and the gold-standard tokens are treated as ambiguous examples (multiple answers could be correct). See bin/parser/train.py for the details of this.
If you’re just changing a few things, it probably won’t hurt the model much. For now I’d try something like this:
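(The original snippet isn’t preserved in this archive; the following is a sketch of the idea it describes, assuming the spaCy 1.x API — split_words here is a hypothetical custom splitter, not part of spaCy:)

    # Sketch, assuming the spaCy 1.x API; split_words is hypothetical.
    import re

    from spacy.en import English

    nlp = English()

    def split_words(text):
        # Hypothetical splitter: keep runs of word characters together,
        # break off runs of punctuation, and keep newlines as their own
        # tokens, so "green,blue" and ":)" come apart and "\n" survives.
        return re.findall(r'\w+|\n|[^\w\s]+', text)

    def custom_pipeline(text):
        # Step in before the pipeline with our own list of strings,
        # then run the statistical components on the resulting Doc.
        doc = nlp.tokenizer.tokens_from_list(split_words(text))
        nlp.tagger(doc)
        nlp.parser(doc)
        return doc

    doc = custom_pipeline("I like green,blue and purple:)")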
This wraps the tokenizer with a little function that steps in and provides your list of strings.