
Alternative to tokenizer.tokens_from_list in parser init

See original GitHub issue

Hello again 😃

In my spacy_conll library I implement a utility function to easily initialise a spaCy-based parser (spacy, spacy-udpipe, spacy-stanza). It returns the parser (a Language object). Being able to simply swap the tokenizer for tokens_from_list is incredibly useful because it lets me write a generic initialisation like the one below, where both stanza and spacy can be set to accept pre-tokenized input:

import spacy

# `parser`, `model_or_lang` and `parser_opts` are arguments of the utility function
is_tokenized = True
if parser == "spacy":
    nlp = spacy.load(model_or_lang, **parser_opts)
    if is_tokenized:
        # Swap in the list-based tokenizer so nlp() accepts a list of tokens
        nlp.tokenizer = nlp.tokenizer.tokens_from_list
elif parser == "stanza":
    import stanza
    from spacy_stanza import StanzaLanguage

    # stanza handles pre-tokenized input through its own pipeline option
    snlp = stanza.Pipeline(
        lang=model_or_lang, tokenize_pretokenized=is_tokenized, **parser_opts
    )
    nlp = StanzaLanguage(snlp)

This works great and exactly as I would want. However, this tokens_from_list function is deprecated:

Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the words keyword argument, for example: from spacy.tokens import Doc; doc = Doc(nlp.vocab, words=[…])

It would be great if we could still initialise the parser and immediately tell it to use pre-tokenized text. That makes manually creating Doc objects unnecessary and seems more user-friendly. Moving this setting from parser init to the moment of processing each text is, in my case at least, cumbersome. If I understand correctly, you would need to manually create the Doc object and then push it through the pipeline yourself, too (see the sketch below). That is a lot of user interaction that was previously not needed for this.
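
For illustration, a minimal sketch of what that per-text workflow would look like (the model name is just an example; the Doc is built by hand and then pushed through every pipeline component):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Build the Doc yourself from an already tokenized sentence ...
words = ["This", "is", "a", "sentence", "."]
doc = Doc(nlp.vocab, words=words)

# ... and then push it through each pipeline component manually.
for name, component in nlp.pipeline:
    doc = component(doc)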

https://github.com/explosion/spaCy/blob/c045a9c7f637f85f7beccdae48a4cb765516d558/spacy/language.py#L435-L442

Now, I am aware that my case may be an exception, so I am especially looking for an alternative approach at parser-init time, one that does not necessarily have to be built-in. If possible, I would be happy to use a custom function or to subclass Language to make this work (though I'd prefer not to).

Tl;dr: what has changed such that tokens_from_list will be removed? And can you give me pointers on what to look into if I still want to make it clear to the parser that its input is already tokenised? I could subclass and overwrite __call__, but I don't know whether that is perhaps too destructive.

As a suggestion, perhaps pretokenized can be a property of Language and make_doc can be modified like so:

def make_doc(self, text):
    if self.pretokenized:
        # Treat the incoming text as whitespace-joined tokens
        words = text.split()
        # Every token is followed by a space, except (possibly) the last one
        spaces = [True] * (len(words) - 1) + ([True] if text[-1].isspace() else [False])
        return Doc(self.vocab, words=words, spaces=spaces)
    else:
        return self.tokenizer(text)
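
Usage would then be as simple as something like the following (a hypothetical sketch: pretokenized is not an existing Language attribute, it only exists if the change proposed above were made):

nlp = spacy.load("en_core_web_sm")
nlp.pretokenized = True  # hypothetical flag from the suggestion above
doc = nlp("This is a sentence .")  # whitespace-joined, already tokenized text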

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

4 reactions
BramVanroy commented, May 5, 2020

Hi Adriane

I have implemented it as follows and it works as expected. Thanks!

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab


class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary (see https://spacy.io/api/vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input `inp`.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
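
Plugging it in is then just a matter of replacing the pipeline's tokenizer (a minimal usage sketch, assuming an installed English model):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = _PretokenizedTokenizer(nlp.vocab)

# Both input forms now work: a list of tokens, or whitespace-joined tokens
doc = nlp(["This", "is", "a", "sentence", "."])
doc = nlp("This is a sentence .")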

1 reaction
adrianeboyd commented, May 4, 2020

I don’t actually know for sure why this was deprecated (my guess is something related to whitespace handling?), but just as you use:

nlp.tokenizer = nlp.tokenizer.tokens_from_list

You can substitute any custom tokenizer that does the correct input -> Doc conversion with the correct vocab for nlp.tokenizer:

import spacy
from spacy.tokens import Doc
from spacy.util import DummyTokenizer


class CustomTokenizer(DummyTokenizer):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, words):
        return Doc(self.vocab, words=words)


# assumes `nlp` is an already loaded pipeline
nlp.tokenizer = CustomTokenizer(nlp.vocab)
doc = nlp(["This", "is", "a", "sentence", "."])

It’s just the list (vs. whitespace) version of this example: https://spacy.io/usage/linguistic-features#custom-tokenizer-example
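
For reference, the whitespace version on that docs page looks roughly like this (reproduced from memory, so double-check against the linked page):

from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on single spaces and build the Doc directly from the pieces
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)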

Serialization is still a bit of an issue (the DummyTokenizer provides dummy serialization functions so you can run nlp.to_disk() without errors, but it doesn't actually save anything about this tokenizer), so you can't reload this exact pipeline with spacy.load(), but I don't think that's an issue in your context?
