
Alternative to tokenizer.tokens_from_list in parser init

See original GitHub issue

Hello again 😃

In my spacy_conll library I implement a utility function to easily initialise a spaCy-based parser (spacy, spacy-udpipe, spacy-stanza). It returns the parser (a Language object). Being able to simply swap the tokenizer for tokens_from_list is incredibly useful because it lets me write a generic initialisation like the one below, where both stanza and spacy can be set to accept pre-tokenized input:

import spacy

# `parser`, `model_or_lang` and `parser_opts` are arguments of the utility function
is_tokenized = True
if parser == "spacy":
    nlp = spacy.load(model_or_lang, **parser_opts)
    if is_tokenized:
        # Swap in the list-based tokenizer so nlp() accepts a list of tokens
        nlp.tokenizer = nlp.tokenizer.tokens_from_list
elif parser == "stanza":
    import stanza
    from spacy_stanza import StanzaLanguage

    # stanza handles pre-tokenized input through its own pipeline option
    snlp = stanza.Pipeline(
        lang=model_or_lang, tokenize_pretokenized=is_tokenized, **parser_opts
    )
    nlp = StanzaLanguage(snlp)

This works great and exactly as I would want. However, this tokens_from_list function is deprecated:

Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the words keyword argument, for example: from spacy.tokens import Doc; doc = Doc(nlp.vocab, words=[…])

It would be great if we could still initialise the parser and immediately tell it to use pre-tokenized text. That makes manually creating Doc objects unnecessary and seems more user-friendly. Moving this setting from parser init to the moment of processing each text is, in my case at least, cumbersome. If I understand correctly, you would need to manually create the Doc object and then push it through the pipeline yourself, too (see the sketch below). That is a lot of user interaction that was previously not needed for this.
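
For illustration, a minimal sketch of what that per-text workflow would look like (the model name is just an example; the Doc is built by hand and then pushed through every pipeline component):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Build the Doc yourself from an already tokenized sentence ...
words = ["This", "is", "a", "sentence", "."]
doc = Doc(nlp.vocab, words=words)

# ... and then push it through each pipeline component manually.
for name, component in nlp.pipeline:
    doc = component(doc)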

https://github.com/explosion/spaCy/blob/c045a9c7f637f85f7beccdae48a4cb765516d558/spacy/language.py#L435-L442

Now, I am aware that my case may be an exception, so I am especially looking for an alternative approach at parser-init time, one that does not necessarily have to be built-in. If possible, I would be happy to use a custom function or to subclass Language to make this work (though I'd prefer not to).

Tl;dr: what has changed such that tokens_from_list will be removed? And can you give me pointers on what to look into if I still want to make it clear to the parser that its input is already tokenised? I could subclass and overwrite __call__, but I don't know whether that is perhaps too destructive.

As a suggestion, perhaps pretokenized can be a property of Language and make_doc can be modified like so:

def make_doc(self, text):
    if self.pretokenized:
        # Treat the incoming text as whitespace-joined tokens
        words = text.split()
        # Every token is followed by a space, except (possibly) the last one
        spaces = [True] * (len(words) - 1) + ([True] if text[-1].isspace() else [False])
        return Doc(self.vocab, words=words, spaces=spaces)
    else:
        return self.tokenizer(text)
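
Usage would then be as simple as something like the following (a hypothetical sketch: pretokenized is not an existing Language attribute, it only exists if the change proposed above were made):

nlp = spacy.load("en_core_web_sm")
nlp.pretokenized = True  # hypothetical flag from the suggestion above
doc = nlp("This is a sentence .")  # whitespace-joined, already tokenized text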

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

4 reactions
BramVanroy commented, May 5, 2020

Hi Adriane

I have implemented it as follows and it works as expected. Thanks!

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab


class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary (see https://spacy.io/api/vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input `inp`.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
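
Plugging it in is then just a matter of replacing the pipeline's tokenizer (a minimal usage sketch, assuming an installed English model):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = _PretokenizedTokenizer(nlp.vocab)

# Both input forms now work: a list of tokens, or whitespace-joined tokens
doc = nlp(["This", "is", "a", "sentence", "."])
doc = nlp("This is a sentence .")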

1 reaction
adrianeboyd commented, May 4, 2020

I don’t actually know for sure why this was deprecated (my guess is something related to whitespace handling?), but just as you use:

nlp.tokenizer = nlp.tokenizer.tokens_from_list

You can substitute any custom tokenizer that does the correct input -> Doc conversion with the correct vocab for nlp.tokenizer:

import spacy
from spacy.tokens import Doc
from spacy.util import DummyTokenizer


class CustomTokenizer(DummyTokenizer):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, words):
        return Doc(self.vocab, words=words)


# assumes `nlp` is an already loaded pipeline
nlp.tokenizer = CustomTokenizer(nlp.vocab)
doc = nlp(["This", "is", "a", "sentence", "."])

It’s just the list (vs. whitespace) version of this example: https://spacy.io/usage/linguistic-features#custom-tokenizer-example
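
For reference, the whitespace version on that docs page looks roughly like this (reproduced from memory, so double-check against the linked page):

from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on single spaces and build the Doc directly from the pieces
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)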

Serialization is still a bit of an issue (the DummyTokenizer provides dummy serialization functions so you can run nlp.to_disk() without errors, but it doesn't actually save anything about this tokenizer), so you can't reload this exact pipeline with spacy.load(), but I don't think that's an issue in your context?
