Alternative to tokenizer.tokens_from_list in parser init
Hello again 😃
In my spacy_conll library I implement a utility function to easily initialise a spaCy-based parser (spacy, spacy-udpipe, spacy-stanza). It returns the parser (a `Language` object). Being able to simply plug in the tokenizer by setting it to `tokens_from_list` is incredibly useful, because it allows me to write a generic initialisation like the one below, where stanza as well as spacy can be set to take pre-tokenised input:
```python
is_tokenized = True
if parser == "spacy":
    nlp = spacy.load(model_or_lang, **parser_opts)
    if is_tokenized:
        nlp.tokenizer = nlp.tokenizer.tokens_from_list
elif parser == "stanza":
    import stanza
    from spacy_stanza import StanzaLanguage

    snlp = stanza.Pipeline(
        lang=model_or_lang, tokenize_pretokenized=is_tokenized, **parser_opts
    )
    nlp = StanzaLanguage(snlp)
```
This works great and exactly as I would want. However, this `tokens_from_list` function is deprecated:

> Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
> `from spacy.tokens import Doc`
> `doc = Doc(nlp.vocab, words=[…])`
It would be great if we could still initialise the parser and immediately tell it to use pre-tokenised text. That makes the manual creation of `Doc` objects unnecessary and seems more user-friendly. Moving this setting from parser init to the moment of parsing is, in my case at least, cumbersome. If I understand correctly, you'd need to manually create the `Doc` object and then push it through the pipeline yourself, too. That is a lot of user interaction that was previously not needed for this.
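For concreteness, here is a minimal sketch of that manual route, assuming the spaCy v2-era API that `tokens_from_list` belongs to (the variable names are illustrative):

```python
# Manual route: create the Doc yourself from pre-tokenised words,
# then push it through the remaining pipeline components.
from spacy.tokens import Doc

words = ["This", "is", "already", "tokenised", "."]
doc = Doc(nlp.vocab, words=words)
for name, component in nlp.pipeline:
    doc = component(doc)
```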
Now, I am aware that my use case may be an exception, so the alternative approach at parser-init time that I am looking for does not necessarily have to be built-in. If possible, I would be happy to use a custom function or to subclass `Language` to make this work (though I'd prefer not to).
Tl;dr: what has changed such that `tokens_from_list` will be removed? And can you give me pointers on what to look into if I still wanted to make it clear to the parser that its input is already tokenised? I could subclass `Language` and override `__call__`, but I don't know if that's perhaps too destructive.
As a suggestion, perhaps `pretokenized` could be a property of `Language`, and `make_doc` could be modified like so:
```python
def make_doc(self, text):
    if self.pretokenized:
        # Treat the input as whitespace-joined tokens: every token is
        # followed by a space, except possibly the last one.
        words = text.split()
        spaces = [True] * (len(words) - 1) + ([True] if text[-1].isspace() else [False])
        return Doc(self.vocab, words=words, spaces=spaces)
    else:
        return self.tokenizer(text)
```
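To illustrate the intent (note that `pretokenized` is hypothetical here, not an existing `Language` attribute; this sketch only shows the behaviour the suggestion would enable):

```python
# Hypothetical usage of the suggested flag: whitespace-joined input is
# split into tokens as-is, with no further tokenisation applied.
nlp.pretokenized = True
doc = nlp("This is already tokenised .")
```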
Top GitHub Comments
Hi Adriane
I have implemented it as follows and it works as expected. Thanks!
I don’t actually know for sure why this was deprecated (my guess is something related to whitespace handling?), but just as you use:

```python
nlp.tokenizer = nlp.tokenizer.tokens_from_list
```

you can substitute any custom tokenizer that does the correct input -> `Doc` conversion with the correct vocab for `nlp.tokenizer`. It’s just the list (vs. whitespace) version of this example: https://spacy.io/usage/linguistic-features#custom-tokenizer-example
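Sketched out, a list-based version might look something like this (my own adaptation of the linked whitespace example, not code from the spaCy docs; the class name is made up):

```python
# A list-based custom tokenizer: takes a list of strings instead of raw
# text and builds a Doc with the pipeline's vocab, so no re-tokenisation
# ever happens.
from spacy.tokens import Doc

class TokenListTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, tokens):
        # `tokens` is a list of strings, i.e. pre-tokenised input.
        return Doc(self.vocab, words=tokens)

nlp.tokenizer = TokenListTokenizer(nlp.vocab)
doc = nlp(["This", "is", "already", "tokenised", "."])
```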
Serialization is still a bit of an issue (the `DummyTokenizer` provides dummy serialization functions so you can run `nlp.to_disk()` without errors, but it doesn’t actually save anything about this tokenizer), so you can’t reload this exact pipeline with `spacy.load()`, but I don’t think that’s an issue in your context?
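If it does matter, one workaround sketch is to re-attach the custom tokenizer by hand after loading (assuming the hypothetical `TokenListTokenizer` from the sketch above and an illustrative path):

```python
# Reload the saved pipeline, then restore the custom tokenizer manually,
# since it is not serialized with the rest of the pipeline.
import spacy

nlp = spacy.load("/path/to/saved/pipeline")
nlp.tokenizer = TokenListTokenizer(nlp.vocab)
```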