Support for French tokenization exceptions
I recently submitted a PR so that spaCy supports tokenization in French. I'm currently working on adding the numerous tokenization exceptions that exist in French. With the help of Wiktionary, I ended up with ~100k tokenization exceptions. You can find the gzipped exception file here.
I first tried to directly load the gzipped file in `fr/language_data.py`, but it had a major impact on spaCy loading time: it took 2.0s just to `import spacy` (1.8s of which were spent in `fr/language_data.py`).
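For context, a minimal sketch of what that eager, import-time loading could look like, assuming the exceptions are stored as gzipped JSON; the file name and entry structure below are illustrative, not the actual ones from the PR:

```python
import gzip
import json

# Illustrative file name; the real exception file from the PR may differ.
_EXC_PATH = "fr_tokenizer_exceptions.json.gz"

def _load_exceptions(path=_EXC_PATH):
    # Parse the full ~100k-entry mapping in one go.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Runs on every `import spacy` that pulls in fr/language_data.py,
# which is what makes the import so slow.
TOKENIZER_EXCEPTIONS = _load_exceptions()
```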
By saving the processed `TOKENIZER_EXCEPTIONS` to a file with `numpy.save` and loading it with `numpy.load`, it only took 0.8s to load `fr/language_data.py`. That is still too much, as this file is imported every time spaCy is loaded.
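A rough sketch of that approach, under the assumption that the whole exceptions dict is pickled inside a NumPy object array (file name and placeholder dict are illustrative):

```python
import numpy as np

# Placeholder for the processed exceptions dict built in fr/language_data.py.
TOKENIZER_EXCEPTIONS = {"aujourd'hui": [{"ORTH": "aujourd'hui"}]}

# One-off preprocessing step: np.save wraps the dict in a 0-d object array
# and pickles it to disk.
np.save("fr_tokenizer_exceptions.npy", TOKENIZER_EXCEPTIONS)

# At import time: newer NumPy versions require allow_pickle=True to load
# object arrays; .item() unwraps the original dict.
loaded = np.load("fr_tokenizer_exceptions.npy", allow_pickle=True).item()
assert loaded == TOKENIZER_EXCEPTIONS
```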
I wanted to get some insights from the spaCy developers, in order to find a good solution to this tokenizer exceptions problem. One possibility would be lazy loading: transforming `French.tokenizer_exceptions` into a lazy object which only imports the tokenizer exceptions when one of its methods (such as `.items()`) is called.
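A minimal sketch of that lazy-loading idea (the class name and loader are hypothetical, not existing spaCy API):

```python
import gzip
import json

def _load_exceptions(path="fr_tokenizer_exceptions.json.gz"):
    # Deferred work: only executed the first time the exceptions are needed.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

class LazyExceptions:
    """Dict-like wrapper that defers loading until first real access."""

    def __init__(self, loader=_load_exceptions):
        self._loader = loader
        self._data = None

    def _materialize(self):
        if self._data is None:
            self._data = self._loader()
        return self._data

    def items(self):
        return self._materialize().items()

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())

# Importing the module stays cheap; the file is only read when the
# tokenizer actually iterates over the exceptions.
TOKENIZER_EXCEPTIONS = LazyExceptions()
```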
Top GitHub Comments
@raphael0202 Good point, I can take care of adding this to the docs. I guess it should probably go in the “Custom tokenizer exceptions” section.
@oroszgy Since you've contributed the `token_match` functionality, would you be interested in writing up a short explanation (1-2 sentences) for the docs, or maybe a nice, simplified example from Hungarian? We could also do a little table instead that shows regex examples + match examples. I'm happy to do all the formatting/markup for the site, so plain text would be fine.
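For readers unfamiliar with the hook being discussed: roughly, `token_match` lets the tokenizer keep strings matching a regex as single tokens instead of splitting them. A hedged sketch of how it can be set on a recent spaCy version (the regex and example text are made up for illustration, not the Hungarian rules shipped with spaCy):

```python
import re
import spacy

# Hypothetical illustration: keep number/time-like strings with internal
# punctuation as single tokens instead of letting infix rules split them.
nlp = spacy.blank("hu")
nlp.tokenizer.token_match = re.compile(r"^\d+(?:[.,:]\d+)+$").match

doc = nlp("Indulás: 10:35, táv: 1.500,5 km.")
print([t.text for t in doc])
```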