
Support for French tokenization exceptions

See original GitHub issue

I recently submitted a PR so that spaCy supports tokenization in French. I'm currently working on adding the numerous tokenization exceptions that exist in French. With the help of Wiktionary, I ended up with ~100k tokenization exceptions. You can find the gzipped exception file here.

I first tried to load the gzipped file directly in fr/language_data.py, but it had a major impact on spaCy's loading time: it took 2.0s just to import spacy (1.8s of which was spent in fr/language_data.py). By saving the processed TOKENIZER_EXCEPTIONS to a file with numpy.save and loading it back with numpy.load, loading fr/language_data.py took only 0.8s. That's still too much, as this file is imported every time spaCy is loaded.
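
For anyone who wants to reproduce these measurements, here is a minimal sketch of the numpy-based caching described above, assuming the processed exceptions are a plain dict (the file name and the toy dict are hypothetical):

    import time
    import numpy as np

    # Time the import, as in the measurements above.
    start = time.time()
    import spacy
    print("import spacy: %.2fs" % (time.time() - start))

    # Cache a processed exceptions dict; numpy pickles it into a 0-d object array.
    exceptions = {"aujourd'hui": [{"ORTH": "aujourd'hui"}]}  # toy stand-in
    np.save("fr_tokenizer_exceptions.npy", exceptions)

    # .item() unwraps the 0-d object array back into the original dict.
    cached = np.load("fr_tokenizer_exceptions.npy", allow_pickle=True).item()
    assert cached == exceptions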

I'd like some insight from the spaCy developers on a good solution to this tokenizer-exceptions problem. One possibility would be lazy loading: turning French.tokenizer_exceptions into a lazy object that only imports the tokenizer exceptions when one of its methods (such as .items()) is called.
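
One possible shape for such a lazy object — a sketch only, not spaCy's actual implementation:

    class LazyExceptions(object):
        # Wraps a zero-argument loader and defers calling it until the
        # exception table is actually accessed.
        def __init__(self, loader):
            self._loader = loader
            self._data = None

        def _table(self):
            if self._data is None:
                self._data = self._loader()
            return self._data

        def items(self):
            return self._table().items()

        def __getitem__(self, key):
            return self._table()[key]

        def __contains__(self, key):
            return key in self._table()

        def __len__(self):
            return len(self._table())

    # Hypothetical wiring: nothing is read from disk at import time; the
    # expensive numpy.load happens on first access, e.g. when the tokenizer
    # calls TOKENIZER_EXCEPTIONS.items().
    # TOKENIZER_EXCEPTIONS = LazyExceptions(
    #     lambda: np.load("fr_tokenizer_exceptions.npy", allow_pickle=True).item())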

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
ines commented, Jan 29, 2017

@raphael0202 Good point, I can take care of adding this to the docs. I guess it should probably go in the “Custom tokenizer exceptions” section.

@oroszgy Since you’ve contributed the token_match functionality, would you be interested in writing up a short explanation (1-2 sentences) for the docs, or maybe a nice, simplified example from Hungarian? We could also do a little table instead that shows regex examples + match examples. I’m happy to do all the formatting/markup for the site, so plain text would be fine.
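
For readers landing here from a search, this is roughly what the two mechanisms discussed above look like in recent spaCy versions — a custom tokenizer exception (special case) and a token_match callable. The API has changed since this 2017 thread, and the French language data may already ship similar rules, so treat this as an illustrative sketch:

    import re
    import spacy
    from spacy.symbols import ORTH

    nlp = spacy.blank("fr")

    # A custom tokenizer exception (special case): keep "aujourd'hui" whole
    # instead of splitting on the apostrophe.
    nlp.tokenizer.add_special_case("aujourd'hui", [{ORTH: "aujourd'hui"}])

    # token_match: any substring the callable matches is kept as one token,
    # e.g. numbers like "10.000,50" that punctuation rules would otherwise split.
    nlp.tokenizer.token_match = re.compile(r"^\d+(\.\d{3})*(,\d+)?$").match

    print([t.text for t in nlp("aujourd'hui, ça coûte 10.000,50")])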

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Read more comments on GitHub >

Top Results From Across the Web

Tokenization - CoreNLP - Stanford NLP Group
Description. Tokenization is the process of turning text into tokens. · Tokenization For French, German, and Spanish · Options · Tokenizing From The...

Software > Stanford Tokenizer
A tokenizer divides text into a sequence of tokens, which roughly correspond to ... tokenizers FrenchTokenizer and SpanishTokenizer for French and Spanish.

Nltk french tokenizer in python not working - Stack Overflow
The issue with this Tokenizer is that it is not an effective tokenizer for french sentences: from nltk.tokenize import word_tokenize ...

Creating an index > NLP and tokenization > Exceptions
Exceptions (also known as synonyms) allow to map one or more tokens (including tokens with characters that would normally be excluded) to a...

Language Analysis | Apache Solr Reference Guide 8.0
For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters.
