
Support for French tokenization exceptions

See original GitHub issue

I recently submitted a PR so that spaCy supports tokenization in French. I'm currently working on adding the numerous tokenization exceptions that exist in French. With the help of Wiktionary, I ended up with ~100k tokenization exceptions. You can find the gzipped exception file here.

I first tried to load the gzipped file directly in fr/language_data.py, but it had a major impact on spaCy's loading time: it took 2.0s just to import spacy (1.8s of which was spent in fr/language_data.py). By saving the processed TOKENIZER_EXCEPTIONS to a file with numpy.save and loading it back with numpy.load, loading fr/language_data.py took only 0.8s. That's still too much, as this file is imported every time spaCy is loaded.
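
For anyone who wants to reproduce these measurements, here is a minimal sketch of the numpy-based caching described above, assuming the processed exceptions are a plain dict (the file name and the toy dict are hypothetical):

    import time
    import numpy as np

    # Time the import, as in the measurements above.
    start = time.time()
    import spacy
    print("import spacy: %.2fs" % (time.time() - start))

    # Cache a processed exceptions dict; numpy pickles it into a 0-d object array.
    exceptions = {"aujourd'hui": [{"ORTH": "aujourd'hui"}]}  # toy stand-in
    np.save("fr_tokenizer_exceptions.npy", exceptions)

    # .item() unwraps the 0-d object array back into the original dict.
    cached = np.load("fr_tokenizer_exceptions.npy", allow_pickle=True).item()
    assert cached == exceptions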

I'd like some insight from the spaCy developers on a good solution to this tokenizer-exceptions problem. One possibility would be lazy loading: turning French.tokenizer_exceptions into a lazy object that only imports the tokenizer exceptions when one of its methods (such as .items()) is called.
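
One possible shape for such a lazy object — a sketch only, not spaCy's actual implementation:

    class LazyExceptions(object):
        # Wraps a zero-argument loader and defers calling it until the
        # exception table is actually accessed.
        def __init__(self, loader):
            self._loader = loader
            self._data = None

        def _table(self):
            if self._data is None:
                self._data = self._loader()
            return self._data

        def items(self):
            return self._table().items()

        def __getitem__(self, key):
            return self._table()[key]

        def __contains__(self, key):
            return key in self._table()

        def __len__(self):
            return len(self._table())

    # Hypothetical wiring: nothing is read from disk at import time; the
    # expensive numpy.load happens on first access, e.g. when the tokenizer
    # calls TOKENIZER_EXCEPTIONS.items().
    # TOKENIZER_EXCEPTIONS = LazyExceptions(
    #     lambda: np.load("fr_tokenizer_exceptions.npy", allow_pickle=True).item())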

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
ines commented, Jan 29, 2017

@raphael0202 Good point, I can take care of adding this to the docs. I guess it should probably go in the “Custom tokenizer exceptions” section.

@oroszgy Since you’ve contributed the token_match functionality, would you be interested in writing up a short explanation (1-2 sentences) for the docs, or maybe a nice, simplified example from Hungarian? We could also do a little table instead that shows regex examples + match examples. I’m happy to do all the formatting/markup for the site, so plain text would be fine.
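
For readers landing here from a search, this is roughly what the two mechanisms discussed above look like in recent spaCy versions — a custom tokenizer exception (special case) and a token_match callable. The API has changed since this 2017 thread, and the French language data may already ship similar rules, so treat this as an illustrative sketch:

    import re
    import spacy
    from spacy.symbols import ORTH

    nlp = spacy.blank("fr")

    # A custom tokenizer exception (special case): keep "aujourd'hui" whole
    # instead of splitting on the apostrophe.
    nlp.tokenizer.add_special_case("aujourd'hui", [{ORTH: "aujourd'hui"}])

    # token_match: any substring the callable matches is kept as one token,
    # e.g. numbers like "10.000,50" that punctuation rules would otherwise split.
    nlp.tokenizer.token_match = re.compile(r"^\d+(\.\d{3})*(,\d+)?$").match

    print([t.text for t in nlp("aujourd'hui, ça coûte 10.000,50")])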

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Read more comments on GitHub >

Top Results From Across the Web

Tokenization - CoreNLP - Stanford NLP Group
Description. Tokenization is the process of turning text into tokens. · Tokenization For French, German, and Spanish · Options · Tokenizing From The...

Software > Stanford Tokenizer
A tokenizer divides text into a sequence of tokens, which roughly correspond to ... tokenizers FrenchTokenizer and SpanishTokenizer for French and Spanish.

Nltk french tokenizer in python not working - Stack Overflow
The issue with this Tokenizer is that it is not an effective tokenizer for french sentences: from nltk.tokenize import word_tokenize ...

Creating an index > NLP and tokenization > Exceptions
Exceptions (also known as synonyms) allow to map one or more tokens (including tokens with characters that would normally be excluded) to a...

Language Analysis | Apache Solr Reference Guide 8.0
For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters.
