word_tokenizer and Spanish
I would like to have a word_tokenizer that works with Spanish. For example, this code:
```python
import nltk
from nltk.tokenize import word_tokenize

sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
sentences = spanish_sentence_tokenizer.tokenize(sentences)
for s in sentences:
    print([w for w in word_tokenize(s)])
```
gives the following:
```
['¿Quién', 'eres', 'tú', '?']
['¡Hola', '!']
['¿Dónde', 'estoy', '?']
```
but I would have expected the following instead:
```
['¿', 'Quién', 'eres', 'tú', '?']
['¡', 'Hola', '!']
['¿', 'Dónde', 'estoy', '?']
```
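For reference, one workaround is to split the inverted punctuation in a post-processing pass over the `word_tokenize` output. This is just a sketch, not an NLTK API; the helper name `tokenize_es` is made up for illustration:

```python
from nltk.tokenize import word_tokenize

def tokenize_es(sentence):
    # Hypothetical helper: word_tokenize leaves '¿'/'¡' glued to the
    # following word, so peel them off as separate tokens.
    tokens = []
    for token in word_tokenize(sentence):
        while token and token[0] in '¿¡':
            tokens.append(token[0])
            token = token[1:]
        if token:
            tokens.append(token)
    return tokens

print(tokenize_es("¿Quién eres tú?"))
# ['¿', 'Quién', 'eres', 'tú', '?']
```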
Cf. #1214; there are quite a few alternative tokenizers in NLTK =)

E.g., using the NLTK port of @jonsafari's toktok tokenizer: if you hack the code a little and add u'\xa1' to the punctuation regex at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51, you should be able to get the expected output above. Please look at #1559 and jonsafari/tok-tok#1.
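For illustration, a minimal sketch of the toktok route. The commented output is what the patched tokenizer would be expected to produce; the actual result depends on the NLTK version and on whether the u'\xa1' patch above is applied:

```python
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
for s in ["¿Quién eres tú?", "¡Hola!", "¿Dónde estoy?"]:
    print(toktok.tokenize(s))

# With u'\xa1' ('¡') added to the padding regex in toktok.py,
# the expected output would be:
# ['¿', 'Quién', 'eres', 'tú', '?']
# ['¡', 'Hola', '!']
# ['¿', 'Dónde', 'estoy', '?']
```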