word_tokenizer and Spanish

I would like to have a word_tokenizer that works with Spanish. For example, this code:

import nltk
from nltk.tokenize import word_tokenize

sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"

# Load the pretrained Spanish Punkt sentence tokenizer
spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
sentences = spanish_sentence_tokenizer.tokenize(sentences)

# Tokenize each sentence into words
for s in sentences:
    print(word_tokenize(s))

gives the following:

['¿Quién', 'eres', 'tú', '?']
['¡Hola', '!']
['¿Dónde', 'estoy', '?']

but I would have expected the following instead:

['¿', 'Quién', 'eres', 'tú', '?']
['¡', 'Hola', '!']
['¿', 'Dónde', 'estoy', '?']
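Until the tokenizer itself handles this, one possible workaround is to post-process the token lists and peel the inverted marks off the front of each token. The following is only a minimal sketch: `split_inverted_marks` and `INVERTED_MARKS` are hypothetical names, not part of NLTK, and the input lists are copied from the output reported above so the snippet runs without NLTK installed.

```python
# Assumption: only these two opening marks need special handling.
INVERTED_MARKS = ("¿", "¡")


def split_inverted_marks(tokens):
    """Return a new token list with leading inverted marks split off.

    Hypothetical helper for illustration; not an NLTK function.
    """
    result = []
    for token in tokens:
        # Peel off every leading inverted mark, but leave a token that
        # consists of a single mark untouched.
        while len(token) > 1 and token[0] in INVERTED_MARKS:
            result.append(token[0])
            token = token[1:]
        result.append(token)
    return result


# Token lists copied from the output reported above.
reported = [
    ['¿Quién', 'eres', 'tú', '?'],
    ['¡Hola', '!'],
    ['¿Dónde', 'estoy', '?'],
]

for tokens in reported:
    print(split_inverted_marks(tokens))
```

In real use the lists would come from word_tokenize(s) instead of being hard-coded. This only patches the output and does not change how the tokenizer itself treats Spanish punctuation.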

Issue Analytics

Created 7 years ago
Reactions: 1
Comments: 5 (3 by maintainers)

Top GitHub Comments

alvations commented, Dec 27, 2016

Cf. #1214; there are quite a few alternative tokenizers in NLTK =)

E.g. using the NLTK port of @jonsafari's toktok tokenizer:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True
>>> from nltk.tokenize.toktok import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> sent = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> toktok.tokenize(sent)
[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?', u'\xa1Hola', u'!', u'\xbf', u'D\xf3nde', u'estoy', u'?']
>>> print " ".join(toktok.tokenize(sent))
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ?

>>> from nltk import sent_tokenize
>>> sentences = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')]
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]

>>> print '\n'.join([' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')])
¿ Quién eres tú ?
¡Hola !
¿ Dónde estoy ?

If you hack the code a little and add u'\xa1' in https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51 , you should be able to get:

[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1', u'Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
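To illustrate the idea behind that hack: a tokenizer of this style pads selected punctuation characters with spaces and then splits on whitespace, so adding u'\xa1' (¡) to the padded set is what separates it from the following word. The snippet below is only a sketch of that mechanism using the standard library. `INVERTED_PUNCT_RE` and `pad_inverted_punct` are hypothetical names, and the pattern is not the one used in toktok.py.

```python
import re

# Hypothetical pattern for illustration only; NOT the pattern from toktok.py.
# It matches the two Spanish inverted marks, u'\xbf' and u'\xa1'.
INVERTED_PUNCT_RE = re.compile(r"([¿¡])")


def pad_inverted_punct(text):
    """Surround each inverted mark with spaces, then normalize whitespace."""
    padded = INVERTED_PUNCT_RE.sub(r" \1 ", text)
    # Splitting on whitespace and re-joining collapses the extra spaces.
    return " ".join(padded.split())


print(pad_inverted_punct("¡Hola !"))
print(pad_inverted_punct("¿ Dónde estoy ?").split())
```

Applied to the toktok output lines shown earlier, this turns "¡Hola !" into "¡ Hola !", so a final whitespace split yields the inverted exclamation mark as its own token.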

alvations commented, Dec 27, 2016

Please look at #1559 and jonsafari/tok-tok#1.
