word_tokenizer and Spanish

I would like to have a word_tokenizer that works with Spanish. For example, this code:

import nltk
from nltk.tokenize import word_tokenize

sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
sentences = spanish_sentence_tokenizer.tokenize(sentences)
for s in sentences:
    print([s for s in word_tokenize(s)])

gives the following:

['¿Quién', 'eres', 'tú', '?']
['¡Hola', '!']
['¿Dónde', 'estoy', '?']

but I would have expected the following instead:

['¿', 'Quién', 'eres', 'tú', '?']
['¡', 'Hola', '!']
['¿', 'Dónde', 'estoy', '?']

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions:1
  • Comments:5 (3 by maintainers)

Top GitHub Comments

1 reaction
alvations commented, Dec 27, 2016

C.f. #1214, there are quite a few alternative tokenizers in NLTK =)

E.g. using the NLTK port of @jonsafari's toktok tokenizer:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True
>>> from nltk.tokenize.toktok import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> sent = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> toktok.tokenize(sent)
[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?', u'\xa1Hola', u'!', u'\xbf', u'D\xf3nde', u'estoy', u'?']
>>> print(" ".join(toktok.tokenize(sent)))
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ?

>>> from nltk import sent_tokenize
>>> sentences = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')]
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]

>>> print('\n'.join([' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')]))
¿ Quién eres tú ?
¡Hola !
¿ Dónde estoy ?

If you hack the code a little and add u'\xa1' in https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51 , you should be able to get:

[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1', u'Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
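
The suggestion above amounts to treating '¡' the same way as '¿', i.e. padding it with spaces so it becomes a token of its own. A minimal sketch of that idea as a post-processing step, using only the standard `re` module. The names `INVERTED_PUNCT_RE` and `split_inverted_punct` are hypothetical and not part of NLTK, and this is not the actual change to toktok.py:

```python
import re

# Hypothetical pattern (not from NLTK): matches the inverted Spanish
# question and exclamation marks.
INVERTED_PUNCT_RE = re.compile(r"([¿¡])")


def split_inverted_punct(tokens):
    """Split '¿' and '¡' off any token they are attached to."""
    result = []
    for token in tokens:
        # Pad each inverted mark with spaces, then split on whitespace,
        # so '¡Hola' becomes ['¡', 'Hola'] and plain tokens pass through.
        result.extend(INVERTED_PUNCT_RE.sub(r" \1 ", token).split())
    return result


print(split_inverted_punct(['¡Hola', '!']))
```

It can be applied to the output of any tokenizer that leaves the inverted marks attached, for example `split_inverted_punct(word_tokenize(s))`.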

0 reactions
alvations commented, Dec 27, 2016

Please look at #1559 and jonsafari/tok-tok#1.
