
word_tokenize replaces characters

See original GitHub issue

When using the word_tokenize function, quotation marks get replaced with different quotation marks.

Example (German):

import nltk

sentence = "\"Ja.\""                   # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])        # Prints False.

Is this a bug, or is there a reason behind this behaviour?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

7 reactions
kovvalsky commented, Apr 12, 2020

Altering the original text is not acceptable in many applications. I wish word_tokenize had a flag to turn off altering the text.
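In the meantime, one way to get the original characters back is to post-process the token list, mapping the Treebank-style quote tokens back to a plain double quote. This is a hypothetical workaround (`restore_quotes` is not an NLTK API), and it assumes the tokens came out as shown in the issue above:

```python
def restore_quotes(tokens):
    # Hypothetical helper: map the Treebank-style quote tokens (`` and '')
    # back to a plain ASCII double quote; leave everything else untouched.
    mapping = {"``": '"', "''": '"'}
    return [mapping.get(tok, tok) for tok in tokens]

# word_tokenize('"Ja."') produces these tokens, per the issue above:
print(restore_quotes(["``", "Ja", ".", "''"]))  # ['"', 'Ja', '.', '"']
```

Note this loses the opening/closing distinction the Treebank tokens encode, which is exactly the trade-off against preserving the original text.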

2 reactions
alvations commented, May 5, 2017

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens when there are double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for the single quotes, from the Treebank tokenizer's STARTING_QUOTES regexes we see that it doesn't indicate directionality. I think this is kept to be consistent with Penn Treebank annotations.
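To illustrate the behaviour, here is a simplified sketch of those quote rules using only the stdlib re module. These are not the actual NLTK regexes (the real STARTING_QUOTES/ENDING_QUOTES rules linked above are more extensive); this is an assumption-labelled reconstruction of the core idea: a double quote at the start of the string or after an opening bracket/space becomes ``, and any remaining double quote becomes '':

```python
import re

def ptb_quotes(text):
    # Simplified sketch (assumption) of the Penn-Treebank-style quote rules,
    # not the actual NLTK implementation.
    text = re.sub(r'^"', '``', text)              # quote at start of string opens
    text = re.sub(r'([ (\[{<])"', r'\1``', text)  # quote after space/opener opens
    text = text.replace('"', "''")                # remaining quotes close
    return text

print(ptb_quotes('"Ja."'))  # ``Ja.''
```

This is why tokens[0] in the example above is `` rather than " and why the round trip back to the original text is lossy.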

I hope the clarification helps.

Read more comments on GitHub >

Top Results From Across the Web

How to keep special characters together in word_tokenize?
I have an NLP problem that involves some coding assignments such as "fn_point->extract.isolate_r", and when I use word_tokenize, the assignment ...

What is NLTK word_tokenize? | How to use? - eduCBA
Nltk word_tokenize is used to extract tokens from a string of characters using the word tokenize method. It actually returns a single word's ...

3 Processing Raw Text - NLTK
This step is called tokenization, and it produces our familiar structure, a list of words and punctuation. >>> tokens = nltk.word_tokenize(raw) >>> type ...

What is word_tokenize in Python? - Educative.io
word_tokenize is a function in Python that splits a given sentence into words using ... Some special characters, such as commas, are also ...

Regular expressions and word tokenization | Chan`s Jupyter
Find all web links in a document; Parse email addresses, remove/replace unwanted characters. Common Regex patterns ...
