
word_tokenize replaces characters

See original GitHub issue

When using the word_tokenize function, quotation marks get replaced with different quotation marks.

Example (German):

import nltk

sentence = "\"Ja.\""                   # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])        # Prints False.

Is this a bug, or is there a reason behind this behaviour?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

7 reactions
kovvalsky commented, Apr 12, 2020

Altering the original text is not acceptable in many applications. I wish word_tokenize had a flag to turn off altering the text.
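In the meantime, one way to get the original characters back is to post-process the token list, mapping the Treebank-style quote tokens back to a plain double quote. This is a hypothetical workaround (`restore_quotes` is not an NLTK API), and it assumes the tokens came out as shown in the issue above:

```python
def restore_quotes(tokens):
    # Hypothetical helper: map the Treebank-style quote tokens (`` and '')
    # back to a plain ASCII double quote; leave everything else untouched.
    mapping = {"``": '"', "''": '"'}
    return [mapping.get(tok, tok) for tok in tokens]

# word_tokenize('"Ja."') produces these tokens, per the issue above:
print(restore_quotes(["``", "Ja", ".", "''"]))  # ['"', 'Ja', '.', '"']
```

Note this loses the opening/closing distinction the Treebank tokens encode, which is exactly the trade-off against preserving the original text.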

2 reactions
alvations commented, May 5, 2017

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens when there are double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for the single quotes, from the Treebank tokenizer's STARTING_QUOTES regexes we see that it doesn't indicate directionality. I think this is kept to be consistent with Penn Treebank annotations.
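To illustrate the behaviour, here is a simplified sketch of those quote rules using only the stdlib re module. These are not the actual NLTK regexes (the real STARTING_QUOTES/ENDING_QUOTES rules linked above are more extensive); this is an assumption-labelled reconstruction of the core idea: a double quote at the start of the string or after an opening bracket/space becomes ``, and any remaining double quote becomes '':

```python
import re

def ptb_quotes(text):
    # Simplified sketch (assumption) of the Penn-Treebank-style quote rules,
    # not the actual NLTK implementation.
    text = re.sub(r'^"', '``', text)              # quote at start of string opens
    text = re.sub(r'([ (\[{<])"', r'\1``', text)  # quote after space/opener opens
    text = text.replace('"', "''")                # remaining quotes close
    return text

print(ptb_quotes('"Ja."'))  # ``Ja.''
```

This is why tokens[0] in the example above is `` rather than " and why the round trip back to the original text is lossy.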

I hope the clarification helps.

Read more comments on GitHub >

Top Results From Across the Web

How to keep special characters together in word_tokenize?
I have an NLP problem that involves some coding assignments such as "fn_point->extract.isolate_r", and when I use word_tokenize, the assignment ...

What is NLTK word_tokenize? | How to use? - eduCBA
Nltk word_tokenize is used to extract tokens from a string of characters using the word tokenize method. It actually returns a single word's ...

3 Processing Raw Text - NLTK
This step is called tokenization, and it produces our familiar structure, a list of words and punctuation. >>> tokens = nltk.word_tokenize(raw) >>> type ...

What is word_tokenize in Python? - Educative.io
word_tokenize is a function in Python that splits a given sentence into words using ... Some special characters, such as commas, are also ...

Regular expressions and word tokenization | Chan`s Jupyter
Find all web links in a document; Parse email addresses, remove/replace unwanted characters. Common Regex patterns ...
