word_tokenize replaces characters
See original GitHub issue

When using the word_tokenize function, the quotation marks get replaced with different quotation marks.
Example (German):

import nltk

sentence = "\"Ja.\""                    # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)   # tokens[0] == '``'
print(tokens[0] == sentence[0])         # prints False

Is this a bug, or is there a reason behind this behaviour?
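The behaviour can be reproduced without NLTK. Below is a minimal sketch of Treebank-style quote substitution, assuming rules similar to those linked later in this thread (nltk/tokenize/treebank.py); the function name treebank_quotes and the exact patterns are illustrative, not NLTK's actual code:

```python
import re

def treebank_quotes(text):
    """Sketch of Penn Treebank-style double-quote handling:
    opening quotes become ``, closing quotes become ''."""
    text = re.sub(r'^\"', r'``', text)               # quote at string start opens
    text = re.sub(r'(``)', r' \1 ', text)            # pad converted opening quotes
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)   # quote after space/bracket opens
    text = re.sub(r'"', " '' ", text)                # any remaining quote closes
    return text

print(treebank_quotes('"Ja."').split())  # ['``', 'Ja.', "''"]
```

This shows why the comparison in the question prints False: the first token is the two-character string `` rather than the original " character.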
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn off altering the text.

@mwess After some checking: the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer. It only happens when there are double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for the single quotes, looking at the treebank tokenizer's STARTING_QUOTES regexes, we see that the output doesn't indicate directionality. I think this is kept to be consistent with Penn Treebank annotations. I hope the clarifications help.
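If the original characters are needed downstream, one common workaround (a sketch, not an NLTK API; the names QUOTE_MAP and restore_quotes are hypothetical) is to map the Treebank quote tokens back to ASCII double quotes after tokenizing:

```python
# Treebank tokenizers emit `` for opening and '' for closing double quotes;
# map both back to the plain " character.
QUOTE_MAP = {'``': '"', "''": '"'}

def restore_quotes(tokens):
    """Replace Treebank quote tokens with ASCII double quotes."""
    return [QUOTE_MAP.get(tok, tok) for tok in tokens]

tokens = ['``', 'Ja', '.', "''"]
print(restore_quotes(tokens))  # ['"', 'Ja', '.', '"']
```

Note that this loses the opening/closing distinction the Treebank convention encodes, which is exactly the directionality information mentioned above.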