punkt.PunktSentenceTokenizer() for Chinese
See original GitHub issue
I use the following code to train punkt for Chinese, but it doesn’t produce the desired result:
input_str_cn = "台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。"
# import punkt
import nltk.tokenize.punkt
# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read in training corpus
import codecs
train_file = "D:/CL/comp_ling/data/dushu_1999_2008/1999.txt"
text = codecs.open(train_file, "r", "gb18030").read()
# Train tokenizer
tokenizer.sent_end_chars = ('!','?','。','”')
for sent_end in tokenizer.sent_end_chars:
    print sent_end
tokenizer.train(text)
# Dump pickled tokenizer
import pickle
out = open("chinese.pickle","wb")
pickle.dump(tokenizer, out)
out.close()
# To use the tokenizer
with open("chinese.pickle", "rb") as infile:
    tokenizer_new = pickle.load(infile)
sents = tokenizer_new.tokenize(input_str_cn)
for s in sents:
    print s
The produced result is as follows:
“台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。”
It seems that sent_end_chars does not work here. I have checked the encoding and there is no problem with it. Could anyone help? Thanks.
Issue Analytics
- Created 6 years ago
- Reactions: 2
- Comments: 8 (4 by maintainers)

I don’t think the space is going to make a difference except that our implementation expects it: you can either change _period_context_fmt, or add a space before processing and strip it afterwards. I’d be interested to hear if Punkt resolved any of the ambiguities in Chinese sentence boundaries.

Punkt here only considers a sent_end_char to be a potential sentence boundary if it is followed by either whitespace or punctuation (see _period_context_fmt). The absence of a whitespace character after “。” is sufficient for it not to be picked up.

I have my doubts about the applicability of Punkt to Chinese. Does “。” not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?
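For what it’s worth, here is a minimal, untested sketch of the first route (changing _period_context_fmt). It relies on the lang_vars= parameter of PunktSentenceTokenizer and a PunktLanguageVars subclass (ChineseLanguageVars is just an illustrative name); the \s* relaxation below is an assumption on my part, not a confirmed fix. Note that assigning tokenizer.sent_end_chars on the tokenizer instance, as in the original code, never reaches the compiled regexes, which are built from the lang_vars object.

import nltk.tokenize.punkt as punkt

class ChineseLanguageVars(punkt.PunktLanguageVars):
    # Candidate sentence-ending characters for Chinese text.
    sent_end_chars = ('。', '！', '？', '”')

    # Same template as the default _period_context_fmt, but with the
    # whitespace after a sentence-ending character made optional
    # (\s* instead of \s+), since Chinese puts no space between sentences.
    # This regex tweak is an untested assumption.
    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \s*(?P<next_tok>\S+)     # or (optional) whitespace and the next token
        ))"""

# The boundary regexes are compiled from the lang_vars object, so it has to
# be supplied at construction time rather than patched onto the tokenizer.
tokenizer = punkt.PunktSentenceTokenizer(lang_vars=ChineseLanguageVars())
tokenizer.train(text)  # `text` and `input_str_cn` as in the code above
for s in tokenizer.tokenize(input_str_cn):
    print(s)

The other workaround mentioned above, inserting a space after each candidate end character before tokenizing and stripping it from the output, leaves the regexes alone at the cost of a pre- and post-processing pass.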
sent_end_charto be a potential sentence boundary if it is followed by either whitespace or punctuation (see_period_context_fmt). The absence of a whitespace character after “。” is sufficient for it to not be picked up.I have my doubts about the applicability of Punkt to Chinese. Does “。” not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?