punkt.PunktSentenceTokenizer() for Chinese
See original GitHub issue
I use the following code to train punkt for Chinese, but it doesn’t produce the desired result:
input_str_cn = "台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。"
# import punkt
import nltk.tokenize.punkt
# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read in training corpus
import codecs
train_file = "D:/CL/comp_ling/data/dushu_1999_2008/1999.txt"
text = codecs.open(train_file, "r", "gb18030").read()
# Train tokenizer
tokenizer.sent_end_chars = ('!','?','。','”')
for sent_end in tokenizer.sent_end_chars:
    print sent_end
tokenizer.train(text)
# Dump pickled tokenizer
import pickle
out = open("chinese.pickle","wb")
pickle.dump(tokenizer, out)
out.close()
# To use the tokenizer
with open("chinese.pickle", "rb") as infile:
    tokenizer_new = pickle.load(infile)
sents = tokenizer_new.tokenize(input_str_cn)
for s in sents:
    print s
The produced result is as follows:
“台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。”
It seems that sent_end_chars does not work here. I have checked the encoding and there is no problem with it. Could anyone help? Thanks.
Issue Analytics
- Created 6 years ago
- Reactions: 2
- Comments: 8 (4 by maintainers)

I don’t think the space is going to make a difference except that our implementation expects it: you can either change _period_context_fmt, or add a space before processing and strip it afterwards. I’d be interested to hear if Punkt resolved any of the ambiguities in Chinese sentence boundaries.

Punkt here only considers a sent_end_char to be a potential sentence boundary if it is followed by either whitespace or punctuation (see _period_context_fmt). The absence of a whitespace character after “。” is sufficient for it not to be picked up.

I have my doubts about the applicability of Punkt to Chinese. Does “。” not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?
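For what it’s worth, here is a minimal, untested sketch of the first route (changing _period_context_fmt). It relies on the lang_vars= parameter of PunktSentenceTokenizer and a PunktLanguageVars subclass (ChineseLanguageVars is just an illustrative name); the \s* relaxation below is an assumption on my part, not a confirmed fix. Note that assigning tokenizer.sent_end_chars on the tokenizer instance, as in the original code, never reaches the compiled regexes, which are built from the lang_vars object.

import nltk.tokenize.punkt as punkt

class ChineseLanguageVars(punkt.PunktLanguageVars):
    # Candidate sentence-ending characters for Chinese text.
    sent_end_chars = ('。', '！', '？', '”')

    # Same template as the default _period_context_fmt, but with the
    # whitespace after a sentence-ending character made optional
    # (\s* instead of \s+), since Chinese puts no space between sentences.
    # This regex tweak is an untested assumption.
    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \s*(?P<next_tok>\S+)     # or (optional) whitespace and the next token
        ))"""

# The boundary regexes are compiled from the lang_vars object, so it has to
# be supplied at construction time rather than patched onto the tokenizer.
tokenizer = punkt.PunktSentenceTokenizer(lang_vars=ChineseLanguageVars())
tokenizer.train(text)  # `text` and `input_str_cn` as in the code above
for s in tokenizer.tokenize(input_str_cn):
    print(s)

The other workaround mentioned above, inserting a space after each candidate end character before tokenizing and stripping it from the output, leaves the regexes alone at the cost of a pre- and post-processing pass.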
sent_end_charto be a potential sentence boundary if it is followed by either whitespace or punctuation (see_period_context_fmt). The absence of a whitespace character after “。” is sufficient for it to not be picked up.I have my doubts about the applicability of Punkt to Chinese. Does “。” not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?