punkt.PunktSentenceTokenizer() for Chinese

I use the following code to train Punkt for Chinese, but it doesn’t produce the desired result:

input_str_cn = "台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。"

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus 
import codecs

train_file = "D:/CL/comp_ling/data/dushu_1999_2008/1999.txt"
text = codecs.open(train_file, "r", "gb18030").read()

# Train tokenizer
tokenizer.sent_end_chars = ('!', '?', '。', '”')
for sent_end in tokenizer.sent_end_chars:
    print(sent_end)
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("chinese.pickle","wb")
pickle.dump(tokenizer, out)
out.close()

# To use the tokenizer
with open("chinese.pickle", "rb") as infile:
    tokenizer_new = pickle.load(infile)
sents = tokenizer_new.tokenize(input_str_cn)
for s in sents:
    print(s)

The output is as follows:

“台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。”

It seems that sent_end_chars does not work here. I have checked the encoding, and there is no problem with it. Could anyone help? Thanks.

Issue Analytics

  • State: open
  • Created 6 years ago
  • Reactions: 2
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
jnothman commented, Sep 4, 2017

I don’t think the space is going to make a difference except that our implementation expects it: you can either change _period_context_fmt, or add a space before processing and strip it afterwards. I’d be interested to hear if Punkt resolved any of the ambiguities in Chinese sentence boundaries.
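
For reference, a rough, untested sketch of that second workaround (pad a space after each terminator, tokenize, then strip the padding) might look like the following. It assumes the tokenizer is built with a small PunktLanguageVars subclass (named ChineseLangVars here purely for illustration) whose sent_end_chars lists the Chinese terminators, since the tokenizer looks sent_end_chars up on its lang_vars object, so assigning it on the tokenizer itself as in the question appears to have no effect; text and input_str_cn are the variables from the question.

import re
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class ChineseLangVars(PunktLanguageVars):
    # Terminators from the question; Punkt consults this attribute on its lang_vars.
    sent_end_chars = ('!', '?', '。', '”')

tokenizer = PunktSentenceTokenizer(lang_vars=ChineseLangVars())
tokenizer.train(text)  # "text" is the GB18030 corpus read in the question

# Pad a space after each terminator so the period-context regex, which
# expects whitespace or punctuation after a sent_end_char, can fire.
padded = re.sub(r'([!?。”])', r'\1 ', input_str_cn)

# Tokenize, then strip the padding back off each sentence.
for s in (sent.strip() for sent in tokenizer.tokenize(padded)):
    print(s)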

1 reaction
jnothman commented, Sep 4, 2017

Punkt here only considers a sent_end_char to be a potential sentence boundary if it is followed by either whitespace or punctuation (see _period_context_fmt). The absence of a whitespace character after “。” is sufficient for it to not be picked up.

I have my doubts about the applicability of Punkt to Chinese. Does “。” not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?
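
For completeness, the other route mentioned above, changing _period_context_fmt, might look roughly like the sketch below: a PunktLanguageVars subclass whose period-context pattern makes the whitespace after a potential sentence ending optional (\s* instead of \s+). This is an untested sketch, and given the doubts above about running Punkt on unsegmented Chinese text, it may still behave poorly; input_str_cn is the string from the question.

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class ChineseLangVars(PunktLanguageVars):
    sent_end_chars = ('!', '?', '。', '”')

    # Same shape as the stock NLTK pattern, except the whitespace after a
    # potential sentence ending is optional (\s* rather than \s+).
    _period_context_fmt = r"""
        \S*                           # some word material
        %(SentEndChars)s              # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s               # either other punctuation
            |
            \s*(?P<next_tok>\S+)      # or optional whitespace and the next token
        ))"""

tokenizer = PunktSentenceTokenizer(lang_vars=ChineseLangVars())
for s in tokenizer.tokenize(input_str_cn):
    print(s)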
