LanguageModel split fails when there are unrecognized characters
See original GitHub issue
Hi,
I am using LanguageModel.split with a word list for Mandarin Chinese built from these lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column with accents removed, file attached).
I have noticed this behaviour (xxx is a sequence of unrecognized characters):
>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
Expected output should be:
['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']
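The observed behaviour is consistent with the Zipf-cost dynamic-programming approach this kind of splitter uses: each candidate word is scored by a frequency-derived cost, substrings not in the word list get infinite cost, and cost ties are broken in favour of the shortest candidate, so an unrecognized run (and everything after it, once the running total is infinite) degrades to single characters. Here is a minimal self-contained sketch of that algorithm, not wordninja's actual code; the tiny word list is a stand-in for the attached pinyin file:

```python
import math

# Hypothetical tiny word list standing in for pinyin.txt.gz,
# ordered by descending frequency.
WORDS = ["beijing", "daibiao", "chu", "dai", "biao"]
# Zipf assumption: cost grows with rank; words absent from the list
# will get infinite cost at lookup time.
COST = {w: math.log((i + 1) * math.log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXWORD = max(len(w) for w in WORDS)

def split(s):
    # best[i] = (total cost, length of last word) for the prefix s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        # Try every word length k ending at position i. Unknown
        # substrings cost infinity, so all-infinite ties fall back to
        # k == 1 -- which is why unrecognized runs come out as
        # single characters.
        best.append(min(
            (best[i - k][0] + COST.get(s[i - k:i], float("inf")), k)
            for k in range(1, min(i, MAXWORD) + 1)
        ))
    out = []
    i = len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return list(reversed(out))

print(split("beijingdaibiaochu"))     # → ['beijing', 'daibiao', 'chu']
print(split("xxxbeijingdaibiaochu"))  # the whole string degrades to single chars
```

Because the running total is already infinite after the `xxx` prefix, every later candidate also scores infinity and the tie-break keeps choosing length 1, reproducing the single-character output reported above.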
Issue Analytics
- Created: 4 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
For English possessive forms to be split correctly, the word list must include the entry 's; after it becomes a separate token, a post-processing step reattaches it to the preceding word.
Also make sure your word list includes contractions, because many of those don’t end with 's. (You can find contractions grouped together near the end of the default list.)
Also, with another LanguageModel it does not rejoin "today ' s" into "today's":
>>> import wordninja
>>> text="I have today's appointment."
>>> text = " ".join(wordninja.split(text))
>>> print("output of wordninja:",text)
output of wordninja: I have today's appointment
>>> lm = wordninja.LanguageModel('./words-by-frequency_cp.txt.gz')
>>> text = " ".join(lm.split(text))
>>> print("output of new wordninja:",text)
output of new wordninja: I have today ' s appointment
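Until the custom model handles this, the rejoining step can be done by hand on the token list. This helper is illustrative, not part of wordninja; it assumes the splitter emits either a single "'s" token or a separate apostrophe followed by "s", as in the session above:

```python
def rejoin_possessives(tokens):
    """Glue possessive fragments back onto the preceding word."""
    out = []
    i = 0
    while i < len(tokens):
        # Case 1: the word list contained "'s", so it is one token.
        if tokens[i] == "'s" and out:
            out[-1] += "'s"
        # Case 2: the splitter emitted "'" and "s" as separate tokens.
        elif tokens[i] == "'" and i + 1 < len(tokens) and tokens[i + 1] == "s" and out:
            out[-1] += "'s"
            i += 1  # consume the trailing "s" as well
        else:
            out.append(tokens[i])
        i += 1
    return out

print(" ".join(rejoin_possessives(["I", "have", "today", "'", "s", "appointment"])))
# → I have today's appointment
```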