LanguageModel split fails when there are unrecognized characters

Hi

I am using LanguageModel.split with a word list for Mandarin Chinese built from these frequency lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column with accents removed; file attached)

pinyin.txt.gz

I have noticed the following behaviour (xxx stands for a sequence of unrecognized characters):

>>> import wordninja
>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']

The expected output for the last two calls is:

['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']
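
The expected behaviour can be illustrated with a small standalone splitter. This is only a sketch and not wordninja's implementation: `build_costs`, `split_keep_unknown`, and the `UNKNOWN_COST` value are hypothetical names chosen for this example, and the three-word vocabulary stands in for the attached pinyin list. The idea is to give every unrecognized character a large but finite penalty, so that one unknown run does not prevent the rest of the string from being matched, and then to merge adjacent unknown characters into a single token.

```python
from math import log

# Hypothetical penalty per unrecognized character. It must be finite so that
# costs after an unknown run can still be compared with each other.
UNKNOWN_COST = 50.0


def build_costs(words):
    """Map each word to a cost; words earlier in the list are cheaper."""
    return {w: log(i + 2) for i, w in enumerate(words)}


def split_keep_unknown(s, costs):
    """Split s into known words, keeping runs of unknown characters together."""
    maxlen = max(len(w) for w in costs)
    # best[i] = (total cost, length of last token, last token is a known word)
    best = [(0.0, 0, True)]
    for i in range(1, len(s) + 1):
        candidates = []
        for k in range(1, min(i, maxlen) + 1):
            piece = s[i - k:i]
            if piece in costs:
                candidates.append((best[i - k][0] + costs[piece], k, True))
        # Fallback: treat the current character as a one-character unknown token.
        candidates.append((best[i - 1][0] + UNKNOWN_COST, 1, False))
        best.append(min(candidates))

    # Walk back from the end of the string to recover the chosen tokens.
    tokens = []
    i = len(s)
    while i > 0:
        _, k, known = best[i]
        tokens.append((s[i - k:i], known))
        i -= k
    tokens.reverse()

    # Merge adjacent unknown characters into one token.
    out = []
    prev_unknown = False
    for tok, known in tokens:
        if not known and prev_unknown:
            out[-1] += tok
        else:
            out.append(tok)
        prev_unknown = not known
    return out


costs = build_costs(["beijing", "daibiao", "chu"])
print(split_keep_unknown("xxxbeijingdaibiaochu", costs))
print(split_keep_unknown("beijingxxxdaibiaochu", costs))
```

With this toy vocabulary, the two calls print the two expected lists shown above. The penalty only needs to be larger than the cost of any real word, so that a known word is always preferred over spelling it out character by character.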

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
srandal commented, Sep 5, 2019

For English possessive forms to be split correctly, the word list must include the entry 's. Once it has been split off as a separate token, a post-processing step reattaches it to the preceding word.

Also make sure your word list includes contractions, because many of those don’t end with 's. (You can find contractions grouped together near the end of the default list.)
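
The reattachment step described in this comment could look roughly like the following. This is a minimal sketch under the assumption that the splitter has already produced 's as its own token; `reattach_possessive` is a hypothetical helper written for illustration and is not taken from wordninja's source.

```python
def reattach_possessive(tokens):
    """Glue a standalone 's token back onto the token before it."""
    out = []
    for tok in tokens:
        if tok == "'s" and out:
            # Append the possessive marker to the previous word.
            out[-1] += tok
        else:
            out.append(tok)
    return out


print(reattach_possessive(["I", "have", "today", "'s", "appointment"]))
```

If 's is missing from the word list, the apostrophe and the s are not produced as a single token, so a step like this has nothing to reattach.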

0 reactions
nitindesaiiks commented, Sep 5, 2019

Also, with a custom LanguageModel it does not rejoin "today ' s" into "today's":

>>> import wordninja
>>> text="I have today's appointment."
>>> text = " ".join(wordninja.split(text))
>>> print("output of wordninja:",text)
output of wordninja: I have today's appointment
>>> lm = wordninja.LanguageModel('./words-by-frequency_cp.txt.gz')
>>> text = " ".join(lm.split(text))
>>> print("output of new wordninja:",text)
output of new wordninja: I have today ' s appointment
