LanguageModel split fails when there are unrecognized characters
See original GitHub issue
Hi,
I am using LanguageModel.split with a word list for Mandarin Chinese built from these lists: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists (3rd column with accents removed, file attached).
I have noticed this behaviour (xxx is a sequence of unrecognized characters):
>>> lm = wordninja.LanguageModel('pinyin.txt.gz')
>>> lm.split('beijingdaibiaochu')
['beijing', 'daibiao', 'chu']
>>> lm.split('xxxbeijingdaibiaochu')
['x', 'x', 'x', 'b', 'e', 'i', 'j', 'i', 'n', 'g', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
>>> lm.split('beijingxxxdaibiaochu')
['beijing', 'x', 'x', 'x', 'd', 'a', 'i', 'b', 'i', 'a', 'o', 'c', 'h', 'u']
Expected output should be:
['xxx', 'beijing', 'daibiao', 'chu']
['beijing', 'xxx', 'daibiao', 'chu']
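The observed behaviour is consistent with the Zipf-cost dynamic-programming approach this kind of splitter uses: each candidate word is scored by a frequency-derived cost, substrings not in the word list get infinite cost, and cost ties are broken in favour of the shortest candidate, so an unrecognized run (and everything after it, once the running total is infinite) degrades to single characters. Here is a minimal self-contained sketch of that algorithm, not wordninja's actual code; the tiny word list is a stand-in for the attached pinyin file:

```python
import math

# Hypothetical tiny word list standing in for pinyin.txt.gz,
# ordered by descending frequency.
WORDS = ["beijing", "daibiao", "chu", "dai", "biao"]
# Zipf assumption: cost grows with rank; words absent from the list
# will get infinite cost at lookup time.
COST = {w: math.log((i + 1) * math.log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXWORD = max(len(w) for w in WORDS)

def split(s):
    # best[i] = (total cost, length of last word) for the prefix s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        # Try every word length k ending at position i. Unknown
        # substrings cost infinity, so all-infinite ties fall back to
        # k == 1 -- which is why unrecognized runs come out as
        # single characters.
        best.append(min(
            (best[i - k][0] + COST.get(s[i - k:i], float("inf")), k)
            for k in range(1, min(i, MAXWORD) + 1)
        ))
    out = []
    i = len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return list(reversed(out))

print(split("beijingdaibiaochu"))     # → ['beijing', 'daibiao', 'chu']
print(split("xxxbeijingdaibiaochu"))  # the whole string degrades to single chars
```

Because the running total is already infinite after the `xxx` prefix, every later candidate also scores infinity and the tie-break keeps choosing length 1, reproducing the single-character output reported above.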
Issue Analytics
- Created: 4 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
For English possessive forms to be split correctly, the word list must include the entry 's; after it becomes a separate token, a post-processing step reattaches it to the preceding word.
Also make sure your word list includes contractions, because many of those don’t end with 's. (You can find contractions grouped together near the end of the default list.)
Also, with another LanguageModel it does not rejoin "today ' s" into "today's":
>>> import wordninja
>>> text="I have today's appointment."
>>> text = " ".join(wordninja.split(text))
>>> print("output of wordninja:",text)
output of wordninja: I have today's appointment
>>> lm = wordninja.LanguageModel('./words-by-frequency_cp.txt.gz')
>>> text = " ".join(lm.split(text))
>>> print("output of new wordninja:",text)
output of new wordninja: I have today ' s appointment
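Until the custom model handles this, the rejoining step can be done by hand on the token list. This helper is illustrative, not part of wordninja; it assumes the splitter emits either a single "'s" token or a separate apostrophe followed by "s", as in the session above:

```python
def rejoin_possessives(tokens):
    """Glue possessive fragments back onto the preceding word."""
    out = []
    i = 0
    while i < len(tokens):
        # Case 1: the word list contained "'s", so it is one token.
        if tokens[i] == "'s" and out:
            out[-1] += "'s"
        # Case 2: the splitter emitted "'" and "s" as separate tokens.
        elif tokens[i] == "'" and i + 1 < len(tokens) and tokens[i + 1] == "s" and out:
            out[-1] += "'s"
            i += 1  # consume the trailing "s" as well
        else:
            out.append(tokens[i])
        i += 1
    return out

print(" ".join(rejoin_possessives(["I", "have", "today", "'", "s", "appointment"])))
# → I have today's appointment
```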