Inconsistent example output

I installed 6.3.7 with pip as instructed on both the latest macOS and Ubuntu 16.04, and downloaded frequency_dictionary_en_82_765.txt from the official GitHub repository. I then ran the examples in README.md and got inconsistent output on both platforms, as follows:

Sample usage (lookup and lookup_compound)

The last number (log_prob_sum) is 11 instead of 10.

members, 226656153, 1
where is to love he had dated for much of the past who couldn't read in six grade and inspired him, 300000, 11

Sample usage (word_segmentation)

The first word, the, is segmented as t and he, which is an obvious error. Also, overt he should be over the. I noticed that the last two numbers differ from the expected 8 and -34.491167981910635.

t he quick brown fox jumps overt he lazy dog, 10, -52.10066239535173

Next, I tried to segment the test string itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness from the official site, and the output shows the same error pattern: the is segmented as t he.

Any ideas?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
mammothb commented, Mar 12, 2019

I see, I might have saved it as utf-8 when I was debugging the program and uploaded it as that.

load_dictionary allows you to choose the encoding, so you could use that as well.
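
As a minimal sketch of that suggestion, the helper below passes an explicit encoding through to load_dictionary. The helper name load_frequency_dictionary is hypothetical, term_index=0 and count_index=1 are assumed to match the column layout of the frequency file (term first, count second), and the encoding keyword is assumed to be available in the installed symspellpy version. The "utf-8-sig" codec is part of standard Python and drops a leading BOM if one is present.

```python
def load_frequency_dictionary(sym_spell, path, encoding="utf-8-sig"):
    """Load a frequency dictionary, decoding it with an explicit encoding.

    `sym_spell` is expected to be a symspellpy SymSpell instance.
    Assumption: its load_dictionary method accepts an `encoding` keyword,
    as described in the comment above. "utf-8-sig" strips a leading BOM
    when one is present and otherwise behaves like plain UTF-8.
    """
    return sym_spell.load_dictionary(
        path, term_index=0, count_index=1, encoding=encoding
    )
```

With symspellpy installed, this would be called with a SymSpell instance and the path to frequency_dictionary_en_82_765.txt.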

0 reactions
farleylai commented, Mar 11, 2019

I finally figured out the cause. It has nothing to do with the installation; it is an invisible character in the frequency dictionary that I downloaded from the original author wolfgarbe’s repo:

frequency_dictionary_en_82_765.txt

Running diff to compare the two files shows an invisible difference in line 1, which coincidentally is the entry for the:

1c1
< the 23135851162
---
> the 23135851162

I tried :set list in vim to show invisible characters, but nothing looked different. I then printed the first line, and it turns out that the Unicode byte order mark (BOM) is causing the issue, as discussed in the thread:

>>> f = open('frequency_dictionary_en_82_765.txt.wolfgarbe')
>>> f.readline()
'\ufeffthe 23135851162\n'
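
To illustrate why this single character matters, here is a minimal sketch that parses a two-line stand-in for the dictionary the way a simple loader might: split each line on whitespace and use the first field as the key. The parse_counts helper and the second sample line are hypothetical, and this is not symspellpy's actual loader. The point is only that a leading BOM becomes part of the first key, so the entry for the is stored under '\ufeffthe' and a lookup for 'the' misses, which would be consistent with the being split into t and he.

```python
import io


def parse_counts(stream):
    """Parse 'term count' lines into a dict (illustrative only)."""
    counts = {}
    for line in stream:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[0]] = int(parts[1])
    return counts


# Stand-in for the dictionary file saved with a UTF-8 BOM.
# The first line is taken from the thread; the second is made-up sample data.
raw = "\ufeffthe 23135851162\nof 100\n".encode("utf-8")

# Plain UTF-8 keeps the BOM as the first character of the first term.
with_bom = parse_counts(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"))
# utf-8-sig strips the BOM while decoding.
without_bom = parse_counts(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig"))

print("the" in with_bom)     # False: the key is '\ufeffthe'
print("the" in without_bom)  # True
```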

One workaround would be to set the encoding argument as follows:

>>> f = open('frequency_dictionary_en_82_765.txt.wolfgarbe', encoding='utf-8-sig')
>>> f.readline()
'the 23135851162\n'
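
To check a downloaded file before loading it, a small sketch like the following compares the first bytes of the file against codecs.BOM_UTF8. The function name has_utf8_bom is hypothetical.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

If this returns True for the dictionary file, opening it with encoding='utf-8-sig' as shown above avoids the stray '\ufeff' on the first term.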