
[RFC] Improvements to the Japanese model

See original GitHub issue

This project is impressive. I've tested it, and it's far more accurate than Tesseract; thank you for your effort.

There are a few issues I've noticed on some test cases. Some, I think, are caused by missing symbols; others are more specific to the Japanese language and could be improved with a dictionary (it looks like ja currently ships only a character list).

Is there any way I can help? I can certainly add the missing characters to the character list, and I'm also willing to build a dictionary if that would help disambiguate some words. But would I then have to wait for you to retrain the model on your side?

Here are my test cases:

1 - いや…あるというか…って→

https://0x0.st/ivNP.png

issues:

Missing → in the output; mistakes って for つて (fixable with a dictionary file, I think)

result:

([[208, 0], [628, 0], [628, 70], [208, 70]], 'あるというか:', 0.07428263872861862)
([[0, 1], [185, 1], [185, 69], [0, 69]], 'いや・', 0.2885110080242157)
([[3, 69], [183, 69], [183, 128], [3, 128]], 'つて', 0.4845466613769531)

2 - ♬〜(これが私の生きる道)

https://0x0.st/ivZC.png

issues:

  • Missing ♬〜
  • Mistakes これ for にれ
  • Mistakes が (ga) for か (ka)
  • Detects English parentheses () instead of Japanese fullwidth parentheses （）

([[1, 0], [125, 0], [125, 63], [1, 63]], ',~', 0.10811009258031845)
([[179, 0], [787, 0], [787, 66], [179, 66]], 'にれか私の生きる道)', 0.3134567439556122)

3 - (秀一)ああッ…もう⁉︎

https://0x0.st/ivZh.png

issues:

  • Mistakes small ッ for big ツ (similar to 1, but katakana instead of hiragana)
  • Mistakes … for ・・・

([[0, 0], [174, 0], [174, 64], [0, 64]], '(秀一)', 0.9035432934761047)
([[207, 0], [457, 0], [457, 64], [207, 64]], 'ああツ・・・', 0.35586389899253845)
([[481, 0], [668, 0], [668, 64], [481, 64]], 'もう!?', 0.4920879304409027)

4 - そっか

https://0x0.st/ivZ7.png

issues:

Mistakes そっか for そつか (fixable with a dictionary file, I think; similar to 1, and そっか is a really common word)

([[0, 0], [186, 0], [186, 60], [0, 60]], 'そつか', 0.9190227389335632)

5 - (久美子)うん ヘアピンのお礼

https://0x0.st/ivZR.png

issues:

  • Mistakes の for 0 (not sure how to fix this one, but it's pretty important; の seems to be properly recognized in my test case 2)
  • Mistakes ヘアピン for へアピソ (probably fixable by dictionary)

([[0, 0], [238, 0], [238, 72], [0, 72]], '(久美子)', 0.9745591878890991)
([[268, 0], [396, 0], [396, 70], [268, 70]], 'うん', 0.5724520087242126)
([[22, 60], [454, 60], [454, 132], [22, 132]], 'へアピソ0お礼', 0.25971919298171997)
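Several of these cases point at the same dictionary-based fix. Here is a minimal sketch of how that post-correction could work, assuming a tiny stand-in word list and an invented confusion table (neither is part of EasyOCR):

```python
from itertools import product

# Hypothetical word list; the real one would be the full ja.txt.
WORDS = {"そっか", "って", "ヘアピン", "これ"}

# Glyph pairs the model tends to confuse (invented table for illustration).
CONFUSIONS = {"つ": "っ", "ソ": "ン", "へ": "ヘ", "に": "こ"}

def candidates(token: str):
    """Yield every spelling reachable by swapping confusable glyphs."""
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)

def correct(token: str) -> str:
    """Keep a known token; otherwise return the first dictionary hit."""
    if token in WORDS:
        return token
    for cand in candidates(token):
        if cand in WORDS:
            return cand
    return token  # no dictionary evidence; leave the OCR output alone

print(correct("そつか"))   # -> そっか
print(correct("へアピソ"))  # -> ヘアピン
```

A real version would load the full word list and could rank candidates by word frequency instead of taking the first hit.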

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

1 reaction
pigoz commented, Jul 16, 2020

> Set 1: dictionary problem (って for つて, そっか for そつか, ヘアピン for へアピソ)
>
> This is my fault for not knowing Japanese well enough. When I saw that Japanese has several thousand characters, I assumed it was one character per word. I just learned from this issue that that's not the case.
>
> Solution: I need a word list, ja.txt. Please create a pull request or ask people in the Japanese community; I'm sure they have a word list somewhere. Also check for missing characters in ja_char.txt. I'll have to retrain the model later.

Great, I will write a script that builds a word list from the IPAdic dataset. I'll inspect it and decide whether it makes sense to add some common verb conjugations that could help with the って/つて disambiguation, and then open a PR.
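A rough sketch of what such a word-list script might look like, assuming the IPAdic CSV layout (surface form in the first column, EUC-JP encoding) and a hypothetical mecab-ipadic checkout path:

```python
# Build a ja.txt word list from IPAdic's CSV files. The directory name,
# the EUC-JP encoding, and "surface form in column 0" are assumptions
# about the IPAdic distribution, not tested against it here.
import csv
import glob

surfaces = set()
for path in glob.glob("mecab-ipadic/*.csv"):              # hypothetical checkout path
    with open(path, encoding="euc-jp", newline="") as f:  # IPAdic ships as EUC-JP
        for row in csv.reader(f):
            if row:
                surfaces.add(row[0])                      # column 0 = surface form

with open("ja.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(surfaces)))
```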

> Set 2: missing characters (♬, 〜, Japanese parentheses)
>
> Currently all languages support the following symbols:
>
> https://github.com/JaidedAI/EasyOCR/blob/b626a255b769f5ae204072e222ce528c784a746d/easyocr/easyocr.py#L40-L41
>
> There's a request for 〜. I can agree to add 〜 and …. For ♬ and Japanese parentheses, I think we have to pass; otherwise I will have to add a lot more.

Oh nice, thanks for the character list. I can understand not adding Japanese parentheses. ♬ is fairly common in subtitles though (even non-Japanese ones).

Looking at the character list you linked, I can tell it's missing the Japanese quotes. Japanese writing uses 「something」 instead of quotes like "something".

I wonder if it would make sense to add these special Japanese symbols to character/ja.txt, since strings like 「something」 obviously won't come up in non-Japanese text. Off the top of my head, there's a bunch of fairly common Japanese-specific symbols: !?「」。・…『』→♬. I could prepare a PR for that.
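A small sketch of how those symbols could be merged into the character file; the one-character-per-line layout mirrors EasyOCR's character/*.txt files, but treat that layout as an assumption:

```python
# Merge extra symbols into a one-character-per-line character list,
# preserving existing order and skipping anything already present.
EXTRA_SYMBOLS = list("!?「」。・…『』→♬")

def merge_chars(existing: str, extra: list) -> str:
    """Append new symbols to a character list; keep order, no duplicates."""
    chars = existing.splitlines()
    seen = set(chars)
    for ch in extra:
        if ch not in seen:
            chars.append(ch)
            seen.add(ch)
    return "\n".join(chars)

# Example: '!' is already present, so only the new symbols get appended.
print(merge_chars("あ\nい\n!", EXTRA_SYMBOLS))
```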

1 reaction
rkcosmos commented, Jul 16, 2020

Thanks, analysis like this is very helpful for further improvement. Let's go through each issue.

Set 1: dictionary problem (って for つて, そっか for そつか, ヘアピン for へアピソ)

This is my fault for not knowing Japanese well enough. When I saw that Japanese has several thousand characters, I assumed it was one character per word. I just learned from this issue that that's not the case.

Solution: I need a word list, ja.txt. Please create a pull request or ask people in the Japanese community; I'm sure they have a word list somewhere. Also check for missing characters in ja_char.txt. I'll have to retrain the model later.

Set 2: missing characters (♬, 〜, Japanese parentheses)

Currently all languages support the following symbols:

https://github.com/JaidedAI/EasyOCR/blob/b626a255b769f5ae204072e222ce528c784a746d/easyocr/easyocr.py#L40-L41

There's a request for 〜. I can agree to add 〜 and …. For ♬ and Japanese parentheses, I think we have to pass; otherwise I will have to add a lot more.

Solution: You don’t have to do anything. I’ll add more symbols later. It’s just not supported now.

Set 3: others (の vs 0)

Solution 1: Once we have a word list, it's possible to build a language model (bi-gram, tri-gram, etc.) and integrate it into the prediction process, so the model is aware of the surrounding characters when predicting each one.

Solution 2: I plan to apply a manual probability bias that favors adjacent characters from the same set. For example, if the surrounding characters are numbers, the middle one should be a number, not Japanese or English. This would also solve problems like 7o0 instead of 700 in other languages.
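Solution 2 can be sketched as a toy post-processing pass; the confusion table and the neighbor rule below are invented for illustration, and the real fix would bias probabilities inside the decoder rather than rewrite the output:

```python
# Toy sketch of the "same character set" bias applied as post-processing:
# a character surrounded by a different script is swapped for its look-alike
# in the neighbors' script (e.g. 7o0 -> 700, stray 0 between kana -> の).
DIGIT_LOOKALIKES = {"o": "0", "O": "0", "l": "1", "I": "1"}

def script_of(ch: str) -> str:
    """Classify a character as digit, kana, or other."""
    if ch.isdigit():
        return "digit"
    if "\u3040" <= ch <= "\u30ff":  # hiragana + katakana blocks
        return "kana"
    return "other"

def fix_by_context(text: str) -> str:
    out = list(text)
    for i, ch in enumerate(out):
        neighbors = [out[j] for j in (i - 1, i + 1) if 0 <= j < len(out)]
        if not neighbors:
            continue
        ctx = {script_of(n) for n in neighbors}
        # Surrounded by digits but not a digit itself: use the digit look-alike.
        if ctx == {"digit"} and script_of(ch) != "digit" and ch in DIGIT_LOOKALIKES:
            out[i] = DIGIT_LOOKALIKES[ch]
        # Surrounded by kana: a stray '0' was probably a misread 'の'.
        elif ctx == {"kana"} and ch == "0":
            out[i] = "の"
    return "".join(out)

print(fix_by_context("7o0"))          # -> 700
print(fix_by_context("ヘアピン0お礼"))  # -> ヘアピンのお礼
```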
