[RFC] Improvements to the Japanese model
This project is impressive. I've tested it and it's far more precise than Tesseract, thank you for your effort.
There are a few issues I've noticed in my test cases: some, I think, are caused by missing symbols; others are more specific to the Japanese language and could be improved with the context of a dictionary (it looks like ja only has a character list right now).
Is there any way I can help you? I can certainly add the missing characters to the character list, and I'm also willing to build a dictionary if that could help disambiguate some words. But would I have to wait for you to retrain the model on your side?
Here are my test cases:
1 - いや…あるというか…って→
issues:
- missing …
- missing →
- mistakes って for つて (fixable with a dictionary file, I think)
result:
([[208, 0], [628, 0], [628, 70], [208, 70]], 'あるというか:', 0.07428263872861862)
([[0, 1], [185, 1], [185, 69], [0, 69]], 'いや・', 0.2885110080242157)
([[3, 69], [183, 69], [183, 128], [3, 128]], 'つて', 0.4845466613769531)
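For reference, each tuple above is the (bounding box, text, confidence) result returned by EasyOCR's readtext call. A minimal sketch of how I'm running it, assuming the screenshot is saved as test1.png (a placeholder filename):

```python
# Minimal sketch: run the Japanese model on one screenshot.
import easyocr

reader = easyocr.Reader(['ja'])  # Japanese model
for bbox, text, confidence in reader.readtext('test1.png'):
    print((bbox, text, confidence))
```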
2 - ♬〜(これが私の生きる道)
issues:
- missing ♬〜
- mistakes (これ for にれ
- mistakes が (ga) for か (ka)
- detects English parens () instead of Japanese parens （）
result:
([[1, 0], [125, 0], [125, 63], [1, 63]], ',~', 0.10811009258031845)
([[179, 0], [787, 0], [787, 66], [179, 66]], 'にれか私の生きる道)', 0.3134567439556122)
3 - (秀一)ああッ…もう⁉︎
issues:
- mistakes small ッ for big ツ (similar to 1, but katakana instead of hiragana)
- mistakes … for ・・・
result:
([[0, 0], [174, 0], [174, 64], [0, 64]], '(秀一)', 0.9035432934761047)
([[207, 0], [457, 0], [457, 64], [207, 64]], 'ああツ・・・', 0.35586389899253845)
([[481, 0], [668, 0], [668, 64], [481, 64]], 'もう!?', 0.4920879304409027)
4 - そっか
issues:
- mistakes そっか for そつか (fixable with a dictionary file, I think) (similar to 1; そっか is a really common word)
result:
([[0, 0], [186, 0], [186, 60], [0, 60]], 'そつか', 0.9190227389335632)
5 - (久美子)うん ヘアピンのお礼
issues:
mistakes の
for 0
(not sure how to fix this one, but it’s pretty important – seems like の
is properly recognized in my test case 2)
mistakes ヘアピン
for へアピソ
(fixable by dictionary probably)
([[0, 0], [238, 0], [238, 72], [0, 72]], '(久美子)', 0.9745591878890991)
([[268, 0], [396, 0], [396, 70], [268, 70]], 'うん', 0.5724520087242126)
([[22, 60], [454, 60], [454, 132], [22, 132]], 'へアピソ0お礼', 0.25971919298171997)
Great, I will make a script that creates a word list starting from the IPAdic dataset. I will inspect it and decide whether it makes sense to add some popular verb conjugations that could help with the って/つて disambiguation, and then create a PR.
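A minimal sketch of that script, assuming the mecab-ipadic CSV files (surface form in the first column; the upstream distribution is EUC-JP encoded) are unpacked under ipadic/:

```python
# Sketch: collect unique surface forms from the mecab-ipadic CSVs
# into a word list file.
import csv
import glob

words = set()
for path in glob.glob('ipadic/*.csv'):
    with open(path, encoding='euc-jp') as f:
        for row in csv.reader(f):
            words.add(row[0])  # surface form, e.g. って, そっか

with open('ja.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(sorted(words)))
```

Whether the popular conjugated forms actually end up in the list is the part I'd inspect before opening the PR.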
Oh nice, thanks for the character list. I can understand not adding the Japanese parentheses, but ♬ is fairly common in subtitles (even non-Japanese ones). Looking at the character list you linked, I can tell it's missing the Japanese quotes: in Japanese writing, 「something」 is used instead of "something". I wonder if it could make sense to add these special Japanese symbols to character/ja.txt, because strings like 「something」 obviously will not come up in non-Japanese scripts. Off the top of my head, there's a bunch of fairly common Japanese-specific symbols: !?「」。・…『』→♬. I could prepare a PR for that.
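For what it's worth, a quick check along these lines could list which of those symbols the current character list is missing (a sketch; the exact path character/ja_char.txt is an assumption):

```python
# Sketch: report proposed symbols absent from the existing character list.
proposed = set('!?「」。・…『』→♬')
with open('character/ja_char.txt', encoding='utf-8') as f:
    existing = set(f.read())  # char-level, so layout doesn't matter
print(sorted(proposed - existing))
```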
Thanks, analysis like this is very helpful for further improvement. Let's go through each issue.
Set 1: dictionary problem (って for つて, そっか for そつか, ヘアピン for へアピソ)

This is my fault for not knowing Japanese well enough. When I saw that Japanese has several thousand characters, I assumed it was one character per word; I just learned from this issue that that's not the case.

Solution: I need a word list, ja.txt. Please create a pull request or ask people in the Japanese community; I'm sure they have a word list somewhere. Also check for missing characters in ja_char.txt. I'll have to retrain the model later.
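To make the idea concrete, here is a rough sketch (hypothetical helper, not the actual integration point in the recognizer) of how a word list could resolve the small-kana confusions above:

```python
# Hypothetical post-processing sketch: prefer the small-kana variant
# when the word list contains it but not the raw OCR output.
SMALL = {'つ': 'っ', 'ツ': 'ッ', 'や': 'ゃ', 'ゆ': 'ゅ', 'よ': 'ょ'}

def correct(token: str, words: set) -> str:
    if token in words:
        return token
    # Try swapping each big kana for its small form, one position at a time.
    for i, ch in enumerate(token):
        if ch in SMALL:
            candidate = token[:i] + SMALL[ch] + token[i + 1:]
            if candidate in words:
                return candidate
    return token

words = {'って', 'そっか'}
assert correct('つて', words) == 'って'
assert correct('そつか', words) == 'そっか'
```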
Set 2: missing characters (…, →, ♬, Japanese parentheses)

Currently all languages support the following symbols:
https://github.com/JaidedAI/EasyOCR/blob/b626a255b769f5ae204072e222ce528c784a746d/easyocr/easyocr.py#L40-L41
There's a request for €. I can agree to add … and →. For ♬ and the Japanese parentheses, I think we have to pass, otherwise I would have to add a lot more.

Solution: You don't have to do anything. I'll add the new symbols later; they're just not supported right now.
Set 3: others (の vs 0)

Solution 1: Once we have a word list, it's possible to build a language model (bi-gram, tri-gram, etc.) and integrate it into the prediction process, so the model is aware of the surrounding characters when predicting each one.

Solution 2: I plan to apply a manual probability bias that favors adjacent characters from the same character set. For example, if the surrounding characters are digits, the middle one should be a digit, not a Japanese or English character. This would also fix problems like 7o0 instead of 700 in other languages.
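A rough sketch of Solution 2 (hypothetical function names, not the actual implementation), biasing per-position candidates toward the script of their neighbours:

```python
# Sketch: after the model emits top-k candidates for a position, nudge
# probabilities toward the script of the neighbouring characters, so
# の wins over 0 when embedded in Japanese text.
import unicodedata

def script_of(ch: str) -> str:
    name = unicodedata.name(ch, '')
    if 'HIRAGANA' in name or 'KATAKANA' in name or 'CJK' in name:
        return 'ja'
    if ch.isdigit():
        return 'digit'
    return 'other'

def rescore(candidates, left: str, right: str, bias: float = 1.5):
    """candidates: list of (char, prob) for one position."""
    context = {script_of(left), script_of(right)}
    rescored = [(ch, p * (bias if script_of(ch) in context else 1.0))
                for ch, p in candidates]
    return max(rescored, key=lambda cp: cp[1])[0]

# の is slightly behind 0, but the Japanese neighbours tip the balance:
print(rescore([('0', 0.45), ('の', 0.40)], left='ン', right='お'))  # の
```

An n-gram model from Solution 1 would slot into the same rescoring step, multiplying in a probability estimated from the word list instead of a fixed bias.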