Problem in bunsetu spanning
When I parsed the sentence “彼によって行われる” with GiNZA v4.0.1, I got the following result:
# text = 彼によって行われる
1 彼 彼 PRON 代名詞 _ 5 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=カレ
2 に に ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ニ
3 よっ よる VERB 動詞-一般 _ 2 fixed _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Inf=五段-ラ行,連用形-促音便|Reading=ヨッ
4 て て SCONJ 助詞-接続助詞 _ 2 fixed _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=テ
5 行わ 行う VERB 動詞-一般 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Inf=五段-ワア行,未然形-一般|Reading=オコナワ
6 れる れる AUX 助動詞 _ 5 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-レル,終止形-一般|Reading=レル
This result indicates that the sentence contains two bunsetu spans, “彼に” and “よって行われる”, whereas the correct spans are “彼によって” and “行われる”.
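As a small illustration of how the spans follow from the BunsetuBILabel values above, here is a minimal sketch that groups tokens by their B/I labels. The helper name `bunsetu_spans` and the hand-copied label lists are only for illustration; they are not part of GiNZA.

```python
def bunsetu_spans(tokens, bi_labels):
    """Group token surfaces into bunsetu strings using B/I labels."""
    spans = []
    for tok, label in zip(tokens, bi_labels):
        if label == "B" or not spans:
            spans.append(tok)   # "B" opens a new bunsetu
        else:
            spans[-1] += tok    # "I" continues the current bunsetu
    return spans


tokens = ["彼", "に", "よっ", "て", "行わ", "れる"]
observed = ["B", "I", "B", "I", "I", "I"]  # labels copied from the output above
expected = ["B", "I", "I", "I", "B", "I"]  # labels I would expect instead

print(bunsetu_spans(tokens, observed))
print(bunsetu_spans(tokens, expected))
```

The only difference between the two label sequences is where the second “B” falls: on “よっ” in the observed output, on “行わ” in the expected one.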
The problem seems to occur because the bunsetu spanning process in BunsetuRecognizer.__call__()
checks only right children for each bunsetu head token, not all right descendants iteratively.
I attach a patch that I believe fixes it.
--- ginza/bunsetu_recognizer.py~ 2020-08-31 22:02:47.000000000 +0900
+++ ginza/bunsetu_recognizer.py 2020-09-03 11:08:32.000000000 +0900
@@ -193,13 +193,20 @@
             t = doc[head_i]
             if next_start < len(bunsetu_bi):
                 bunsetu_bi[next_start] = "B"
-            right = t
-            for sub in t.rights:
-                if heads[sub.i]:
-                    right = right.right_edge
-                    break
-                right = sub
-            next_start = right.i + 1
+            rights = [t]
+            while len(rights):
+                right = rights.pop(0)
+                for sub in right.rights:
+                    if heads[sub.i]:
+                        next_start = right.right_edge.i + 1
+                        break
+                    else:
+                        rights.append(sub)
+                else:
+                    continue
+                break
+            else:
+                next_start = t.right_edge.i + 1
         doc.user_data["bunsetu_heads"] = bunsetu_heads
         doc.user_data["bunsetu_bi_labels"] = bunsetu_bi
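To make the difference between the two strategies concrete, here is a toy sketch that runs without spaCy. It is a simplified illustration, not the patch itself: the tree is a plain dict of right children taken from the parse above (0-based token indices), and the names `rights`, `is_head`, `right_edge`, `next_start_children_only` and `next_start_descendants` are all hypothetical.

```python
from collections import deque

# Right children per token index, taken from the parse of 彼によって行われる:
# 彼(0) -> に(1); に(1) -> よっ(2), て(3); 行わ(4) -> れる(5)
rights = {0: [1], 1: [2, 3], 4: [5]}
# Bunsetu heads in this sentence: 彼(0) and 行わ(4)
is_head = [True, False, False, False, True, False]


def right_edge(node, rights):
    """Rightmost token of node's subtree (follows the last right child)."""
    while rights.get(node):
        node = rights[node][-1]
    return node


def next_start_children_only(t, rights, is_head):
    """Mimics the pre-patch loop: inspects only direct right children of t."""
    right = t
    for sub in rights.get(t, []):
        if is_head[sub]:
            right = right_edge(right, rights)
            break
        right = sub
    return right + 1


def next_start_descendants(t, rights, is_head):
    """Breadth-first walk over all right descendants of t,
    stopping at children that are themselves bunsetu heads."""
    last = t
    queue = deque([t])
    while queue:
        node = queue.popleft()
        for sub in rights.get(node, []):
            if is_head[sub]:
                break  # from here on the tokens belong to another bunsetu
            last = max(last, sub)
            queue.append(sub)
    return last + 1


print(next_start_children_only(0, rights, is_head))
print(next_start_descendants(0, rights, is_head))
```

For the head “彼”, the children-only version stops at “に” and places the next bunsetu start on “よっ” (index 2), which reproduces the reported output. The descendant walk also reaches “よっ” and “て”, so the next bunsetu starts on “行わ” (index 4).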
Top GitHub Comments
Judging from the attached diff, the revised code seems to work much more appropriately than the previous revision. Thanks a lot. I’ll report back if I find any other problems.
@tadashikumano Thank you for reporting. I implemented the recursive algorithm above once before releasing v4.0.0, but found some side effects. I’d like to reevaluate this approach and try to improve overall accuracy.