Problem in bunsetu spanning

When I parsed the sentence “彼によって行われる” with GiNZA v4.0.1, I got the following result:

# text = 彼によって行われる
1	彼	彼	PRON	代名詞	_	5	obl	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=カレ
2	に	に	ADP	助詞-格助詞	_	1	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ニ
3	よっ	よる	VERB	動詞-一般	_	2	fixed	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Inf=五段-ラ行,連用形-促音便|Reading=ヨッ
4	て	て	SCONJ	助詞-接続助詞	_	2	fixed	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=テ
5	行わ	行う	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Inf=五段-ワア行,未然形-一般|Reading=オコナワ
6	れる	れる	AUX	助動詞	_	5	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-レル,終止形-一般|Reading=レル

This result indicates that the sentence contains two bunsetu spans, “彼に” and “よって行われる”, whereas the correct spans are “彼によって” and “行われる”.
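
To make the reading of the BI labels explicit, here is a minimal sketch that groups the tokens by their BunsetuBILabel values. It is not GiNZA code: the name `bunsetu_spans` and the `LABELED_TOKENS` list are hypothetical, and the pairs are copied by hand from the output above.

```python
# (surface form, BunsetuBILabel) pairs copied from the CoNLL-U output above.
LABELED_TOKENS = [
    ("彼", "B"),
    ("に", "I"),
    ("よっ", "B"),
    ("て", "I"),
    ("行わ", "I"),
    ("れる", "I"),
]


def bunsetu_spans(labeled_tokens):
    """Concatenate tokens into spans, starting a new span at every "B" label."""
    spans = []
    for text, label in labeled_tokens:
        if label == "B" or not spans:
            spans.append(text)   # "B" opens a new bunsetu
        else:
            spans[-1] += text    # "I" continues the current bunsetu
    return spans


print(bunsetu_spans(LABELED_TOKENS))  # ['彼に', 'よって行われる']
```

Moving the second "B" from よっ to 行わ would give the expected spans, 彼によって and 行われる.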

The problem seems to occur because the bunsetu spanning process in BunsetuRecognizer.__call__() checks only the right children of each bunsetu head token, rather than iterating over all of its right descendants.

I attach a patch that I believe fixes it.

--- ginza/bunsetu_recognizer.py~	2020-08-31 22:02:47.000000000 +0900
+++ ginza/bunsetu_recognizer.py	2020-09-03 11:08:32.000000000 +0900
@@ -193,13 +193,20 @@
             t = doc[head_i]
             if next_start < len(bunsetu_bi):
                 bunsetu_bi[next_start] = "B"
-            right = t
-            for sub in t.rights:
-                if heads[sub.i]:
-                    right = right.right_edge
-                    break
-                right = sub
-            next_start = right.i + 1
+            rights = [t]
+            while len(rights):
+                right = rights.pop(0)
+                for sub in right.rights:
+                    if heads[sub.i]:
+                        next_start = right.right_edge.i + 1
+                        break
+                    else:
+                        rights.append(sub)
+                else:
+                    continue
+                break
+            else:
+                next_start = t.right_edge.i + 1
 
         doc.user_data["bunsetu_heads"] = bunsetu_heads
         doc.user_data["bunsetu_bi_labels"] = bunsetu_bi
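
The difference between the two loops in the diff above can be reproduced on a toy tree without loading GiNZA. The sketch below is an illustration under stated assumptions, not the library's code: `HEAD_OF`, `IS_BUNSETU_HEAD`, `rights`, `right_edge`, and the two `next_start_*` functions are hypothetical stand-ins for spaCy's `Token.rights`, `Token.right_edge`, and the `heads` list used in the patch, and it assumes that 彼 and 行わ are the bunsetu heads.

```python
from collections import deque

# Toy dependency tree for 彼によって行われる, with 0-based head indices taken
# from the CoNLL-U output in the report (the root points to itself).
TOKENS = ["彼", "に", "よっ", "て", "行わ", "れる"]
HEAD_OF = [4, 0, 1, 1, 4, 4]
# Assumption: 彼 and 行わ are the bunsetu heads.
IS_BUNSETU_HEAD = [True, False, False, False, True, False]


def rights(i):
    """Right children of token i (direct dependents that follow it)."""
    return [j for j in range(i + 1, len(HEAD_OF)) if HEAD_OF[j] == i]


def right_edge(i):
    """Index of the rightmost token in the subtree rooted at token i."""
    edge = i
    for j in range(len(HEAD_OF)):
        if j != i and HEAD_OF[j] == i:
            edge = max(edge, right_edge(j))
    return edge


def next_start_children_only(head):
    """Mirrors the removed lines: only direct right children are inspected."""
    right = head
    for sub in rights(head):
        if IS_BUNSETU_HEAD[sub]:
            right = right_edge(right)
            break
        right = sub
    return right + 1


def next_start_all_descendants(head):
    """Mirrors the added lines: breadth-first walk over right descendants."""
    queue = deque([head])
    while queue:
        right = queue.popleft()
        for sub in rights(right):
            if IS_BUNSETU_HEAD[sub]:
                return right_edge(right) + 1
            queue.append(sub)
    # No bunsetu head was found among the right descendants.
    return right_edge(head) + 1


print("".join(TOKENS[:next_start_children_only(0)]))    # 彼に
print("".join(TOKENS[:next_start_all_descendants(0)]))  # 彼によって
```

In this toy tree, よっ and て hang off に rather than 彼, so a loop that stops at the direct right children of 彼 ends the first bunsetu after に, while the walk over all right descendants extends it through て.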

Top GitHub Comments

tadashikumano commented, Sep 3, 2020

From examining your attached diff, the revised code seems to work much more appropriately than the previous revision. Thanks a lot. I’ll let you know if I find any other problems.

hiroshi-matsuda-rit commented, Sep 3, 2020

@tadashikumano Thank you for reporting. I implemented the recursive algorithm above once before releasing v4.0.0, but found some side effects. I’d like to reevaluate this approach and try to improve overall accuracy.
