Problem in bunsetu spanning


When I parsed the sentence “彼によって行われる” (“is performed by him”) with GiNZA v4.0.1, I got the following result:

# text = 彼によって行われる
1	彼	彼	PRON	代名詞	_	5	obl	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=カレ
2	に	に	ADP	助詞-格助詞	_	1	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ニ
3	よっ	よる	VERB	動詞-一般	_	2	fixed	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Inf=五段-ラ行,連用形-促音便|Reading=ヨッ
4	て	て	SCONJ	助詞-接続助詞	_	2	fixed	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=テ
5	行わ	行う	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Inf=五段-ワア行,未然形-一般|Reading=オコナワ
6	れる	れる	AUX	助動詞	_	5	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-レル,終止形-一般|Reading=レル

This result indicates that the sentence contains two bunsetu spans, “彼に” and “よって行われる”. The correct spans are “彼によって” and “行われる”.
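
To make that reading explicit, here is a minimal sketch that groups tokens into spans from their BunsetuBILabel values. The helper name `spans_from_bi` is hypothetical and not part of GiNZA; the observed labels are copied from the output above, and the expected labels are simply what the correct segmentation would imply.

```python
def spans_from_bi(tokens, labels):
    """Join tokens into spans; a "B" label opens a new span, "I" extends it."""
    spans = []
    for token, label in zip(tokens, labels):
        if label == "B" or not spans:
            spans.append(token)
        else:
            spans[-1] += token
    return spans


TOKENS = ["彼", "に", "よっ", "て", "行わ", "れる"]
OBSERVED = ["B", "I", "B", "I", "I", "I"]   # labels in the GiNZA v4.0.1 output
EXPECTED = ["B", "I", "I", "I", "B", "I"]   # labels for the intended spans

print(spans_from_bi(TOKENS, OBSERVED))
print(spans_from_bi(TOKENS, EXPECTED))
```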

The problem seems to occur because the bunsetu spanning process in BunsetuRecognizer.__call__() checks only the direct right children of each bunsetu head token, rather than iterating over all of its right descendants.

I attach a patch that I believe fixes it.

--- ginza/bunsetu_recognizer.py~	2020-08-31 22:02:47.000000000 +0900
+++ ginza/bunsetu_recognizer.py	2020-09-03 11:08:32.000000000 +0900
@@ -193,13 +193,20 @@
             t = doc[head_i]
             if next_start < len(bunsetu_bi):
                 bunsetu_bi[next_start] = "B"
-            right = t
-            for sub in t.rights:
-                if heads[sub.i]:
-                    right = right.right_edge
-                    break
-                right = sub
-            next_start = right.i + 1
+            rights = [t]
+            while len(rights):
+                right = rights.pop(0)
+                for sub in right.rights:
+                    if heads[sub.i]:
+                        next_start = right.right_edge.i + 1
+                        break
+                    else:
+                        rights.append(sub)
+                else:
+                    continue
+                break
+            else:
+                next_start = t.right_edge.i + 1
 
         doc.user_data["bunsetu_heads"] = bunsetu_heads
         doc.user_data["bunsetu_bi_labels"] = bunsetu_bi
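
The difference between the two versions of the loop can be sketched on a toy copy of the dependency tree from the output above. This is not GiNZA code: `rights` and `right_edge` are simplified stand-ins for the spaCy token attributes of the same names, the two `next_start_*` functions are hypothetical names, and the choice of 彼 and 行わ as bunsetu heads is an assumption made for illustration.

```python
from collections import deque

# Dependency heads from the CoNLL-U output, as 0-based indices (-1 = root).
TOKENS = ["彼", "に", "よっ", "て", "行わ", "れる"]
HEAD_OF = [4, 0, 1, 1, -1, 4]
# Assumption: 彼 and 行わ are the bunsetu heads.
IS_BUNSETU_HEAD = [True, False, False, False, True, False]


def rights(i):
    """Direct children of token i that lie to its right."""
    return [j for j, h in enumerate(HEAD_OF) if h == i and j > i]


def right_edge(i):
    """Index of the rightmost token in the subtree rooted at i."""
    edge = i
    for j, h in enumerate(HEAD_OF):
        if h == i:
            edge = max(edge, right_edge(j))
    return edge


def next_start_children_only(i):
    """Pre-patch logic: inspect only the direct right children of i."""
    right = i
    for sub in rights(i):
        if IS_BUNSETU_HEAD[sub]:
            right = right_edge(right)
            break
        right = sub
    return right + 1


def next_start_all_descendants(i):
    """Patched logic: breadth-first walk over all right descendants of i."""
    queue = deque([i])
    while queue:
        right = queue.popleft()
        for sub in rights(right):
            if IS_BUNSETU_HEAD[sub]:
                return right_edge(right) + 1
            queue.append(sub)
    return right_edge(i) + 1


print(next_start_children_only(0), next_start_all_descendants(0))
```

Under these assumptions, the children-only version returns 2 for 彼, so the next bunsetu starts at よっ, while the descendant walk returns 4, so it starts at 行わ.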

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

tadashikumano commented, Sep 3, 2020 (3 reactions)

From examining your attached diff, the revised code seems to work much more appropriately than the previous revision. Thanks a lot. I’ll let you know if I find any other problems.

hiroshi-matsuda-rit commented, Sep 3, 2020 (1 reaction)

@tadashikumano Thank you for reporting. I implemented the recursive algorithm above once before releasing v4.0.0, but found some side effects. I’d like to reevaluate this approach and try to improve overall accuracy.
