len(start_token) != 1 in parse_ace_event.py
When preprocessing the ACE 2005 dataset via parse_ace_event.py, I got the following traceback:
```
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(780)<module>() -> main()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(776)main() -> include_pronouns=args.include_pronouns)
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(750)one_fold() -> js = document.to_json()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(726)to_json() -> js = doc.to_json()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(229)to_json() -> self.remove_whitespace()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(216)remove_whitespace() -> entry.remove_whitespace()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(160)remove_whitespace() -> self.align()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(152)align() -> entity.align(self.sent)
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(43)align() -> self.span_sentence = get_token_indices(self, sent.as_doc())
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(248)get_token_indices() -> debug_if(len(start_token) != 1)
```
As you can see here, len(start_token) != 1.
The bug arises here:
```python
def get_token_indices(entity, sent):
    start_token = [tok for tok in sent if tok.idx == entity.start_char]
```
The reason is that when sent.as_doc() is fed into get_token_indices(), tok.idx is counted from the beginning of that sentence, while entity.start_char is counted from the start of the whole document, so start_token ends up as [].
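For illustration, here is a minimal, self-contained sketch of the mismatch (the text, token indices, and the workaround of shifting the offset by the Span's start_char are my own assumptions, not the repo's actual data or intended fix):

```python
import spacy

nlp = spacy.blank("en")  # plain tokenizer; no model download needed
doc = nlp("First sentence here . The entity lives in the second sentence .")

# Treat the second part of the document as the "sentence" Span,
# mimicking what parse_ace_event.py does with sent.as_doc().
sent = doc[4:]                                  # Span starting at "The"
entity_start_char = doc.text.index("entity")    # document-level char offset

# Reproduces the problem: tok.idx inside sent.as_doc() restarts at 0,
# so it never matches a document-level character offset.
sent_doc = sent.as_doc()
print([tok for tok in sent_doc if tok.idx == entity_start_char])  # []

# One possible workaround: shift the document-level offset by the
# Span's own start_char before comparing.
relative_start = entity_start_char - sent.start_char
start_token = [tok for tok in sent_doc if tok.idx == relative_start]
print(len(start_token))  # 1
```

In the repo itself, an equivalent adjustment would presumably have to happen in align() or inside get_token_indices(); I have not verified whether that is the intended fix.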
Could you please fix this bug? Or did I do something wrong?
Best regards.
Comments (5):
You should use spacy==2.0.18.
It seems this bug comes from incorrect sentence splitting. The recommended approach is to install the same spaCy version that this repo uses. The other option is to clean up the training articles and delete special tokens like '–', which cause sentences to be split incorrectly and give rise to this bug. All in all, it is a tortuous process.
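As an illustration of the second option, here is a minimal, hypothetical cleanup pass over the raw article text (the character set to strip and the replacement are my own assumptions, not something prescribed by the repo):

```python
import re

# Hypothetical cleanup: map en/em dashes, which have been reported to
# confuse spaCy's sentence splitting, to a plain hyphen.
SPECIAL_DASHES = re.compile(r"[\u2013\u2014]")

def clean_article(text):
    return SPECIAL_DASHES.sub("-", text)

print(clean_article("Baghdad \u2013 Iraqi officials said on Monday..."))
# Baghdad - Iraqi officials said on Monday...
```

Whether this is enough depends on which characters actually break the sentence splitter in your copy of the data, so treat it as a starting point rather than a complete fix.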