len(start_token) != 1 in parse_ace_event.py
When preprocessing the ACE 2005 dataset via parse_ace_event.py, I got the following traceback:
```
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(780)<module>() -> main()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(776)main() -> include_pronouns=args.include_pronouns)
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(750)one_fold() -> js = document.to_json()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(726)to_json() -> js = doc.to_json()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(229)to_json() -> self.remove_whitespace()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(216)remove_whitespace() -> entry.remove_whitespace()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(160)remove_whitespace() -> self.align()
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(152)align() -> entity.align(self.sent)
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(43)align() -> self.span_sentence = get_token_indices(self, sent.as_doc())
d:\github project\eeqa-master\proc\scripts\data\ace-event\parse_ace_event.py(248)get_token_indices() -> debug_if(len(start_token) != 1)
```
As you can see here, len(start_token) != 1.
The bug arises here:
```python
def get_token_indices(entity, sent):
    start_token = [tok for tok in sent if tok.idx == entity.start_char]
```
The reason is that when sent.as_doc() is fed into get_token_indices(), tok.idx is counted from the beginning of that sentence, while entity.start_char is counted from the start of the whole document, so start_token ends up as [].
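For illustration, here is a minimal, self-contained sketch of the mismatch (the text, token indices, and the workaround of shifting the offset by the Span's start_char are my own assumptions, not the repo's actual data or intended fix):

```python
import spacy

nlp = spacy.blank("en")  # plain tokenizer; no model download needed
doc = nlp("First sentence here . The entity lives in the second sentence .")

# Treat the second part of the document as the "sentence" Span,
# mimicking what parse_ace_event.py does with sent.as_doc().
sent = doc[4:]                                  # Span starting at "The"
entity_start_char = doc.text.index("entity")    # document-level char offset

# Reproduces the problem: tok.idx inside sent.as_doc() restarts at 0,
# so it never matches a document-level character offset.
sent_doc = sent.as_doc()
print([tok for tok in sent_doc if tok.idx == entity_start_char])  # []

# One possible workaround: shift the document-level offset by the
# Span's own start_char before comparing.
relative_start = entity_start_char - sent.start_char
start_token = [tok for tok in sent_doc if tok.idx == relative_start]
print(len(start_token))  # 1
```

In the repo itself, an equivalent adjustment would presumably have to happen in align() or inside get_token_indices(); I have not verified whether that is the intended fix.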
Could you please fix this bug? Or did I do something wrong?
Best regards.
Comments (5):
You should use spacy==2.0.18.
It seems this bug comes from incorrect sentence splitting. The recommended approach is to install the same spaCy version that this repo uses. The other option is to clean up the training articles and delete special tokens like '–', which cause sentences to be split incorrectly and give rise to this bug. All in all, it is a tortuous process.
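As an illustration of the second option, here is a minimal, hypothetical cleanup pass over the raw article text (the character set to strip and the replacement are my own assumptions, not something prescribed by the repo):

```python
import re

# Hypothetical cleanup: map en/em dashes, which have been reported to
# confuse spaCy's sentence splitting, to a plain hyphen.
SPECIAL_DASHES = re.compile(r"[\u2013\u2014]")

def clean_article(text):
    return SPECIAL_DASHES.sub("-", text)

print(clean_article("Baghdad \u2013 Iraqi officials said on Monday..."))
# Baghdad - Iraqi officials said on Monday...
```

Whether this is enough depends on which characters actually break the sentence splitter in your copy of the data, so treat it as a starting point rather than a complete fix.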