header training data - mismatching features columns?
I’ve been testing the header model training data using features #76 and I’ve run into one small problem: the number of columns does not seem to be consistent.
For example, ‘anaesthesia’ on line 938 and ‘elsevier’ on line 813 have a different number of columns (30 vs 31):
ELSEVIER elsevier E EL ELS ELSE R ER IER VIER BLOCKSTART LINESTART NEWFONT HIGHERFONT 0 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 0 0 NOPUNCT 0 0 I-<note>
Anaesthesia anaesthesia A An Ana Anae a ia sia esia BLOCKSTART LINESTART LINEINDENT NEWFONT HIGHERFONT 0 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 NOPUNCT 0 0 <reference>
I’m not sure this is a bug (at least not in the current version, which ignores this information), and I’m also not sure this is the right place to report it, but training the header model will fail with automatic feature discovery enabled.
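One way to locate every offending line is to count the whitespace-separated fields per line and compare against the most common count. The following is a minimal sketch, assuming the training file has one token per line with space-separated features and blank lines between sequences; the function name `find_column_mismatches` is hypothetical and not part of any existing tool.

```python
from collections import Counter


def find_column_mismatches(lines):
    """Return (expected_count, mismatches) for whitespace-separated feature lines.

    mismatches is a list of (line_number, column_count, first_token) tuples for
    every non-blank line whose column count differs from the most common one.
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:  # assumed: blank lines only separate sequences
            continue
        rows.append((number, len(fields), fields[0]))

    if not rows:
        return 0, []

    # Treat the most frequent column count as the expected one.
    expected = Counter(count for _, count, _ in rows).most_common(1)[0][0]
    mismatches = [row for row in rows if row[1] != expected]
    return expected, mismatches
```

Passing an open file handle for the training file should work, since the function only iterates over lines. Note that this counts every field, including the token and the label, so the absolute numbers may differ from the ones quoted above even though the one-column difference would still show up.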
Issue Analytics
- Created 4 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
@kermitt2 thanks! I will check today.
Yes, reference-segmenter model (not segmentation). No clue how some of these features arrived there! The features are combined with the header labels at a late stage, so the labels are not a good hint.