ptr type="web" not detected
Hi,
I was training the citation model and everything is detected correctly except the URL. Here is an example of my training data:
<bibl> <author>Azaola, Elena</author> (<date>2009</date>). <title level="a">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>. <title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>, <biblScope unit="volume"></biblScope>(<biblScope unit="issue" type="issue">6</biblScope>), <biblScope unit="page">115-122</biblScope>. <idno type="ISSN"> ISSN: 1390-3691</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr> </bibl>
<bibl> <author>Trejo Nieto, Alejandra</author> (<date>2013</date>). <title level="a">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>. <title level="j">Estudios Demográficos y Urbanos</title>, <biblScope unit="volume">28</biblScope>(<biblScope unit="issue" type="issue">3</biblScope>), <biblScope unit="page">545-591</biblScope>. <idno type="ISSN"> ISSN: 0186-7210</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=31230011001</ptr> </bibl>
Maybe I am doing something wrong, but I can’t spot it.
Top GitHub Comments
Hello!
The encoding of the results follows the TEI, so URLs are encoded like this by definition: <ptr> has no type, and the target URL is given by the @target attribute. Why do you think it is a problem?
Maybe I can stress that the encoding of the training data is different from the encoding of the final processed result. Grobid parsing results are metadata, so they are normalized and independent of any particular order, presentation, or serialization. It’s the format expected by a catalogue, for instance.
Training data follow the input (for instance noisy token sequences from a PDF) and thus are not normalized. As they follow exactly the input string, the encoding is “inline”, identifying spans to be extracted, so content is never in an attribute (XML attributes must be normalized to avoid XML failures).
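To make the difference concrete, here is a minimal Python sketch of the two encodings. It is only an illustration and not Grobid's own code: the function name inline_ptr_to_target is hypothetical, and the real Grobid output is a full normalized TEI structure, not just this attribute move. It takes an inline training-style <ptr type="web">URL</ptr> and rewrites it in the result style, with the URL in @target and no type.

```python
import xml.etree.ElementTree as ET


def inline_ptr_to_target(bibl_xml: str) -> str:
    """Hypothetical helper: move the inline text of each <ptr> into @target."""
    root = ET.fromstring(bibl_xml)
    for ptr in root.iter("ptr"):
        url = (ptr.text or "").strip()
        # Result-style <ptr> carries no type attribute.
        ptr.attrib.pop("type", None)
        if url:
            ptr.set("target", url)
        # The URL now lives in the attribute, so the element becomes empty.
        ptr.text = None
    return ET.tostring(root, encoding="unicode")


# Training-style input: the URL is inline content, exactly as in the source string.
training = (
    '<bibl><title level="j">Estudios Demográficos y Urbanos</title>. '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=31230011001</ptr></bibl>'
)

result = inline_ptr_to_target(training)
print(result)
```

So a result where <ptr> has a target attribute and no type does not by itself mean the URL was missed. It is the same URL in the normalized form.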
To generate pre-annotated training data, you can use the batch method createTraining, which produces inline annotations on the exact input reference strings.
I understand. Sorry, English is not my native language and sometimes I have trouble understanding. I will retrain the model and check. Thanks for your time and patience.