GROBID 0.5.4 abstract extraction regression

There are a few cases where abstracts extracted using GROBID 0.5.4 have regressed compared to 0.5.3.

Most of my examples are author-submitted and not public, but I found two on bioRxiv. (19 out of 100 documents have worse results; about 9 have no abstract at all.)

The PDFs contain line numbers, but I originally tested with the line numbers removed.

I am getting the following extraction:

(1) NECAPs are negative regulators of the AP2 clathrin adaptor complex

0.5.3

        <profileDesc>
            <abstract>
                <p>11 Eukaryotic cells internalize transmembrane receptors via clathrin-mediated endocytosis, 12 but it remains unclear how the machinery underpinning this process is regulated. We recently 13 discovered that membrane-associated muniscin proteins such as FCHo and SGIP initiate 14 endocytosis by converting the AP2 clathrin adaptor complex to an open, active conformation 15 that is then phosphorylated (Hollopeter et al., 2014). Here we report that loss of ncap-1, the sole 16 C. elegans gene encoding an adaptiN Ear-binding Coat-Associated Protein (NECAP), bypasses 17 the requirement for FCHO-1. Biochemical analyses reveal AP2 accumulates in an open, 18 phosphorylated state in ncap-1 mutant worms, suggesting NECAPs promote the closed, 19 inactive conformation of AP2. Consistent with this model, NECAPs preferentially bind open and 20 phosphorylated forms of AP2 in vitro and localize with constitutively open AP2 mutants in vivo. 21 NECAPs do not associate with phosphorylation-defective AP2 mutants, implying that 22 phosphorylation precedes NECAP recruitment. We propose NECAPs function late in 23 endocytosis to inactivate AP2.</p>
            </abstract>
        </profileDesc>

0.5.4

        <profileDesc>
            <abstract>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0" />
            </abstract>
        </profileDesc>

(although I did get an extraction before)

(2) Intermittent Ca2+ signals mediated by Orai1 regulate basal T cell motility

0.5.3

        <profileDesc>
...
            <abstract>
                <p>20 Ca 2+ influx through Orai1 channels is crucial for several T cell functions, but a role in 21 regulating basal cellular motility has not been described. Here we show that inhibition of 22 Orai1 channel activity increases average cell velocities by reducing the frequency of 23 pauses in human T cells migrating through confined spaces, even in the absence of 24 extrinsic cell contacts or antigen recognition. Utilizing a novel ratiometric genetically 25 encoded cytosolic Ca 2+ indicator, Salsa6f, which permits real-time monitoring of cytosolic 26 Ca 2+ along with cell motility, we show that spontaneous pauses during T cell motility in 27 vitro and in vivo coincide with episodes of cytosolic Ca 2+ signaling. Furthermore, lymph 28 node T cells exhibited two types of spontaneous Ca 2+ transients: short-duration &quot;sparkles&quot; 29 and longer duration global signals. Our results demonstrate that spontaneous and self-30</p>
            </abstract>
        </profileDesc>

0.5.4

        <profileDesc>
...
            <abstract>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Abstract</head>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>20</head>
                    <p>Ca 2+ influx through Orai1 channels is crucial for several T cell functions, but a role in 21 regulating basal cellular motility has not been described. Here we show that inhibition of was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     3 To initiate the adaptive immune response, T cells must make direct contact with antigen- and peripheral tissues 
                        <ref type="bibr" target="#b52">(Miller, Wei et al. 2002</ref>
                        <ref type="bibr" target="#b6">, Bousso and Robey 2003</ref>, Mempel,   42   Henrickson et al. 2004
                        <ref type="bibr" target="#b56">, Mrass, Petravic et al. 2010</ref>. T cell motility in steady-state lymph 43    nodes under homeostatic conditions, referred to as "basal motility", has been likened to 44    diffusive Brownian motion, resembling a "stop-and-go" random walk that results in an 45 overall exploratory spread characterized by a linear mean-squared displacement over 46 time 
                        <ref type="bibr" target="#b52">(Miller, Wei et al. 2002)</ref>. Subsequent studies defined a role of cellular cues in guiding
                    </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>47</head>
                    <p>T cell migration, such as contact with the lymph node stromal cell network or short-term 48 encounters with resident dendritic cells (Miller, Hejazi et al. 2004, Bajenoff, Egen et al.   49    2006, 
                        <ref type="bibr" target="#b33">Khan, Headley et al. 2011)</ref>. Whereas the basic signaling mechanisms for cell-  Upon T cell recognition of cognate antigen, TCR engagement results in an 54 elevated cytosolic Ca 2+ concentration that acts as a "STOP" signal to halt motility and 55 anchor the T cell to the site of antigen presentation (Donnadieu, Bismuth et al. 1994,   56    All rights reserved. No reuse allowed without permission.
                    </p>
                    <p>was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     4 Negulescu, 
                        <ref type="bibr" target="#b57">Krasieva et al. 1996</ref>
                        <ref type="bibr" target="#b13">, Dustin, Bromley et al. 1997</ref>
                        <ref type="bibr" target="#b4">, Bhakta, Oh et al. 2005</ref> Moreau, 
                        <ref type="bibr" target="#b53">Lemaitre et al. 2015</ref>   However, despite their contributions to other aspects of T cell function, no role has been 74 identified for Orai1 channels in T cell motility patterns underlying scanning behavior.
                    </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>75</head>
                    <p>In this study, we use human and mouse T cells to assess the role of Orai1 and 76 Ca 2+ ions in regulating basal cell motility. Expression of a dominant-negative Orai1-E106A 77 construct was used to block Orai1 channel activity in human T cells, both in vivo within 78 immunodeficient mouse lymph nodes 
                        <ref type="bibr" target="#b26">(Greenberg, Yu et al. 2013)</ref>, and in vitro within was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                    </p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     6</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Results</head>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>89</head>
                    <p>Inhibition of Orai1 in human T cells using a dominant-negative construct 90    To study the role of Orai1 channel activity in T cell motility, we transfected human T cells Ca 2+ permeation in a potent dominant-negative manner 
                        <ref type="bibr" target="#b26">(Greenberg, Yu et al. 2013</ref>).
                    </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>97</head>
                    <p>Using Fura-2 based Ca 2+ imaging, we confirmed Orai1 channel block by E106A in was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     7 Orai1 channel activity, and that transfected cells without detectable eGFP fluorescence 111 can be used as an internal control.</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>112</head>
                    <p>Orai1 function in human T cell motility was evaluated in vivo using a human </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Orai1 block increases human T cell motility within intact lymph node</head>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>131</head>
                    <p>To evaluate Orai1 function in T cell motility, we imaged human T cells within intact lymph 132 nodes of reconstituted NOD.SCID.β2 mice by two-photon microscopy ( 
                        <ref type="figure">Figure 2A</ref>). We was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                    </p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017; 8 found that eGFP-E106A hi T cells migrated with significantly higher average velocities than </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>147</head>
                    <p>To replicate our findings in a different immunodeficient mouse model, we repeated 148 our human T cell adoptive transfer protocol using NOD.SCID mice depleted of NK cells.</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>149</head>
                    <p>Lymph nodes in these mice are small and contain reticular structures but are completely was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017; 9 NOD.SCID.β2 lymph nodes migrated at similar speeds to wildtype mouse T cells in vivo 157 
                        <ref type="bibr" target="#b52">(Miller, Wei et al. 2002)</ref>, reconstitution results in a lymph node environment that more 158 closely mimics normal physiological conditions. Furthermore, the greater effect of Orai1 159 block on T cell arrest coefficients in crowded reconstituted lymph nodes suggests that 160 Orai1's role in motility is more pronounced in crowded cell environments.  was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                    </p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017;    was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <note type="other">rights reserved. No reuse allowed without permission. was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1,</note>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017; 12 strong inverse relationship: highly motile T cells always exhibited baseline Ca 2+ levels, 226 while elevated Ca 2+ levels were only found in slower or arrested T cells ( 
                        <ref type="figure">Figure 5D</ref>). It is 227 important to note that these Ca 2+ signals and reductions in velocity occurred in the 228 absence of any extrinsic cell contact or antigen recognition, indicating that Ca 2+ 229 elevations, like pausing and Orai1 activation, can be triggered in a cell-intrinsic manner.
                    </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>230</head>
                    <p>To compare the effects of Orai1 activity on the motility of T cells in a less confined  was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017; 13 together, these experiments establish a role for Orai1 channels and Ca 2+ influx in 248 modulating T cell motility within confined environments. was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     14 similar motility coefficients 
                        <ref type="figure">(Figure 6H, I)</ref>. Altogether, motility characteristics of Salsa6f T that the Ca 2+ rise is clearly associated with a decrease in velocity 
                        <ref type="figure">(Figure 7C and D)</ref>. was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
                    </p>
                    <p>The   was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <note type="other">copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1,</note>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online Nov. 1, 2017;</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>16</head>
                    <p>There was also significant variation in the number of Ca 2+ transients in ITC antibody and  was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     17</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Discussion</head>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>326</head>
                    <p>In this study, we demonstrate that Orai1 channel activity regulates motility patterns that </p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>347</head>
                    <p>These heteromers appear to simply reduce the flow of Ca 2+ through the Orai1 channel 348 All rights reserved. No reuse allowed without permission.</p>
                    <p>was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.   was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.</p>
                    <p>The copyright holder for this preprint (which . http://dx.doi.org/10.1101/212274 doi: bioRxiv preprint first posted online     20 events occur frequently as T cell migrate through the lymph node, and Ca 2+ transients 395 are associated with pauses in motility, we propose that spontaneously generated Orai1-396 dependent pauses and turns can be triggered by T cell-APC interaction through MHC 397 proteins.</p>
                </div>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>398</head>
                    <p>However, we find evidence for MHC-independent triggering of Ca 2+ signaling and  </p>
                </div>
            </abstract>
        </profileDesc>

(I am getting a similar long text after line numbers have been removed)
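Both failure modes above (an empty abstract, and an abstract that swallows body text) can be flagged automatically when comparing versions over a batch of documents. The following is a minimal sketch, not part of GROBID: the helper names `abstract_text` and `classify` and the 600-word cutoff are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def abstract_text(profile_desc_xml: str) -> str:
    """Return the whitespace-normalized text inside <abstract>, or "" if empty or absent."""
    root = ET.fromstring(profile_desc_xml)
    # The fragments in this issue mix un-namespaced elements with
    # TEI-namespaced <div> children, so match on the local tag name only.
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "abstract":
            return " ".join("".join(elem.itertext()).split())
    return ""


def classify(text: str, max_words: int = 600) -> str:
    """Label an extracted abstract; max_words is a hypothetical cutoff, tune per corpus."""
    if not text:
        return "missing"
    if len(text.split()) > max_words:
        return "too_long"
    return "ok"


if __name__ == "__main__":
    # Shaped like the 0.5.4 output for example (1): an abstract with an empty <div>.
    empty = (
        '<profileDesc><abstract>'
        '<div xmlns="http://www.tei-c.org/ns/1.0" />'
        '</abstract></profileDesc>'
    )
    print(classify(abstract_text(empty)))
```

Running `classify` on the output of both versions for the same PDF gives a quick per-document signal; it does not judge whether the text is correct, only whether it is absent or implausibly long.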

/cc @kermitt2

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
de-code commented, Sep 16, 2019

I’ve re-run the evaluation. There may be minor differences due to the conversion errors, but in general I can confirm the results on our dataset as well:

[image: evaluation results]

i.e. the configuration is probably not needed.

Thank you for looking into and fixing the issue.

1 reaction
kermitt2 commented, Sep 12, 2019

Many thanks, Daniel, for raising the problems with use cases; it was super useful.

PR #486 solves all these issues. There were quite a lot of bugs (4-5!) in the process of structuring the abstract. If you could test your non-sharable documents again, that would be great. If you still see some issues, don’t hesitate to report them.

On the benchmark of 1942 PubMed Central PDFs, there is no longer any regression relative to GROBID 0.5.3, 0.5.4 or earlier when using the abstract structuring. On the contrary, accuracy is improving nicely, with the added benefit of the structures, citation callout matching and so on.

After merging the PR, we have the following for abstracts with structuring:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1
abstract             94.88        82.11        75.88        78.87

compared to an F1-score of 75.18 (v0.5.5), 72.83 (v0.5.4) and 77.71 (v0.5.3).
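For readers unfamiliar with this kind of evaluation, here is a rough sketch of how a field could be scored with a Levenshtein threshold and how the F1 column follows from precision and recall. It is not GROBID's evaluation code: the normalization (one minus edit distance over the longer string's length) and the function names are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(expected: str, extracted: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(expected), len(extracted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, extracted) / longest


def is_match(expected: str, extracted: str, threshold: float = 0.8) -> bool:
    """Count the extracted field as correct if similarity reaches the threshold."""
    return similarity(expected, extracted) >= threshold


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall from the table above reproduce its f1 column.
    print(round(f1(82.11, 75.88), 2))
```

Under this kind of matching, an abstract that picks up a few stray line numbers can still count as correct, while an empty abstract or one that runs into the body text falls below the threshold.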
