Timeout/Error when processing an MDPI PDF
Hello!
I successfully deployed a grobid-server and I was parsing some articles. Everything was smooth until I found this paper: https://www.mdpi.com/2076-3921/11/2/251
I downloaded its PDF and tried to parse it with the following command:
curl -v --form input=@./mdpi_article.pdf localhost:8070/api/processFulltextDocument
And I get an XML with the following content:
[TIMEOUT] PDF to XML conversion timed out
When I try the processHeaderDocument command instead, everything works as expected and the article header (title, abstract, etc.) is parsed correctly:
curl -v --form input=@./mdpi_article.pdf localhost:8070/api/processHeaderDocument
This is the error I got in the server log for the failing processFulltextDocument call:
ERROR [2022-04-06 14:07:25,987] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 143. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin3229937128031954158.pdf, /opt/grobid/grobid-home/tmp/4wDQxkvfeZ.lxml]
ERROR [2022-04-06 14:07:25,987] org.grobid.core.process.ProcessPdfToXml: pdfalto return message:
ERROR [2022-04-06 14:07:25,988] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
You mention here that this error can be a bit of anything, so let me know if you need more data to replicate it. The server config is set to 10 threads and a timeout of 120 seconds, though I get this “timeout error” after only 20 seconds.
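A note on the error code in the log above: 143 is not specific to pdfalto. By shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 143 corresponds to signal 15, SIGTERM. That suggests pdfalto was killed from outside rather than crashing on its own, which would fit the [TIMEOUT] message, although the log does not say what sent the signal. A minimal sketch of the arithmetic (the variable names are illustrative only, not part of GROBID):

```shell
# Exit statuses above 128 conventionally mean "killed by signal (status - 128)".
status=143                               # value copied from the pdfalto log line
signal_number=$((status - 128))          # 143 - 128 = 15
signal_name=$(kill -l "$signal_number")  # maps a signal number to its name
echo "exit status $status = signal $signal_number (SIG$signal_name)"
```

This only decodes the status; it does not explain why the process was stopped after about 20 seconds instead of the configured 120.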
Top GitHub Comments
@kermitt2 I can confirm that upgrading to version 0.7.1 solved the issue. In a parallel universe, it would be really nice to know what it was and how it was fixed, but I am more than content with this being solved! Thanks for the tip, and sorry for the delay; a lot of stuff that had priority happened during the Easter holidays.
@mazzespazze thanks for the feedback, and good that this PDF is working now too! To be honest, I cannot immediately point to one of the fixes we made in the last months that would have solved this problem. It is quite time-consuming to track such problems back to the PDF parsing, which is likely where the trouble was taking place, so it was probably related to pdfalto or to the interface with pdfalto.