Support running Grobid in Apache Spark

We are considering running Grobid in our Spark environment as part of the Semantic Scholar project. Currently, PDFs are shuffled to Spark nodes as RDDs of byte arrays, which live in the nodes' main memory. These byte arrays must then be written to a temporary directory, which is passed as the value of the -dIn argument. The extracted XML files are then read from a temporary directory (the -dOut argument) back into memory as byte-array RDDs, which are processed by the next step in the Spark pipeline.
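The round trip described above can be sketched roughly as follows. This is a minimal illustration, not Grobid's or Spark's actual API: `extract_via_temp_dirs`, `run_batch`, and `fake_batch` are hypothetical names, `run_batch` stands in for whatever launches a batch run with -dIn and -dOut, and the output file naming (`<name>.xml`) is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def extract_via_temp_dirs(
    pdfs: Dict[str, bytes],
    run_batch: Callable[[Path, Path], None],
) -> Dict[str, bytes]:
    """Round-trip in-memory PDFs through temporary input/output directories."""
    with tempfile.TemporaryDirectory() as d_in, tempfile.TemporaryDirectory() as d_out:
        in_dir, out_dir = Path(d_in), Path(d_out)
        for name, data in pdfs.items():
            # One disk write per PDF.
            (in_dir / f"{name}.pdf").write_bytes(data)
        # Hypothetical stand-in for a batch run using -dIn in_dir and -dOut out_dir.
        run_batch(in_dir, out_dir)
        # One disk read per extracted XML file.
        return {p.stem: p.read_bytes() for p in sorted(out_dir.glob("*.xml"))}


def fake_batch(in_dir: Path, out_dir: Path) -> None:
    """Toy extractor (assumption): writes one XML file per input PDF."""
    for pdf in in_dir.glob("*.pdf"):
        size = len(pdf.read_bytes())
        (out_dir / f"{pdf.stem}.xml").write_bytes(f"<TEI>{size}</TEI>".encode())
```

Calling `extract_via_temp_dirs({"a": pdf_bytes}, fake_batch)` returns a dictionary of XML byte strings keyed by document name. Both directories are deleted when the function returns.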

The current design incurs one disk write and one disk read per PDF. These I/O operations could be eliminated if Grobid accepted PDFs as byte arrays directly. To enable this, Grobid would need to call pdf2xml through a library API wrapped with JNI rather than shelling out to a separate process (correct me if I am wrong). This approach could also support multi-threaded processing, similar to Grobid's REST service.
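As a rough sketch of what the requested interface might enable, assume a hypothetical in-memory extractor `extract(pdf_bytes) -> xml_bytes` that is safe to call from several threads. Grobid does not necessarily expose such a function; `extract_partition` and `max_workers` are likewise illustrative names. One partition of records could then be processed without touching the disk:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Tuple


def extract_partition(
    records: Iterable[Tuple[str, bytes]],
    extract: Callable[[bytes], bytes],
    max_workers: int = 4,
) -> Iterator[Tuple[str, bytes]]:
    """Apply an in-memory extractor to (doc_id, pdf_bytes) pairs with a thread pool."""
    items = list(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with items.
        xmls = list(pool.map(lambda rec: extract(rec[1]), items))
    return iter([(doc_id, xml) for (doc_id, _), xml in zip(items, xmls)])
```

In PySpark this could be plugged in with something like `rdd.mapPartitions(lambda it: extract_partition(it, extract))`, assuming the RDD holds `(doc_id, pdf_bytes)` pairs and the extractor can be shipped to the worker nodes.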

Thanks, Vu Ha. The Semantic Scholar project

Issue Analytics

  • State: open
  • Created: 9 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
kermitt2 commented, Sep 15, 2021

Hi, I’m also interested in importing Grobid as a Java dependency in Scala code for Spark. Is this feature on the roadmap, @kermitt2?

Re-hello @xegulon 😃

Is that different from using GROBID as a third-party JAR on the JVM?

https://grobid.readthedocs.io/en/latest/Grobid-java-library/

1 reaction
rkarimi commented, Mar 22, 2018

Was there any progress made here? To run Grobid in Spark Nodes?
