Support running Grobid in Apache Spark
We are considering running Grobid in our Spark environment as part of the Semantic Scholar project. Currently, PDFs are shuffled to Spark nodes as RDDs of byte arrays that live in the nodes’ main memory. These byte arrays have to be written out to a temporary directory, which is passed to Grobid as the -dIn argument. The extracted XML files are then read back from another temporary directory (the -dOut argument) into memory as byte-array RDDs, which are processed by the next step in the Spark pipeline.
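To make the setup concrete, here is a minimal sketch of that two-step flow; the helper names (`extractViaTempDirs`, `grobidBatchJar`) are hypothetical, the grobid-home flags are omitted, and the exact batch invocation may differ from our actual pipeline:

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._
import scala.sys.process._

// Hypothetical helper illustrating the current temp-directory flow.
def extractViaTempDirs(pdfRdd: RDD[(String, Array[Byte])],
                       grobidBatchJar: String): RDD[(String, Array[Byte])] =
  pdfRdd.mapPartitions { pdfs =>
    val in  = Files.createTempDirectory("grobid-in")
    val out = Files.createTempDirectory("grobid-out")
    // One disk write per PDF: materialize the in-memory bytes for -dIn.
    pdfs.foreach { case (name, bytes) =>
      Files.write(in.resolve(s"$name.pdf"), bytes)
    }
    // Shell out to the Grobid batch jar (grobid-home flags omitted here).
    Seq("java", "-jar", grobidBatchJar,
        "-dIn", in.toString, "-dOut", out.toString,
        "-exe", "processFullText").!
    // One disk read per result: load the extracted XML back into memory.
    Files.list(out).iterator().asScala.map { p =>
      (p.getFileName.toString, Files.readAllBytes(p))
    }
  }
```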
This design incurs one disk write and one disk read per PDF. Both IO operations could be eliminated if Grobid accepted PDFs as byte arrays directly. To enable this, Grobid would need to call pdf2xml through a library API wrapped with JNI rather than shelling out to a separate process (correct me if I am wrong). This approach could also support multi-threaded processing, similar to Grobid’s REST service.
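For illustration, a hypothetical sketch of what the proposed interface could look like from Spark; `InMemoryEngine` and `fullTextToTEI(Array[Byte])` do not exist in Grobid today and only stand in for the requested byte-array API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical interface -- Grobid has no in-memory byte-array API today.
trait InMemoryEngine extends Serializable {
  def fullTextToTEI(pdf: Array[Byte]): String
}

def extractInProcess(pdfRdd: RDD[(String, Array[Byte])],
                     newEngine: () => InMemoryEngine): RDD[(String, String)] =
  pdfRdd.mapPartitions { pdfs =>
    // One engine per partition, analogous to a worker behind the REST service.
    val engine = newEngine()
    pdfs.map { case (name, bytes) =>
      // pdf2xml would run via JNI on the raw bytes: no per-PDF disk IO.
      (name, engine.fullTextToTEI(bytes))
    }
  }
```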
Thanks,
Vu Ha, the Semantic Scholar project
Issue Analytics
- Created: 9 years ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Hi, I’m also interested in importing Grobid as a Java dependency in Scala code for Spark. Is this feature on the roadmap, @kermitt2?
Re-hello @xegulon 😃
Is it different from using GROBID as a third-party JAR on the JVM?
https://grobid.readthedocs.io/en/latest/Grobid-java-library/
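For reference, a sketch of what using GROBID as a library from Spark might look like, following the Engine initialization shown in the linked docs; the grobid-home path is an assumption and must exist on every executor, and method names may vary across Grobid versions:

```scala
import java.io.File
import java.util.Arrays
import org.apache.spark.rdd.RDD
import org.grobid.core.engines.Engine
import org.grobid.core.engines.config.GrobidAnalysisConfig
import org.grobid.core.factory.GrobidFactory
import org.grobid.core.main.GrobidHomeFinder
import org.grobid.core.utilities.GrobidProperties

// One Engine per executor JVM. Engine is not thread-safe, so use one per
// task/partition (or a pool) if tasks run concurrently in the same JVM.
object GrobidOnExecutor {
  lazy val engine: Engine = {
    val finder = new GrobidHomeFinder(Arrays.asList("/opt/grobid/grobid-home"))
    GrobidProperties.getInstance(finder)
    GrobidFactory.getInstance().createEngine()
  }
}

def extractWithJar(pdfPaths: RDD[String]): RDD[(String, String)] =
  pdfPaths.map { path =>
    // fullTextToTEI takes a File, so PDFs must sit on executor-local disk.
    val tei = GrobidOnExecutor.engine.fullTextToTEI(
      new File(path), GrobidAnalysisConfig.defaultInstance())
    (path, tei)
  }
```

Note that this route still requires file inputs, since Grobid shells out to pdf2xml internally; it removes the shared -dIn/-dOut directories and the separate batch process, but not the per-PDF disk write the issue is asking to eliminate.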
Was there any progress made here on running Grobid on Spark nodes?