Support running Grobid in Apache Spark
We are considering running Grobid in our Spark environment as part of the Semantic Scholar project. Currently, PDFs are shuffled to Spark nodes as RDDs of byte arrays that live in the nodes’ main memory. These byte arrays have to be written out to a temporary directory, which is passed to Grobid as the -dIn argument. The extracted XML files are then read back from another temporary directory (the -dOut argument) into memory as byte-array RDDs, which are processed by the next step in the Spark pipeline.
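To make the setup concrete, here is a minimal sketch of that two-step flow; the helper names (`extractViaTempDirs`, `grobidBatchJar`) are hypothetical, the grobid-home flags are omitted, and the exact batch invocation may differ from our actual pipeline:

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._
import scala.sys.process._

// Hypothetical helper illustrating the current temp-directory flow.
def extractViaTempDirs(pdfRdd: RDD[(String, Array[Byte])],
                       grobidBatchJar: String): RDD[(String, Array[Byte])] =
  pdfRdd.mapPartitions { pdfs =>
    val in  = Files.createTempDirectory("grobid-in")
    val out = Files.createTempDirectory("grobid-out")
    // One disk write per PDF: materialize the in-memory bytes for -dIn.
    pdfs.foreach { case (name, bytes) =>
      Files.write(in.resolve(s"$name.pdf"), bytes)
    }
    // Shell out to the Grobid batch jar (grobid-home flags omitted here).
    Seq("java", "-jar", grobidBatchJar,
        "-dIn", in.toString, "-dOut", out.toString,
        "-exe", "processFullText").!
    // One disk read per result: load the extracted XML back into memory.
    Files.list(out).iterator().asScala.map { p =>
      (p.getFileName.toString, Files.readAllBytes(p))
    }
  }
```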
This design incurs one disk write and one disk read per PDF. Both IO operations could be eliminated if Grobid accepted PDFs as byte arrays directly. To enable this, Grobid would need to call pdf2xml through a library API wrapped with JNI rather than shelling out to a separate process (correct me if I am wrong). This approach could also support multi-threaded processing, similar to Grobid’s REST service.
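For illustration, a hypothetical sketch of what the proposed interface could look like from Spark; `InMemoryEngine` and `fullTextToTEI(Array[Byte])` do not exist in Grobid today and only stand in for the requested byte-array API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical interface -- Grobid has no in-memory byte-array API today.
trait InMemoryEngine extends Serializable {
  def fullTextToTEI(pdf: Array[Byte]): String
}

def extractInProcess(pdfRdd: RDD[(String, Array[Byte])],
                     newEngine: () => InMemoryEngine): RDD[(String, String)] =
  pdfRdd.mapPartitions { pdfs =>
    // One engine per partition, analogous to a worker behind the REST service.
    val engine = newEngine()
    pdfs.map { case (name, bytes) =>
      // pdf2xml would run via JNI on the raw bytes: no per-PDF disk IO.
      (name, engine.fullTextToTEI(bytes))
    }
  }
```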
Thanks,
Vu Ha, the Semantic Scholar project
Issue Analytics
- Created: 9 years ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Hi, I’m also interested in importing Grobid as a Java dependency in Scala code for Spark. Is this feature on the roadmap, @kermitt2?
Re-hello @xegulon 😃
Is it different from using GROBID as a third-party JAR on the JVM?
https://grobid.readthedocs.io/en/latest/Grobid-java-library/
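For reference, a sketch of what using GROBID as a library from Spark might look like, following the Engine initialization shown in the linked docs; the grobid-home path is an assumption and must exist on every executor, and method names may vary across Grobid versions:

```scala
import java.io.File
import java.util.Arrays
import org.apache.spark.rdd.RDD
import org.grobid.core.engines.Engine
import org.grobid.core.engines.config.GrobidAnalysisConfig
import org.grobid.core.factory.GrobidFactory
import org.grobid.core.main.GrobidHomeFinder
import org.grobid.core.utilities.GrobidProperties

// One Engine per executor JVM. Engine is not thread-safe, so use one per
// task/partition (or a pool) if tasks run concurrently in the same JVM.
object GrobidOnExecutor {
  lazy val engine: Engine = {
    val finder = new GrobidHomeFinder(Arrays.asList("/opt/grobid/grobid-home"))
    GrobidProperties.getInstance(finder)
    GrobidFactory.getInstance().createEngine()
  }
}

def extractWithJar(pdfPaths: RDD[String]): RDD[(String, String)] =
  pdfPaths.map { path =>
    // fullTextToTEI takes a File, so PDFs must sit on executor-local disk.
    val tei = GrobidOnExecutor.engine.fullTextToTEI(
      new File(path), GrobidAnalysisConfig.defaultInstance())
    (path, tei)
  }
```

Note that this route still requires file inputs, since Grobid shells out to pdf2xml internally; it removes the shared -dIn/-dOut directories and the separate batch process, but not the per-PDF disk write the issue is asking to eliminate.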
Was there any progress made here on running Grobid on Spark nodes?