Native-Image: Newly created project with Tika extension can't extract anything

Describe the bug

I tried creating a Tika project following the Quarkus documentation. I think any PDF would produce the same result, but here’s the one I tried: americanexpress_01.pdf

Expected behavior

To give a little context, I should be able to extract text from any document Tika supports, so I simply need the whole Tika library bundled in the native application, which doesn’t seem to be what is happening right now.

Actual behavior

With the documentation’s example, when doing the curl POST with the PDF above, I get:

Error: Could not find referenced cmap stream Identity-H

Looking at the stack trace, I realise that this PDFBox resource (PDFBox is a Tika dependency) is missing from the bundled native image, along with those of all other Tika dependencies…
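One way to confirm that a resource really is absent from the built application is to probe the classpath at runtime. The following is only a minimal sketch: the class name `ResourceProbe` is hypothetical, and the default resource path is an assumption about where the cmap file lives, not something taken from the stack trace.

```java
import java.net.URL;

public class ResourceProbe {

    /** Returns true if the named classpath resource can be located. */
    public static boolean isAvailable(String path) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader when no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        URL url = loader.getResource(path);
        return url != null;
    }

    public static void main(String[] args) {
        // Assumed location of the cmap resource; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "org/apache/fontbox/cmap/Identity-H";
        System.out.println(path + " available: " + isAvailable(path));
    }
}
```

Running the same probe in JVM mode and in the native executable should show whether the resource was dropped during the native build.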

Configuration

quarkus.package.type=native
quarkus.package.uber-jar=true
quarkus.log.console.enable=false
quarkus.native.add-all-charsets=true
quarkus.native.additional-build-args=-H:ReflectionConfigurationFiles=reflection-config.json,-H:ResourceConfigurationFiles=resources-config.json 

What I tried

To test quickly, I tried adding a resources-config.json to the project.

{
  "resources": [
    {
      "pattern": ".*"
    }
  ]
}

and now success: the native app can extract the text of my file (yeah). But trying different PDFs led to different errors, and this time they were runtime errors.
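The `pattern` entries in resources-config.json are Java regular expressions matched against resource paths, so a pattern narrower than `.*` can be checked offline before rebuilding. This is a rough sketch only: the class name, the narrower pattern, and the sample paths are all hypothetical, and it assumes the whole path must match.

```java
import java.util.regex.Pattern;

public class ResourcePatternCheck {

    /** Returns true if the whole resource path matches the given regex. */
    public static boolean included(String regex, String resourcePath) {
        return Pattern.compile(regex).matcher(resourcePath).matches();
    }

    public static void main(String[] args) {
        // Hypothetical narrower alternative to ".*".
        String narrow = "org/apache/(pdfbox|fontbox)/.*";
        // Hypothetical resource paths used only for illustration.
        String[] paths = {
            "org/apache/fontbox/cmap/Identity-H",
            "org/apache/pdfbox/resources/example.txt",
            "org/apache/poi/example.xml"
        };
        for (String path : paths) {
            System.out.println(path + " -> " + included(narrow, path));
        }
    }
}
```

A narrower pattern keeps the executable smaller than `.*`, at the cost of having to list each dependency whose resources are needed.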

So, considering I followed the documentation properly and the native-image build finishes without any error, I’m really wondering what I’m doing wrong: this is supposed to parse any file and it doesn’t, because A LOT of dependencies seem to be missing from the native image.

Is the Quarkus Tika extension really supposed to be usable, or do I have to create an extension myself with everything I need, as is done in TikaProcessor.java (which doesn’t seem to include many things)?

For my use case, is there a way I can include everything from Tika?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 18 (9 by maintainers)

Top GitHub Comments

pelletier197 commented, Aug 26, 2020

You’re right, the link seems dead. There is this Medium post that explains the base configuration well.

Otherwise, Oracle has more stable documentation here.

sberyozkin commented, Oct 3, 2020

@pelletier197 OK, thanks, I’ll then go ahead and merge the PR. Yes, I’m aware of that folder; I copied one of the files (renamed it) to integration-tests/tika a while back. I agree more exhaustive testing would help, but testing a large set as part of the regular Quarkus build may be problematic. For now I’m trying to address the issues as they are reported, though I’m late on a few of them (POI is not happy in native mode in particular). Thanks
