Native-Image: Newly created project with Tika extension can't extract anything

Describe the bug

I tried creating a Tika project following the Quarkus documentation. I think any PDF would produce the same result, but here’s the one I tried: americanexpress_01.pdf

Expected behavior

To give a little context, I should be able to extract text from any document Tika supports, so I simply need the whole Tika library bundle in the native application, which doesn’t seem to be what is happening right now.

Actual behavior

With the documentation’s example, when I POST the PDF above with curl, I get:

Error: Could not find referenced cmap stream Identity-H

Looking at the stack trace, I realise that this resource is missing from PDFBox (a Tika dependency) inside the bundled native image, along with those of all the other Tika dependencies…
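One way to confirm which resources actually made it into the binary is to probe the classpath at runtime, since a native image only contains the resources registered at build time. The following is a minimal sketch, not part of the Tika extension: the class name `ResourceProbe` is hypothetical, and the cmap path passed in `main` is an assumption about where the PDFBox/FontBox jar keeps `Identity-H`, so check it against the jar contents before relying on it.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceProbe {

    // Returns true if the named classpath resource can be opened.
    static boolean isPresent(String resourcePath) {
        ClassLoader loader = ResourceProbe.class.getClassLoader();
        // getResourceAsStream returns null when the resource is absent;
        // try-with-resources tolerates a null resource.
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            return in != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ASSUMPTION: illustrative path only; verify it against the actual jar.
        String[] candidates = {
            "org/apache/fontbox/cmap/Identity-H"
        };
        for (String path : candidates) {
            System.out.println(path + " -> " + (isPresent(path) ? "present" : "MISSING"));
        }
    }
}
```

Running the same probe in JVM mode and in the native executable should show whether a given file was dropped during the native build rather than being absent from the dependency itself.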

Configuration

quarkus.package.type=native
quarkus.package.uber-jar=true
quarkus.log.console.enable=false
quarkus.native.add-all-charsets=true
quarkus.native.additional-build-args=-H:ReflectionConfigurationFiles=reflection-config.json,-H:ResourceConfigurationFiles=resources-config.json

What I tried

To test quickly, I tried adding a resources-config.json to the project:

{
  "resources": [
    {
      "pattern": ".*"
    }
  ]
}

and now success: the native app can extract the text of my file (yeah). But trying different PDFs led to different errors, and this time they were runtime errors.
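The `.*` pattern works because each `pattern` entry in the resource configuration is treated as a Java regular expression matched against resource paths, so it pulls every classpath resource into the image. A narrower pattern keeps the binary smaller at the risk of leaving files out. The sketch below only illustrates the matching with `java.util.regex`, assuming whole-path matching; the class name, the resource paths, and the narrower pattern are hypothetical examples, not an inventory of what Tika needs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ResourcePatternDemo {

    // Returns the paths that the regex matches in full.
    static List<String> matching(String regex, List<String> paths) {
        Pattern pattern = Pattern.compile(regex);
        return paths.stream()
                .filter(path -> pattern.matcher(path).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // HYPOTHETICAL resource paths, for illustration only.
        List<String> paths = Arrays.asList(
                "org/apache/pdfbox/resources/example.txt",
                "org/apache/poi/example.properties",
                "META-INF/services/example.Service");

        System.out.println("'.*' keeps: " + matching(".*", paths));
        System.out.println("narrow pattern keeps: "
                + matching("org/apache/pdfbox/.*", paths));
    }
}
```

With a narrow pattern, anything outside the listed prefix is left out of the image, which is consistent with the kind of runtime errors that only appear for certain documents.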

So, considering that I followed the documentation properly and that the native-image build finishes without any error, I’m really wondering what I’m doing wrong. This is supposed to parse any file and it doesn’t, because a lot of dependencies seem to be missing from the native image.

Is the Quarkus Tika extension really supposed to be usable, or do I have to create an extension myself with everything I need, as done in TikaProcessor.java (which doesn’t seem to include many things)?

For my use case, is there a way I can include everything from Tika?

Issue Analytics

Created 3 years ago
Comments: 18 (9 by maintainers)

Top GitHub Comments

pelletier197 commented, Aug 26, 2020

You’re right, the link seems dead. There is this Medium post that explains the base configuration well.

Otherwise, Oracle has more stable documentation here.

sberyozkin commented, Oct 3, 2020

@pelletier197 OK, thanks, I’ll then go ahead and merge the PR. Yes, I’m aware of that folder; I copied one of the files (renamed it) to integration-tests/tika a while back. I agree more exhaustive testing would help, but testing a large set as part of the regular Quarkus build may be problematic. For now I’m trying to address the issues as they are reported, though I’m late on a few of them (POI is not happy in native mode in particular). Thanks