Native-Image: Newly created project with Tika extension can't extract anything
Describe the bug
I tried creating a Tika project following the Quarkus documentation. I think any PDF would produce the same result, but here's the one I tried: americanexpress_01.pdf
Expected behavior
To give a little context, I should be able to extract text from any document Tika supports, so I simply need the whole Tika library bundled in the native application, which doesn't seem to be what is happening right now.
Actual behavior
With the documentation's example, when doing the curl POST with the PDF above, I get:
Error: Could not find referenced cmap stream Identity-H
Looking at the stack trace, I realise that this resource, which belongs to PDFBox (a Tika dependency), is missing from the bundled native image, along with the resources of all the other Tika dependencies…
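To confirm which resources actually made it into the native image, a small probe can be run both on the JVM and in the native executable and the two outputs compared. This is only a minimal sketch: ResourceProbe and its methods are hypothetical names, not part of Quarkus, Tika or PDFBox, and the default resource path is an assumption inferred from the error message, so check it against the FontBox jar on your classpath before relying on it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports which classpath resources can be opened at runtime. */
final class ResourceProbe {

    private ResourceProbe() {
    }

    /** Returns true if the absolute classpath resource (leading '/') can be opened. */
    static boolean isPresent(String absolutePath) {
        // A null stream means the resource is not visible; try-with-resources skips null.
        try (InputStream in = ResourceProbe.class.getResourceAsStream(absolutePath)) {
            return in != null;
        } catch (IOException e) {
            // Only close() can throw here; treat the resource as unusable.
            return false;
        }
    }

    /** Probes each path in order and records whether it was found. */
    static Map<String, Boolean> probe(List<String> paths) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String path : paths) {
            result.put(path, isPresent(path));
        }
        return result;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the cmap named in the error is loaded from this FontBox package.
        // Pass your own paths as arguments to override the default.
        List<String> paths = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("/org/apache/fontbox/cmap/Identity-H");
        probe(paths).forEach((path, found) ->
                System.out.println((found ? "present " : "MISSING ") + path));
    }
}
```

A path reported as present on the JVM but MISSING in the native executable points at a resource that has to be registered explicitly for the native build.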
Configuration
quarkus.package.type=native
quarkus.package.uber-jar=true
quarkus.log.console.enable=false
quarkus.native.add-all-charsets=true
quarkus.native.additional-build-args=-H:ReflectionConfigurationFiles=reflection-config.json,-H:ResourceConfigurationFiles=resources-config.json
What I tried
As a quick test, I tried adding a resources-config.json to the project:
{
  "resources": [
    {
      "pattern": ".*"
    }
  ]
}
With that in place it works: the native app can extract the text of my file (yeah). But trying different PDFs led to different errors, and this time they were runtime errors.
So, considering that I followed the documentation properly and that the native-image build finishes without any error, I'm really wondering what I'm doing wrong. This is supposed to parse any file and it doesn't, because A LOT of dependencies seem to be missing from the native image.
Is the Quarkus Tika extension really supposed to be usable as is, or do I have to create an extension myself with everything I need, as is done in TikaProcessor.java (which doesn't seem to include many things)?
For my use case, is there a way I can include everything from Tika?
Issue Analytics
- Created 3 years ago
- Comments: 18 (9 by maintainers)
You're right, the link seems dead. There is this Medium post that explains the base configuration well.
Otherwise, Oracle has more stable documentation here.
@pelletier197 OK, thanks, I'll go ahead and merge the PR then. Yes, I'm aware of that folder; I copied one of the files (renamed it) to integration-tests/tika a while back. I agree more exhaustive testing would help, but testing a large set as part of the regular Quarkus build may be problematic. For now I'm trying to address the issues as they are reported, though I'm late on a few of them (POI in particular is not happy in native mode). Thanks