Tika not working with custom jar path

I’m working on a Python module that uses Tika, and I’m trying to use a custom jar file so that the jar is not downloaded each time.

I have already placed the jar file and the md5 file inside the module:

my_module
========
      __init__.py
      package1
      package2
      package3
          __init__.py
          pdf.py
          tika-server.jar
          tika-server.jar.md5

pdf.py
====

import os
from tika import tika, parser
tika.TikaJarPath = os.path.dirname(__file__)

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

Tika does not work, and this is the output:

 [WARNI]  Failed to see startup log message; retrying...
 [WARNI]  Failed to see startup log message; retrying...
 [WARNI]  Failed to see startup log message; retrying...
 [ERROR]  Tika startup log message not received after 3 tries.

The problem happens when the jar file is inside the module. It works if I specify another location, but that’s not an option: when I deploy the Python module, the module needs to contain the jar file.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:14 (1 by maintainers)

Top GitHub Comments

RafayGhafoor commented, Jun 20, 2019 (1 reaction)

@amonaldo, you need to pass an absolute path to dirname, which would look like this:

os.path.join(os.getcwd(), __file__)

Moreover, you need to override three variables of the tika module (log_path, TikaJarPath, and TikaFilesPath) to make your modified script work.

Modify your pdf.py (updating the filename):

import os
from tika import tika, parser

abs_path = os.path.dirname(os.path.join(os.getcwd(), __file__)) # Store the absolute path of your file (containing .jar)

# Update the required variables
tika.log_path = os.getenv('TIKA_LOG_PATH', abs_path)
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaFilesPath = abs_path  # same directory as computed above


def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']


if __name__ == "__main__":
    pdf_name = "TEST_FILE_NAME" # filename to test
    print(get_pdf_text(pdf_name))
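
A relative `__file__` is one plausible reason the original `os.path.dirname(__file__)` fails: the dirname of a bare filename is an empty string. The sketch below is not from the thread. `resolve_jar_dir` and `missing_tika_files` are hypothetical helper names, and the two file names are the ones from the module layout in the question. It resolves the directory to an absolute path and reports which expected files are absent, which may help confirm the path before Tika is started.

```python
import os


def resolve_jar_dir(module_file):
    """Return the absolute directory that contains module_file."""
    return os.path.dirname(os.path.abspath(module_file))


def missing_tika_files(directory, names=("tika-server.jar", "tika-server.jar.md5")):
    """Return the expected file names that are not present in directory."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


# A bare relative filename has an empty dirname, so the original
# `os.path.dirname(__file__)` can end up as '' instead of a real directory.
print(repr(os.path.dirname("pdf.py")))  # prints ''

# Resolving through abspath first always yields an absolute directory.
print(os.path.isabs(resolve_jar_dir("pdf.py")))  # prints True
```

Inside pdf.py, calling `missing_tika_files(resolve_jar_dir(__file__))` should return an empty list when both files sit next to the module.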

amonaldo commented, Jun 21, 2019 (0 reactions)

@RafayGhafoor Thanks for your time, but I have found a solution, although it’s not perfect.

I realized that I can get the user’s home directory using the os module:

tika.TikaJarPath = os.path.expanduser("~")

This way, Tika works without any problems.
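
The same idea can be expressed through the `TIKA_PATH` and `TIKA_LOG_PATH` environment variables that the earlier comment reads with `os.getenv`. The following is a minimal sketch, assuming tika reads those variables when it is first imported; `point_tika_at` is a hypothetical helper, not part of the tika package.

```python
import os


def point_tika_at(directory):
    """Set TIKA_PATH and TIKA_LOG_PATH to an absolute directory.

    Assumption: tika reads these variables at import time, so this
    should run before `from tika import parser`.
    """
    directory = os.path.abspath(os.path.expanduser(directory))
    os.environ["TIKA_PATH"] = directory
    os.environ["TIKA_LOG_PATH"] = directory
    return directory


# Use the home directory, as in the workaround above.
jar_dir = point_tika_at("~")

# from tika import parser  # import only after the environment is set
```

Setting the variables in the environment rather than on the module keeps the path configurable at deploy time without editing pdf.py.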
