Tika-Python on Windows: Tika server returns status 503

I am testing Tika-Python on my Windows 10 laptop, but I cannot get it to work. Using the following Python script (directly taken from this site, with ‘path/to/file’ naturally changed to a correct filepath):

"""Test Apache Tika."""

import tika
tika.initVM()
from tika import parser

parsed = parser.from_file('path/to/file')

print(parsed['metadata'])
print(parsed['content'])

I get the following:

$ python test-tika.py
2018-11-22 10:24:34,112 [MainThread  ] [WARNI]  Tika server returned status: 503
Traceback (most recent call last):
  File "test-tika.py", line 7, in <module>
    parsed = parser.from_file('C:\\Users\\Christophe.Grandsire\\Cases\\Data\\AI in FDP\\Raw Pilot data\\magnus C&C.pdf')
  File "C:\Users\Christophe.Grandsire\.virtualenvs\Tika-OOOIfOBP\lib\site-packages\tika\parser.py", line 40, in from_file
    return _parse(jsonOutput)
  File "C:\Users\Christophe.Grandsire\.virtualenvs\Tika-OOOIfOBP\lib\site-packages\tika\parser.py", line 77, in _parse
    realJson = json.loads(jsonOutput[1])
  File "c:\program files\python37\Lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "c:\program files\python37\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\program files\python37\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Here are the contents of tika.log, which show that the tika-server JAR was downloaded correctly, but that every attempt to use it returns a 503 status code:

2018-11-21 15:24:59,746 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\CHRIST~1.GRA\AppData\Local\Temp\tika-server.jar.
2018-11-21 15:27:44,758 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\CHRIST~1.GRA\AppData\Local\Temp\tika-server.jar.md5.
2018-11-21 15:27:45,319 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2018-11-21 15:28:03,322 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-21 15:35:25,203 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-21 15:35:56,649 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-22 10:23:32,192 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2018-11-22 10:23:55,192 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-22 10:24:34,112 [MainThread  ] [WARNI]  Tika server returned status: 503

And here is the latest tika-server.log file:

nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO  Starting Apache Tika 1.19 server
INFO  Setting the server's publish address to be http://0.0.0.0:9998/
INFO  Logging initialized @1629ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 11.0.1+13-LTS
INFO  Started ServerConnector@63648ee9{HTTP/1.1,[http/1.1]}{0.0.0.0:9998}
INFO  Started @2108ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@1536602f{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://0.0.0.0:9998/

Any idea what is going on here?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

2 reactions
BorisWiegand commented, Apr 3, 2020

> @BorisWiegand can you take a look at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR and try to interact with Tika server that way? Does it work? That will isolate whether the problem is in Python or in your server setup.

Thank you very much for this hint. I am behind a corporate proxy and had it misconfigured, so my Python script tried to reach the local Tika server through the proxy. It was the proxy server, not the Tika server, that returned status code 503. I have now fixed my proxy settings and everything works as expected.
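
For readers hitting the same symptom, one possible way to keep local traffic away from a corporate proxy is to list the local host names in the NO_PROXY environment variable before Tika-Python makes any request. The snippet below is only a sketch under assumptions: the helper name `bypass_proxy_for_local_hosts` is hypothetical, and it assumes the HTTP client used by Tika-Python honours NO_PROXY (the `requests` library does). It is not necessarily the change made in this comment, which only says the proxy settings were fixed.

```python
import os


def bypass_proxy_for_local_hosts(hosts=("localhost", "127.0.0.1")):
    """Add hosts to NO_PROXY/no_proxy while keeping any existing entries.

    Hypothetical helper for illustration; not part of Tika-Python.
    """
    existing = os.environ.get("NO_PROXY") or os.environ.get("no_proxy") or ""
    entries = [entry.strip() for entry in existing.split(",") if entry.strip()]
    for host in hosts:
        if host not in entries:
            entries.append(host)
    value = ",".join(entries)
    # Set both spellings, since tools differ in which one they read.
    os.environ["NO_PROXY"] = value
    os.environ["no_proxy"] = value
    return value
```

Calling `bypass_proxy_for_local_hosts()` at the top of the script, before `parser.from_file(...)`, should be enough in that setup. Setting the variable in the shell or system settings would have the same effect.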

1 reaction
chrismattmann commented, Mar 27, 2020

@BorisWiegand can you take a look at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR and try to interact with Tika server that way? Does it work? that will isolate the problem to whether or not it’s an issue in python or your server setup.
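
A quick way to run this kind of isolation test from Python itself is to query the server directly with proxies disabled. The sketch below uses only the standard library. The function name `probe_without_proxy` is hypothetical, and the URL in the usage note assumes the server listens on port 9998 (as in the tika-server.log above) and answers GET requests on `/tika`.

```python
import urllib.error
import urllib.request


def probe_without_proxy(url, timeout=10):
    """Return the HTTP status of a GET to url, ignoring any configured proxy.

    Hypothetical helper for illustration; not part of Tika-Python.
    """
    # An empty ProxyHandler mapping disables proxy lookup from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (such as 503) arrive as exceptions; report the code.
        return err.code
```

If `probe_without_proxy("http://localhost:9998/tika")` returns 200 while the normal script still logs 503, the 503 is probably coming from something between the script and the server, such as a proxy, rather than from Tika itself.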
