Tika 1.24.1 and gzip compression

Hello, Tika released 1.24.1 which allows gzip compression of input and output streams for tika-server.

What do you think of making it the default for the output stream? Since requests automatically decodes gzip and deflate transfer-encodings, it would just be a matter of adding the header Accept-Encoding: gzip, deflate to the rmeta, tika, and rmeta/text services.
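As a rough illustration of what "automatically decodes" means on the client side, the sketch below undoes a gzip or deflate coding by hand using only the standard library. The function name `decode_body` and the sample payload are made up for this example, and this is an approximation of what an HTTP client does, not the actual code in requests.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Undo a gzip or deflate content coding; return other bodies unchanged."""
    coding = content_encoding.strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if coding == "deflate":
        try:
            # "deflate" is normally a zlib-wrapped stream (RFC 1950).
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body


# Hypothetical stand-in for a server response body.
text = b'{"X-TIKA:content": "hello"}'
print(decode_body(gzip.compress(text), "gzip") == text)
```

With a client that does this transparently, the caller only has to opt in through the Accept-Encoding request header.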

I can provide a PR.

Cheers, Carina

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
carantunes commented, Jun 29, 2020

Hi,

After some tests and benchmarks, I’ve reconsidered whether it should be changed by default. Gzip compression has the upside of improving transfer speed and bandwidth utilisation (~75%), at the cost of some CPU utilisation. For large files it may be an improvement.
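To make the trade-off concrete, here is a minimal sketch that measures bytes saved against CPU time spent for one payload. The helper `measure_gzip` and the synthetic payload are assumptions for illustration only. Repetitive text like this compresses far better than real documents, so the numbers it prints are not comparable to the ~75% figure above.

```python
import gzip
import time


def measure_gzip(payload: bytes, level: int = 6):
    """Return (compressed bytes, fraction of bytes saved, CPU seconds spent)."""
    start = time.process_time()
    compressed = gzip.compress(payload, compresslevel=level)
    cpu_seconds = time.process_time() - start
    saved = 1 - len(compressed) / len(payload)
    return compressed, saved, cpu_seconds


# Synthetic, highly repetitive payload (an assumption, not a real document).
payload = b"Apache Tika extracts text and metadata from documents. " * 20000
compressed, saved, cpu_seconds = measure_gzip(payload)
print(f"{len(payload)} -> {len(compressed)} bytes, "
      f"{saved:.0%} saved, {cpu_seconds:.4f}s CPU")
```

Running the same measurement on representative files is probably the only reliable way to decide whether compression pays off for a given deployment.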

Another difference is that files sent to Tika with compression will have a different Content-Type returned (i.e., from ‘application/pdf’ to [‘application/gzip’, ‘application/pdf’]).

Instead, I believe it would be sufficient to support sending/receiving gzip format by releasing 1.24.1.

Input compression can be achieved with gzip or zlib:

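    # Fragments of a function body; they assume `import gzip`, `import zlib`
    # and `import tika.parser` at module level.
    with open(file, 'rb') as file_obj:
        return tika.parser.from_buffer(zlib.compress(file_obj.read()))

...

    with open(file, 'rb') as file_obj:
        return tika.parser.from_buffer(gzip.compress(file_obj.read()))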
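The two calls above do not produce the same bytes: `gzip.compress` wraps the deflate stream in a gzip container, while `zlib.compress` uses the smaller zlib wrapper. The sketch below tells them apart by their leading bytes. The helper `describe_compressed` and the sample bytes are hypothetical, and the sketch says nothing about how tika-server itself detects either format.

```python
import gzip
import zlib


def describe_compressed(data: bytes) -> str:
    """Classify a buffer as 'gzip', 'zlib' or 'unknown' from its header bytes."""
    if data[:2] == b"\x1f\x8b":
        # gzip magic number (RFC 1952).
        return "gzip"
    if (len(data) >= 2
            and (data[0] & 0x0F) == 8
            and ((data[0] << 8) | data[1]) % 31 == 0):
        # zlib header: compression method 8 (deflate) plus a valid check value (RFC 1950).
        return "zlib"
    return "unknown"


raw = b"%PDF-1.4 stand-in for real file contents"  # made-up sample bytes
print(describe_compressed(gzip.compress(raw)))
print(describe_compressed(zlib.compress(raw)))
print(describe_compressed(raw))
```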

And output with the header:

    with open(file, 'rb') as file_obj:
        return tika.parser.from_file(file_obj, headers={'Accept-Encoding': 'gzip, deflate'})

A sample of the benchmark results (using pytest-benchmark) for a 100 MB ppt file, run first with the default timer and then with --benchmark-timer=time.process_time, which does not include time spent sleeping or waiting for I/O: [screenshot of benchmark results, 2020-06-29, not reproduced]
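The difference between the two timers can be seen without pytest-benchmark. The sketch below times the same sleeping function with a wall-clock timer and with `time.process_time`. It assumes the default benchmark timer behaves like `time.perf_counter`, and the helper names `timed` and `wait_like_io` are made up for this example.

```python
import time


def timed(func, timer):
    """Run func once and return the elapsed time according to timer."""
    start = timer()
    func()
    return timer() - start


def wait_like_io():
    # Sleeping stands in for waiting on a network response.
    time.sleep(0.05)


wall = timed(wait_like_io, time.perf_counter)  # includes the time spent sleeping
cpu = timed(wait_like_io, time.process_time)   # counts only CPU time of this process
print(f"wall clock: {wall:.4f}s, process time: {cpu:.4f}s")
```

For a client waiting on a server, the wall-clock figure reflects transfer time, while the process-time figure isolates the client-side CPU cost such as compression.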

0 reactions
carantunes commented, Aug 17, 2020

@tballison Pardon my delay, I was on vacation. Thanks for the input, I had not noticed there was a difference.

After some debugging, it looks like if I send -H "Content-Encoding: application/gzip" to rmeta I get a different result than if I send -H "Content-Encoding: gzip" or no header at all. I’ve created a ticket with more details if you want to investigate or explain it further: https://issues.apache.org/jira/browse/TIKA-3169
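One thing worth noting about the two header values: `gzip` is an HTTP content-coding token, whereas `application/gzip` is a media type that would normally appear in Content-Type. The sketch below flags header values that are not content codings. The set is a partial list written from memory of the IANA registry, and the helper `check_content_encoding` is hypothetical. It does not reproduce how tika-server handles either value.

```python
# Partial list of HTTP content codings (an assumption, not the full registry).
CONTENT_CODINGS = {
    "gzip", "x-gzip", "deflate", "compress", "x-compress",
    "br", "zstd", "identity",
}


def check_content_encoding(value: str) -> list:
    """Return the tokens in a Content-Encoding value that are not known codings."""
    tokens = [t.strip().lower() for t in value.split(",") if t.strip()]
    return [t for t in tokens if t not in CONTENT_CODINGS]


print(check_content_encoding("gzip"))
print(check_content_encoding("application/gzip"))
```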


