okhttp fails with IOException: gzip finished without exhausting source but GZIPInputStream works
What kind of issue is this?
- Bug report. If you’ve found a bug, spend the time to write a failing test. Bugs with tests get fixed. Here’s an example: https://gist.github.com/swankjesse/981fcae102f513eb13ed
This issue can’t be reproduced in a test. I’ll do my best to explain.
>> GET http://myserver.mycompany.com/.../businesses.20180104.json.gz
<< 200 OK
<< connection -> [keep-alive]
<< accept-ranges -> [bytes]
<< content-disposition -> [attachment; filename="businesses.20180104.json.gz"; filename*=UTF-8''businesses.20180104.json.gz]
<< content-type -> [application/x-gzip]
<< content-length -> [3384998203]
<< date -> [Fri, 05 Jan 2018 00:43:32 GMT]
<< etag -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]
<< last-modified -> [Thu, 04 Jan 2018 23:54:26 GMT]
<< x-artifactory-id -> [9732f56568ea1e3d:59294f65:160b8066066:-8000]
<< x-checksum-md5 -> [451ca1b1414e7b511de874e61fd33eb2]
<< x-artifactory-filename -> [businesses.20180104.json.gz]
<< server -> [Artifactory/5.3.0]
<< x-checksum-sha1 -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]
As you can see, the server doesn’t set a Content-Encoding: gzip header, so I set that header myself in an interceptor.
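The interceptor itself isn’t shown in the issue. A minimal sketch of one way to do this, assuming a network interceptor so that OkHttp’s built-in gzip decoding (GzipSource) takes over; the class names are hypothetical:

```java
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.OkHttpClient;
import okhttp3.Response;

/** Hypothetical: marks the response as gzip so OkHttp's own decoding unwraps the body. */
final class ForceGzipEncodingInterceptor implements Interceptor {
  @Override public Response intercept(Chain chain) throws IOException {
    Response response = chain.proceed(chain.request());
    // The server sends compressed bytes but no Content-Encoding header,
    // so declare it before the response travels back up the chain.
    return response.newBuilder()
        .header("Content-Encoding", "gzip")
        .build();
  }
}

final class ClientFactory {
  static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        .addNetworkInterceptor(new ForceGzipEncodingInterceptor())
        .build();
  }
}
```

For this approach, registering it as a network interceptor matters: in OkHttp 3.x the transparent gzip handling sits above the network interceptors, so a header added in an application interceptor would be seen too late.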
Each record is a newline-delimited JSON string that is inserted into Couchbase; there are around 12 million records in total. Using OkHttp, processing fails after about 130,000 records with the following exception:
Caused by: java.io.IOException: gzip finished without exhausting source
at okio.GzipSource.read(GzipSource.java:100)
at okio.RealBufferedSource$1.read(RealBufferedSource.java:430)
However, if I don’t set the Content-Encoding header (thus skipping GzipSource) and instead wrap the input stream in GZIPInputStream, everything works as expected. I’ve also tried setting Transfer-Encoding: chunked on the response and removing the Content-Length header, but to no avail.
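The workaround code isn’t shown in the issue either; a minimal sketch of it, assuming the records are consumed line by line (class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

final class GzipJsonLinesReader {
  /** Reads the raw .gz body (no Content-Encoding set) and decompresses it manually. */
  static void readAll(OkHttpClient client, String url) throws IOException {
    Request request = new Request.Builder().url(url).build();
    try (Response response = client.newCall(request).execute();
         BufferedReader reader = new BufferedReader(new InputStreamReader(
             new GZIPInputStream(response.body().byteStream()), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // One newline-delimited JSON record per line; e.g. insert into Couchbase here.
      }
    }
  }
}
```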
So, the question is: if GZIPInputStream doesn’t have a problem, why does GzipSource? And given that it does, why won’t it report what it thinks the issue is? I have a test that runs against a smaller file of 100 records, and it works.
I’ve seen https://github.com/square/okhttp/issues/3457, but unlike the reporter, it’s not possible for me to capture the hex body of a 3.4 GB stream.
Top GitHub Comments
I introduced this behavior and can explain it.
Gzip is a self-terminating format: the content of the stream itself indicates when you’ve read everything.
If there is ever data beyond that self-reported end, that data is effectively unreachable. This is potentially problematic for two reasons:
I made things strict to help detect problems like this. It’s possible this check is too strict and we should silently ignore the extra data.
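A small self-contained sketch (not from the issue) of the difference in strictness: given gzip data followed by extra bytes, GZIPInputStream silently ignores the trailing bytes, while Okio’s GzipSource throws the “gzip finished without exhausting source” error once the gzip trailer has been consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import okio.Buffer;
import okio.GzipSource;
import okio.Okio;

public final class TrailingBytesDemo {
  public static void main(String[] args) throws Exception {
    // Gzip "hello", then append bytes that are not part of the gzip stream.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzipOut = new GZIPOutputStream(baos)) {
      gzipOut.write("hello".getBytes(StandardCharsets.UTF_8));
    }
    baos.write(new byte[] { 0x01, 0x02, 0x03, 0x04 }); // trailing garbage
    byte[] bytes = baos.toByteArray();

    // GZIPInputStream: decodes "hello" and silently ignores the trailing bytes.
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
      byte[] buf = new byte[16];
      int n = in.read(buf);
      System.out.println("GZIPInputStream: " + new String(buf, 0, n, StandardCharsets.UTF_8));
    }

    // GzipSource: throws IOException("gzip finished without exhausting source")
    // because the underlying source still has unread bytes after the trailer.
    Buffer source = new Buffer().write(bytes);
    System.out.println("GzipSource: " + Okio.buffer(new GzipSource(source)).readUtf8());
  }
}
```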
@yschimke I think there was a mistake in how I took the previous hex dump. This time, I did the following:
1. Put a breakpoint on line 3 of JdkZlibDecoder.decode above.
2. For every invocation of it, dump the contents of the ByteBuf to a file by manually invoking the following method that I wrote: ByteBufUtils.dumpByteBuf("yelp-dump.txt", in) (a sketch of what such a helper might look like follows this list).
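ByteBufUtils.dumpByteBuf isn’t included in the issue; a plausible sketch of such a helper, assuming Netty’s ByteBufUtil.hexDump and append-to-file semantics:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class ByteBufUtils {
  /** Appends a hex dump of the buffer's readable bytes to a file; leaves the reader index alone. */
  static void dumpByteBuf(String path, ByteBuf in) throws IOException {
    String hex = ByteBufUtil.hexDump(in, in.readerIndex(), in.readableBytes());
    Files.write(Paths.get(path),
        (hex + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }
}
```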
That produced the attached dump, and I see that it starts with 1f 8b and contains the same sequence more than once. Does this prove my theory of multiple streams?
Attachment: yelp-dump.txt
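Not part of the issue: one rough way to test the multiple-members theory directly against the raw .gz file (rather than a hex dump) is to scan for candidate gzip member headers, i.e. the magic bytes 1f 8b followed by the deflate method byte 08. Matches can be false positives, since those bytes can also appear inside compressed data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class GzipMemberScanner {
  public static void main(String[] args) throws IOException {
    byte[] data = Files.readAllBytes(Paths.get(args[0]));
    int candidates = 0;
    for (int i = 0; i + 2 < data.length; i++) {
      // 1f 8b is the gzip magic number; 08 is the deflate compression method.
      if ((data[i] & 0xff) == 0x1f && (data[i + 1] & 0xff) == 0x8b && data[i + 2] == 0x08) {
        System.out.printf("candidate gzip member header at offset %d%n", i);
        candidates++;
      }
    }
    System.out.println("total candidates: " + candidates);
  }
}
```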