
Compressed file ended before the end-of-stream marker was reached

See original GitHub issue

I’m trying to download files from a third-party API and upload them to S3.

from smart_open import open

# url, BUCKET and fpath are defined elsewhere in the script
with open(url, 'rb') as fin:
    with open(f"s3://{BUCKET}/{fpath}", 'wb') as fout:
        for line in fin:
            fout.write(line)

smart_open mostly works fine with other files, but sometimes I get an error:

ERROR:root:Compressed file ended before the end-of-stream marker was reached

For example, I get the error with https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz even though I can download the same file in a browser without issues.

My log:

INFO:root:Opened https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:smart_open.s3:smart_open.s3.MultipartWriter('my-bucket', 'tardis/table_20200801.csv.gz'): uploading part_num: 1, 52456530 bytes (total 0.049GB)
INFO:smart_open.s3:smart_open.s3.MultipartWriter('my-bucket', 'tardis/table_20200801.csv.gz'): uploading part_num: 2, 52437910 bytes (total 0.098GB)
INFO:smart_open.s3:smart_open.s3.MultipartWriter('my-bucket', 'tardis/table_20200801.csv.gz'): uploading part_num: 3, 5235580 bytes (total 0.103GB)
ERROR:root:Compressed file ended before the end-of-stream marker was reached

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
ivan-yankovyi commented, Jun 23, 2021
from smart_open import open


BUCKET = 'my-bucket'

def load_file(url, fpath):
    with open(url, 'rb') as fin:
        with open(f"s3://{BUCKET}/{fpath}", 'wb') as fout:
            for line in fin:
                fout.write(line)

load_file('https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz', 'tardis/table_20200801.csv.gz')

Traceback (most recent call last):
  File "tardis_example.py", line 12, in <module>
    load_file('https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz', 'tardis/table_20200801.csv.gz')
  File "tardis_example.py", line 9, in load_file
    for line in fin:
  File "/usr/lib/python3.8/gzip.py", line 390, in readline
    return self._buffer.readline(size)
  File "/usr/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.8/gzip.py", line 498, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
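
The gzip frames in this traceback appear because smart_open transparently decompresses the stream when the URL ends in .gz (and recompresses it for the .gz S3 key), so a connection that drops mid-download surfaces as a gzip EOFError rather than a network error. If the goal is just to copy the file to S3 byte-for-byte, the decompress/recompress round trip can be skipped. A minimal sketch, assuming a smart_open version recent enough to support the compression argument (older releases used the since-deprecated ignore_ext=True flag):

from smart_open import open

BUCKET = 'my-bucket'

def copy_raw(url, fpath):
    # compression='disable' turns off extension-based (de)compression on
    # both ends, so the gzipped bytes are streamed through unchanged.
    with open(url, 'rb', compression='disable') as fin:
        with open(f"s3://{BUCKET}/{fpath}", 'wb', compression='disable') as fout:
            for chunk in iter(lambda: fin.read(1024 * 1024), b''):
                fout.write(chunk)

Note this only changes how a truncated download fails; it does not prevent the disconnect diagnosed in the next comment.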
0 reactions
mpenkov commented, Oct 11, 2021

“@ivan-yankovyi Is it possible that the connection that’s reading is breaking in some way?”

“I don’t think so. As I remember, the problem occurs with the same files.”

It actually is the inbound connection breaking. Here’s code that reproduces it:

import http.client
import io
import logging
import time

import requests

logging.basicConfig(level=logging.ERROR)
logging.getLogger('smart_open').setLevel(logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)
http.client.HTTPConnection.debuglevel = 1

url = 'https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz'
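# identity: ask the server for the raw bytes, with no transfer compression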
_HEADERS = {'Accept-Encoding': 'identity'}
response = requests.get(url, stream=True, headers=_HEADERS, timeout=None)
if not response.ok:
    response.raise_for_status()

counter = 0
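# 50 MB chunks mirror smart_open's default multipart part size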
part_remaining = 50 * 1024 ** 2
for buf in response.iter_content(io.DEFAULT_BUFFER_SIZE):
    counter += len(buf)
    part_remaining -= len(buf)
    if part_remaining < 0:
        percent = counter / int(response.headers['Content-Length']) * 100
        print(f'read {counter // 1024**2} MB ({int(percent)}%)')
        # emulate a multipart part upload
        time.sleep(30)
        part_remaining = 50 * 1024 ** 2

percent = counter / int(response.headers['Content-Length']) * 100
print(f'read {counter // 1024**2} MB ({int(percent)}%)')

If you run that several times, you’ll see that it never makes it to the end of the download (the file is around 570 MB).

$ python 627-requests.py
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): datasets.tardis.dev:443
send: b'GET /v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz HTTP/1.1\r\nHost: datasets.tardis.dev\r\nUser-Agent: python-requests/2.26.0\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 11 Oct 2021 14:02:08 GMT
header: Content-Type: text/csv
header: Content-Length: 597911151
header: Connection: keep-alive
header: Cache-Control: public, max-age=155520000, immutable
header: Content-Disposition: attachment; filename="deribit_options_chain_2020-08-01_OPTIONS.csv.gz"
header: x-name: v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=BZEqQ9RIWgpHLntzoDBvP5otp06jz1MkZ%2B%2FVwKXHzAOUpaWri2tBuoCbbGCtuJJUQUV88KADG4nr41BTxhiiFDVOWMyv9U2hnvX7cAzT%2B2MjcisnSNgV9oksuGCw5fr3Z6aKD3h2FW5laq6bbXFHd6U%3D"}],"group":"cf-nel","max_age":604800}
header: NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 69c89fcb5d311f33-NRT
DEBUG:urllib3.connectionpool:https://datasets.tardis.dev:443 "GET /v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz HTTP/1.1" 200 597911151
read 50 MB (8%)
read 100 MB (17%)
read 103 MB (18%)

If you reduce the sleep time, e.g. changing time.sleep(30) to time.sleep(1), you’ll be able to stream that file from the server successfully.

I think the server is dropping the connection when it decides the client has been idle for too long. Unfortunately, that “idle” time is exactly what smart_open spends uploading a finished multipart part to S3. You can shorten the idle periods by using a smaller multipart part size, e.g. adapting your original code:

import boto3
from smart_open import open

BUCKET = 'smart-open'
session = boto3.Session(profile_name='smart_open')
client = session.client('s3')


def load_file(url, fpath):
    with open(url, 'rb') as fin:
        with open(f"s3://{BUCKET}/{fpath}", 'wb', transport_params={'client': client, 'min_part_size': 5*1024**2}) as fout:
            for i, line in enumerate(fin):
                if i and i % 1000000 == 0:
                    print(f'read line #{i}')
                fout.write(line)

load_file('https://datasets.tardis.dev/v1/deribit/options_chain/2020/08/01/OPTIONS.csv.gz', 'tardis/table_20200801.csv.gz')

Here I’m using a 5 MB part size instead of the default 50 MB.

I’m going to close this for now because it really isn’t a problem with smart_open. The server is flat out hanging up on us in a way that we cannot easily detect.
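
If smaller parts still don’t keep the connection alive, one way to sidestep the idle timeout entirely is to decouple the two transfers: drain the download into a local temporary file first (one fast, uninterrupted read), then upload that file to S3. A minimal sketch along those lines, reusing the BUCKET and load_file names from the examples above:

import shutil
import tempfile

from smart_open import open

BUCKET = 'my-bucket'

def load_file(url, fpath):
    with tempfile.TemporaryFile() as tmp:
        # Phase 1: read the HTTP stream as fast as possible, so the server
        # never sees the client idling while an S3 part is uploading.
        with open(url, 'rb', compression='disable') as fin:
            shutil.copyfileobj(fin, tmp)
        # Phase 2: upload from local disk; multipart pauses can no longer
        # affect the (already finished) download.
        tmp.seek(0)
        with open(f"s3://{BUCKET}/{fpath}", 'wb', compression='disable') as fout:
            shutil.copyfileobj(tmp, fout)

The trade-off is temporary local disk usage (about 570 MB for this file) in exchange for a download that upload pauses cannot interrupt.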
