
Upload big io.Buffer to S3

See original GitHub issue

Problem description

I am requesting a set of files, zipping them, and then uploading the zipped data to S3 using smart_open and an io.BytesIO() object. The size of the compressed data exceeds the 5 GB S3 limit, and I know that in this case a multipart approach should be used (just like in boto3). I am using smart_open.s3.open() for this, but I do not completely understand how to configure the multipart upload to avoid the EntityTooLarge error; I keep getting it with my code. Should I split the file beforehand, or specify the number of parts? Checking the source code, I don't see a num_parts option.

 (EntityTooLarge) when calling the UploadPart operation: Your proposed upload exceeds the maximum allowed size

My function is the following:

import io
import logging
import multiprocessing
import zipfile
from datetime import datetime
from itertools import product

import boto3
import smart_open.s3

# datetime_range, requests_to_s3 and URL are helpers defined elsewhere in the
# module this function comes from (not shown in the issue).


def stream_time_range_s3(start_date,
                         end_date,
                         aws_key,
                         aws_secret,
                         aws_bucket_name,
                         key,
                         max_workers,
                         delta):
    """
    Download individual month directory of .grd files to local directory.

    This function will download using the ftplib all the .grd files between the
    start_date and the end_date. All dates in the NOAA NARR server are
    stored following this order:
        data
        ├── year/month
            ├── year/month/day01
            ├── year/month/day02

    Here we download the monthly directory with the user-defined dates in the
    start and end dates. 

    Params:
        - start_year str: year to start download.
        - end_year str: year to stop download.
    """

    logger = logging.getLogger(__name__)

    # Accept either a datetime or a 'YYYY-MM-DD' string for start_date.
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
    elif not isinstance(start_date, datetime):
        raise ValueError(f'{start_date} is not in the correct format or not a valid type')


    session = boto3.Session(
        aws_access_key_id=aws_key,
        aws_secret_access_key=aws_secret
    )

    base_url = 'https://nomads.ncdc.noaa.gov/data/narr'
    times = ['0000', '0300', '0600', '0900', '1200', '1500', '1800', '2100']
 
    if delta is None:
        dates = datetime_range(start_date, end_date, {'days':1})
    else:
        dates = datetime_range(start_date, end_date, delta)

    urls_time_range = []
    for day, time in product(dates, times):
        file_name = f'narr-a_221_{day.strftime("%Y%m%d")}_{time}_000.grb'
        url = URL(base_url, day.strftime('%Y%m'), day.strftime('%Y%m%d'))
        urls_time_range.append(str(URL(url, file_name)))

    with multiprocessing.Pool(max_workers) as p:
        results = p.map(requests_to_s3, urls_time_range, chunksize=1)

        print('Finish download')
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED, compresslevel=1) as zf:
            for content_file_name, content_file_result in results:
                try:
                    zf.writestr(content_file_name,
                                content_file_result)
                except Exception as exc:
                    print(exc)

        print('Finish zipping - Upload Start')
        # The whole zipped buffer is handed to a single write() call here; see
        # the discussion below about the 5 GB per-part limit.
        with smart_open.s3.open(aws_bucket_name, key, 'wb', session=session) as so:
            so.write(buf.getvalue())

    return None

You can test the function by running:

from datetime import datetime

a = stream_time_range_s3(start_date=datetime(2012, 1, 1),
                         end_date=datetime(2012, 2, 1),
                         aws_key=aws_key,
                         delta=None,
                         aws_secret=aws_secret,
                         aws_bucket_name=bucket_name,
                         key='wind_2012_test_parts.zip',
                         max_workers=10)
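
The problem description mentions the multipart approach "just like in boto3"; for reference, boto3's managed transfer API splits a large file object into parts automatically. A minimal sketch, assuming the same buf, credentials and bucket as in the function above (the 256 MB part size is an arbitrary choice):

import boto3
from boto3.s3.transfer import TransferConfig

session = boto3.Session(aws_access_key_id=aws_key,
                        aws_secret_access_key=aws_secret)
s3_client = session.client('s3')

# 256 MB parts; boto3 takes care of the multipart bookkeeping and reassembly.
config = TransferConfig(multipart_threshold=256 * 1024 * 1024,
                        multipart_chunksize=256 * 1024 * 1024)

buf.seek(0)  # buf is the io.BytesIO holding the zipped data
s3_client.upload_fileobj(buf, bucket_name, 'wind_2012_test_parts.zip', Config=config)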

Versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.1 (default, Feb 27 2019, 18:57:54)
[Clang 10.0.0 (clang-1000.10.44.4)]
smart_open 1.8.4

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 10 (2 by maintainers)

Top GitHub Comments

4 reactions
piskvorky commented, Mar 25, 2021

> so just raise an exception when one fout.write(buff) is called with a buff > 5GB?

smart_open’s promise is to handle large uploads (and downloads) transparently. So instead of raising an exception, isn’t it better to split the chunk into multipart pieces, each smaller than 5GB?

IIRC smart_open is already handling multipart uploads transparently under the hood, so this should be no different.
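
To illustrate the splitting suggested here, the following is a sketch of the idea only, not smart_open's actual implementation (ChunkingWriter and MAX_PART_SIZE are made-up names): a writer could slice any oversized write() into pieces below S3's per-part ceiling before handing them to the part uploader.

# Hypothetical sketch, not smart_open code.
MAX_PART_SIZE = 5 * 1024 ** 3  # S3 allows at most 5 GiB per uploaded part

class ChunkingWriter:
    def __init__(self, upload_part):
        self._upload_part = upload_part  # callable that ships one part to S3

    def write(self, data):
        view = memoryview(data)
        # Slice an oversized buffer so that no single part exceeds the limit.
        for start in range(0, len(view), MAX_PART_SIZE):
            self._upload_part(view[start:start + MAX_PART_SIZE])
        return len(view)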

2 reactions
davidparks21 commented, Feb 11, 2022

I started working on a solution, but it wasn't a trivial change the way the code is currently structured. I ran out of time and abandoned the effort back when I posted. I'm not sure about the current status, but my solution was to simply chunk the calls to file.write(n), which was pretty trivial in my use case.
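
A minimal sketch of that caller-side workaround, applied to the code from the issue: read the zipped buffer back in fixed-size slices so that no single write() hands smart_open more than one part's worth of data (the 256 MB chunk size is an arbitrary choice):

CHUNK_SIZE = 256 * 1024 * 1024  # arbitrary, just keep it well under 5 GB

buf.seek(0)
with smart_open.s3.open(aws_bucket_name, key, 'wb', session=session) as so:
    while True:
        chunk = buf.read(CHUNK_SIZE)
        if not chunk:
            break
        so.write(chunk)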

Read more comments on GitHub >

Top Results From Across the Web

Optimize uploads of large files to Amazon S3 - AWS re:Post
I want to upload large files (1 GB or larger) to Amazon Simple Storage Service (Amazon S3). How can I optimize the performance...

Upload an object to an Amazon S3 bucket using an AWS SDK
The following code examples show how to upload an object to an S3 bucket. .NET. AWS SDK for .NET. Note. There's more on...

Uploading large files as stream to S3 in .NET without buffering
Our application generates large (20GB+) ZIP archives on demand to a stream. We would like to upload the stream as it is produced...

Direct to S3 File Uploads in Node.js - Heroku Dev Center
js application that uploads files directly to S3 instead of via a web application, utilising S3's Cross-Origin Resource Sharing (CORS) support.

Amazon S3 - Fluent Bit: Official Manual
The plugin allows you to specify a maximum file size, and a timeout for uploads. A file will be created in S3 when...
