Upload big io.BytesIO to S3
Problem description
I am requesting a set of files, zipping them, and then uploading the zipped data to S3 using smart_open and an io.BytesIO() object. The size of the compressed data exceeds the 5 GB S3 limit, and I know that in that case a multipart approach should be used (just like in boto3). I am using smart_open.s3.open() for this, but I do not completely understand how to configure the multipart upload to avoid the EntityTooLarge error. I keep getting the error when running my code. Should I split my file beforehand, or specify the number of parts? Checking the source code, I don't see a num_parts option.
(EntityTooLarge) when calling the UploadPart operation: Your proposed upload exceeds the maximum allowed size
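For reference, the plain-boto3 multipart setup I am comparing against looks roughly like this (just a sketch: the threshold and chunk sizes are arbitrary, and buf, session, aws_bucket_name and key refer to the names used in my function below):

from boto3.s3.transfer import TransferConfig

# boto3 switches to a multipart upload once the payload crosses
# multipart_threshold and sends parts of multipart_chunksize each,
# so no single part comes anywhere near the 5 GB per-part cap.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=256 * 1024 * 1024)

s3_client = session.client('s3')
buf.seek(0)
s3_client.upload_fileobj(buf, aws_bucket_name, key, Config=config)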
My function is the following:
import io
import logging
import multiprocessing
import zipfile
from datetime import datetime
from itertools import product

import boto3
import smart_open.s3

# datetime_range, URL and requests_to_s3 are helpers defined elsewhere in my module.


def stream_time_range_s3(start_date,
                         end_date,
                         aws_key,
                         aws_secret,
                         aws_bucket_name,
                         key,
                         max_workers,
                         delta):
    """
    Download the NOAA NARR .grb files between start_date and end_date,
    zip them in memory and upload the archive to S3.
    All dates in the NOAA NARR server are stored following this order:
    data
    ├── year/month
    ├── year/month/day01
    ├── year/month/day02
    Here we download the files between the user-defined start and end dates.
    Params:
        - start_date (datetime or 'YYYY-MM-DD' str): date to start download.
        - end_date (datetime or 'YYYY-MM-DD' str): date to stop download.
    """
    logger = logging.getLogger(__name__)

    if not isinstance(start_date, datetime):
        try:
            start_date = datetime.strptime(start_date, '%Y-%m-%d')
        except (TypeError, ValueError):
            raise ValueError(f'{start_date} is not in the correct format or not a valid type')

    session = boto3.Session(
        aws_access_key_id=aws_key,
        aws_secret_access_key=aws_secret
    )

    base_url = 'https://nomads.ncdc.noaa.gov/data/narr'
    times = ['0000', '0300', '0600', '0900', '1200', '1500', '1800', '2100']

    if delta is None:
        dates = datetime_range(start_date, end_date, {'days': 1})
    else:
        dates = datetime_range(start_date, end_date, delta)

    # Build the list of file URLs for every date/time combination.
    urls_time_range = []
    for day, time in product(dates, times):
        file_name = f'narr-a_221_{day.strftime("%Y%m%d")}_{time}_000.grb'
        url = URL(base_url, day.strftime('%Y%m'), day.strftime('%Y%m%d'))
        urls_time_range.append(str(URL(url, file_name)))

    with multiprocessing.Pool(max_workers) as p:
        results = p.map(requests_to_s3, urls_time_range, chunksize=1)
    print('Finish download')

    # Zip all downloaded files into a single in-memory buffer.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED, compresslevel=1) as zf:
        for content_file_name, content_file_result in results:
            try:
                zf.writestr(content_file_name, content_file_result)
            except Exception as exc:
                print(exc)
    print('Finish zipping - Upload Start')

    with smart_open.s3.open(aws_bucket_name, key, 'wb', session=session) as so:
        so.write(buf.getvalue())

    return None
You can test the function by running:
from datetime import datetime

a = stream_time_range_s3(start_date=datetime(2012, 1, 1),
                         end_date=datetime(2012, 2, 1),
                         aws_key=aws_key,
                         delta=None,
                         aws_secret=aws_secret,
                         aws_bucket_name=bucket_name,
                         key='wind_2012_test_parts.zip',
                         max_workers=10)
Versions
Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.1 (default, Feb 27 2019, 18:57:54)
[Clang 10.0.0 (clang-1000.10.44.4)]
smart_open 1.8.4
Comments: 10 (2 by maintainers)
smart_open’s promise is to handle large uploads (and downloads) transparently. So instead of raising an exception, isn’t it better to split the chunk into multipart pieces, each smaller than 5GB?
IIRC smart_open is already handling multipart uploads transparently under the hood, so this should be no different.
I started working on a solution, but it wasn’t a very trivial change given the way it’s currently coded. I ran out of time and abandoned the effort back when I posted. I’m not sure about the current status, but my solution was to simply chunk the calls to file.write(n), which was pretty trivial in my use case.
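For anyone who lands here with the same error, a minimal sketch of that workaround (assuming the same buf, session, aws_bucket_name and key as in the function above, and an arbitrary 250 MB chunk size):

# Feed smart_open the zipped data in fixed-size slices instead of a single
# buf.getvalue() call, so no individual part can exceed S3's 5 GB part limit.
chunk_size = 250 * 1024 * 1024  # arbitrary; anything comfortably below 5 GB works

buf.seek(0)
with smart_open.s3.open(aws_bucket_name, key, 'wb', session=session) as so:
    while True:
        chunk = buf.read(chunk_size)
        if not chunk:
            break
        so.write(chunk)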