s3 multipart upload doesn't complete correctly
Problem description
In a heavily concurrent environment, we noticed that our process leaves behind a lot of incomplete multipart uploads, and some files go missing without any exception being raised. I reviewed the smart_open code and there is no response check in the close call. However, the documentation at https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html states that the call can return 200 OK even when the operation fails, and that it is important to check the response body to determine whether the request succeeded.
I think the response should be checked; on failure, the multipart upload should be aborted and an exception raised to the caller.
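For illustration, a minimal sketch of such a check when calling boto3 directly (this is not smart_open's internal code; bucket, key, upload_id and the parts list are placeholders):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def complete_or_abort(bucket, key, upload_id, parts):
    # parts is a list like [{"PartNumber": 1, "ETag": "..."}, ...]
    try:
        response = s3.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except ClientError:
        # Completion failed outright: abort so the parts do not linger.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    # Per the API docs, a 200 status alone does not guarantee success,
    # so confirm the body actually describes the assembled object.
    if "ETag" not in response:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise RuntimeError("CompleteMultipartUpload did not confirm success for %s" % key)
    return response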
Steps/code to reproduce the problem
This issue is difficult to reproduce, as it appears randomly. We run the code in AWS Lambda with a parallelism of 1000. The function reads line-delimited JSON and writes JSON documents to multiple output files. One run can process a file successfully, while another run leaves incomplete multipart uploads behind without letting the code know that it is failing.
import re
from traceback import format_exc

from smart_open import smart_open

# Helpers such as open_as_csv, decompress_if_missing_extension and
# write_json_data are defined elsewhere in our code base.

def process_file_multiple_outputs(params):
    cnt = 0
    formatted_cnt = 0
    errors_cnt = 0
    formatter = params['formatter']
    compression = params['compression']
    file_id = params['src_file_id']
    src_file = params['src_file_name']
    dst_file = params['data_file_s3_path']
    error_file = params['error_file_s3_path']
    csv_format = True if params['csv_format'] == 'csv' or re.search(r".*\.csv(\..*|$)", src_file) else False
    dst_files = dict()
    try:
        with smart_open(src_file, "rb") as src:
            for line in open_as_csv(csv_format, decompress_if_missing_extension(compression, src_file, src)):
                cnt += 1
                errors, data = formatter(file_id, line.decode("utf-8").strip() if not csv_format else line, cnt)
                formatted_cnt += len(data)
                errors_cnt += len(errors)
                for table_name in data:
                    if table_name not in dst_files:
                        dst_files[table_name] = smart_open(dst_file + '-' + table_name + '.json.gz', "wb")
                    write_json_data(dst_files[table_name], data[table_name])
                if len(errors) != 0:
                    if 'ERRORS' not in dst_files:
                        dst_files['ERRORS'] = smart_open(error_file + '-errors' + '.json.gz', "wb")
                    write_json_data(dst_files['ERRORS'], errors)
    except Exception as e:
        result = "Failure"
        error_msg = format_exc()
        error_msg += "\n Error at line %d : %s" % (cnt, str(e))
    else:
        result = "Success"
        error_msg = None
    finally:
        for _, file in dst_files.items():
            file.close()
        return ({'src_file_id': file_id,
                 'src_num_lines': cnt,
                 'csv_num_lines': formatted_cnt,
                 'csv_error_lines': errors_cnt,
                 'csv_file_name': dst_file,
                 'result': result,
                 'error_msg': error_msg,
                 'is_processed': True})
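A minimal sketch of how the stranded uploads can be confirmed afterwards ("my-bucket" is a placeholder bucket name, not part of the reproduction code above):

import boto3

s3 = boto3.client("s3")
resp = s3.list_multipart_uploads(Bucket="my-bucket")
for upload in resp.get("Uploads", []):
    # Each entry is an upload that was started but never completed or aborted.
    print(upload["Key"], upload["UploadId"], upload["Initiated"])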
Versions
Python 3.6.8
smart-open==1.7.1
boto==2.49.0
boto3==1.9.8
botocore==1.12.8
Hi Michael,
That was funny; I think I have been working too much these days. We are looking into resolving this issue; if we find a solution and it belongs in smart_open, I will contribute it.
Valery
On Tue., Nov. 12, 2019, 4:35 p.m. Michael Penkov, notifications@github.com wrote:
What is the output of your script? In particular, what exceptions do you end up encountering? Seeing some stack traces would be helpful.
Also, I think you need to do better exception handling. One thing I would do is:
For the specific handling, there are several ideas:
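One option, sketched below on the assumption that dst_files maps table names to open smart_open file objects (as in the reproduction code above), is to close each output file in its own try/except, so a single failing close is neither silently swallowed nor able to prevent the remaining files from being closed:

close_errors = []
for table_name, fileobj in dst_files.items():
    try:
        fileobj.close()
    except Exception as exc:
        # Record the failure but keep closing the remaining files.
        close_errors.append((table_name, exc))
if close_errors:
    raise RuntimeError("failed to close output file(s): %r" % (close_errors,))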
Generally, if you suspect the problem happens when you’re trying to complete the multipart upload, then the “hard work” of uploading the parts to S3 is already done. All you need to do is assemble the parts in the correct order.
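As a hedged sketch of that recovery path (not something smart_open does automatically), one can list the parts of a stranded upload with boto3 and complete it in part-number order; bucket, key and upload_id are placeholders:

import boto3

s3 = boto3.client("s3")

def finish_stranded_upload(bucket, key, upload_id):
    # Collect every uploaded part, following pagination if there are many.
    parts = []
    kwargs = {"Bucket": bucket, "Key": key, "UploadId": upload_id}
    while True:
        page = s3.list_parts(**kwargs)
        parts.extend(
            {"PartNumber": p["PartNumber"], "ETag": p["ETag"]}
            for p in page.get("Parts", [])
        )
        if not page.get("IsTruncated"):
            break
        kwargs["PartNumberMarker"] = page["NextPartNumberMarker"]
    # Parts must be supplied in ascending part-number order.
    parts.sort(key=lambda p: p["PartNumber"])
    return s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )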