Slow S3 write for csv.gz file
Problem description
Be sure your description clearly answers the following questions:
- What are you trying to achieve? We are trying to move data (in memory) from a database to S3 (in csv.gz format) using smart_open.
- What is the expected result? We read 10 million rows at a time and write them back to S3. This write should not take more than 5 minutes.
- What are you seeing instead? The write is taking more than 2 hours.
Steps/code to reproduce the problem
import csv
import datetime
import smart_open

# session, bucket, key and the database cursor `cur` are set up earlier.
params = {'session': session}
with smart_open.open(f's3://{bucket}/{key}', 'w', encoding='utf-8',
                     transport_params=params) as fout:
    writer = csv.writer(fout,
                        delimiter='\x01',
                        quoting=csv.QUOTE_ALL,
                        escapechar='\\',
                        quotechar='`')
    print(f"{datetime.datetime.now()}: Initial fetch started.")
    rows = cur.fetchmany(10000000)
    row_count = 0
    print(f"{datetime.datetime.now()}: Downloaded initial data from the database.")
    while rows:
        print(f"{datetime.datetime.now()}: Data write start for csv for {len(rows)} rows.")
        writer.writerows(rows)
        row_count += len(rows)
        print(f"{datetime.datetime.now()}: {row_count} rows written to S3 csv gzip.")
        rows = cur.fetchmany(10000000)
        print(f"{datetime.datetime.now()}: Downloaded data from the database.")
    fout.close()  # redundant: the with-block already closes the stream
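As an aside on localizing the slowdown, the sketch below times gzip compression of one batch entirely in memory. time_gzip_only is a hypothetical helper, not part of the original report; it only assumes access to a rows batch from the same cursor and uses the standard library, so the measured time excludes both the database and the network.

import csv
import gzip
import io
import time

def time_gzip_only(rows):
    # Hypothetical helper: compress one batch of rows to gzip in memory,
    # so the timing excludes the database, the filesystem, and S3.
    start = time.perf_counter()
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        text = io.TextIOWrapper(gz, encoding='utf-8', newline='')
        writer = csv.writer(text, delimiter='\x01', quoting=csv.QUOTE_ALL,
                            escapechar='\\', quotechar='`')
        writer.writerows(rows)
        text.flush()
        text.detach()  # keep gz open; the with-block closes it and writes the trailer
    elapsed = time.perf_counter() - start
    print(f"gzip only: {len(rows)} rows -> {buf.tell()} bytes in {elapsed:.1f}s")

# e.g. call once per batch inside the loop above: time_gzip_only(rows)

If compression alone is fast for a 10-million-row batch, the time is going to the database fetch or the S3 upload instead.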
Versions
Linux-4.9.217-0.1.ac.205.84.332.metal1.x86_64-x86_64-with-redhat-5.3-Tikanga
Python 3.6.10 (default, Jul 12 2020, 20:42:51) [GCC 4.9.4]
smart_open 2.0.0
Issue Analytics
- Created 3 years ago
- Comments: 10 (5 by maintainers)
What happens if you drop the .gz (write only .csv)? Is writing still slow? Or if you store the output to a local file (not to S3)? I'm trying to localize what the issue may be: network connectivity, compression, DB…

Is the output file written correctly? When you download & unzip the generated .csv.gz from S3, does it open correctly?

smart_open comes with all benefits already enabled, so there's nothing you can do here (besides trying the options above to provide more clues).
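To make those experiments concrete, here is a minimal sketch. timed_write is a hypothetical helper, not code from the issue; it assumes the same session, csv dialect, and one rows batch from the reproduction above, and the target keys/paths in the comments are made up for illustration. Only the URI changes between runs, so the timings separate the cost of S3 transfer from the cost of gzip compression.

import csv
import datetime
import smart_open

def timed_write(uri, rows, transport_params=None):
    # Hypothetical helper: write one batch of rows to `uri` and report the
    # elapsed time. `uri` may be an s3:// URI or a local path, with or
    # without a .gz suffix, so the same code times every variant.
    start = datetime.datetime.now()
    kwargs = {'transport_params': transport_params} if transport_params else {}
    with smart_open.open(uri, 'w', encoding='utf-8', **kwargs) as fout:
        writer = csv.writer(fout, delimiter='\x01', quoting=csv.QUOTE_ALL,
                            escapechar='\\', quotechar='`')
        writer.writerows(rows)
    print(f"{uri}: {len(rows)} rows in {datetime.datetime.now() - start}")

# Suggested comparisons (bucket, session, rows as in the report above):
# timed_write(f's3://{bucket}/diagnostic_test.csv', rows, {'session': session})  # S3, no gzip
# timed_write('/tmp/diagnostic_test.csv.gz', rows)                               # gzip, no S3
# timed_write('/tmp/diagnostic_test.csv', rows)                                  # neither

Comparing the three timings against the original .csv.gz write to S3 should show whether the bottleneck is the network, the compression, or the database fetch.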