
Slow S3 write for csv.gz file

See original GitHub issue

Problem description

Be sure your description clearly answers the following questions:

  • What are you trying to achieve? We are trying to move data (in memory) from a database to S3 (in csv.gz format) using smart_open.
  • What is the expected result? We read 10M rows at a time and write them back to S3. Each write should take no more than 5 minutes.
  • What are you seeing instead? Each write is taking more than 2 hours.

Steps/code to reproduce the problem

import csv
import datetime

import smart_open

params = {'session': session}  # boto3 session
with smart_open.open(f's3://{bucket}/{key}', 'w', encoding='utf-8', transport_params=params) as fout:
    writer = csv.writer(fout,
                        delimiter='\x01',
                        quoting=csv.QUOTE_ALL,
                        escapechar='\\',
                        quotechar='`')
    print(f"{datetime.datetime.now()}: Initial fetch started.")
    rows = cur.fetchmany(10000000)
    row_count = 0
    print(f"{datetime.datetime.now()}: Downloaded initial data from the database.")

    while rows:
        print(f"{datetime.datetime.now()}: CSV write starting for {len(rows)} rows.")
        writer.writerows(rows)
        row_count += len(rows)
        print(f"{datetime.datetime.now()}: {row_count} rows written to S3 as gzipped CSV.")
        rows = cur.fetchmany(10000000)
        print(f"{datetime.datetime.now()}: Downloaded data from the database.")
# fout is closed automatically when the `with` block exits;
# an explicit fout.close() afterwards is redundant.
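To separate compression/CSV cost from network cost, the same loop can be pointed at a local .csv.gz using only the standard library. This is a sketch, not the reporter's code: `write_rows_gzip` and `row_batches` are hypothetical names, with `row_batches` standing in for successive `cur.fetchmany()` results.

```python
import csv
import gzip
import time

def write_rows_gzip(path, row_batches):
    """Write batches of rows to a local .csv.gz, timing each batch.

    Uses the same csv dialect as the snippet above. If this runs fast,
    the bottleneck is the network/S3 path rather than gzip or csv.
    """
    total = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout,
                            delimiter="\x01",
                            quoting=csv.QUOTE_ALL,
                            escapechar="\\",
                            quotechar="`")
        for rows in row_batches:
            start = time.monotonic()
            writer.writerows(rows)
            total += len(rows)
            print(f"wrote {len(rows)} rows in {time.monotonic() - start:.2f}s")
    return total
```

With a live cursor you would pass a generator such as `iter(lambda: cur.fetchmany(10000000), [])` as `row_batches`.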

Versions

Linux-4.9.217-0.1.ac.205.84.332.metal1.x86_64-x86_64-with-redhat-5.3-Tikanga
Python 3.6.10 (default, Jul 12 2020, 20:42:51) [GCC 4.9.4]
smart_open 2.0.0

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:10 (5 by maintainers)

Top GitHub Comments

2 reactions
piskvorky commented, Jul 15, 2020

What happens if you drop the .gz and write only .csv? Is writing still slow? Or if you store the output in a local file (not on S3)?

I’m trying to localize what the issue may be: network connectivity, compression, DB…

Is the output file written correctly? When you download and unzip the generated .csv.gz from S3, does it open correctly?
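The last question, whether the downloaded .csv.gz opens correctly, can be checked mechanically with the standard library. A minimal sketch, assuming the file was written with the dialect from the issue's snippet (`check_csv_gz` is a hypothetical helper name):

```python
import csv
import gzip

def check_csv_gz(path):
    """Decompress a local .csv.gz and parse it with the writing dialect.

    Returns the row count; raises an exception (e.g. a bad-gzip or
    csv.Error) if the archive is truncated or the CSV is malformed.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fin:
        reader = csv.reader(fin,
                            delimiter="\x01",
                            quoting=csv.QUOTE_ALL,
                            escapechar="\\",
                            quotechar="`")
        for _ in reader:
            count += 1
    return count
```

Comparing the returned count against `row_count` from the upload loop confirms no rows were silently dropped.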

1 reaction
piskvorky commented, Jul 15, 2020

smart_open comes with all benefits already enabled, so there’s nothing you can do here (besides trying the options above to provide more clues).

Read more comments on GitHub >

Top Results From Across the Web

How to Efficiently Transform a CSV File and Upload it in ...
This will be used to store data of compressed (gzipped) object which will be uploaded to S3. Line # 10 to 11: We...
Read more >
Optimize uploads of large files to Amazon S3 - AWS
I want to upload large files (1 GB or larger) to Amazon Simple Storage Service (Amazon S3). How can I optimize the performance...
Read more >
Reading a compressed CSV file from S3 - Stack Overflow
I'm supposed to read thousands of *.CSV files from S3 using Spark. These files have Content-Encoding of gzip as metadata in their properties ......
Read more >
Stream GZ File FROM S3, Decompressed and Upload to S3
There's 2 possibility. - Probably the s3fs or goofys or whatever fuse mount you are using is slow for the access data pattern...
Read more >
Dealing with Large gzip Files in Spark - Medium
Each date had multiple parts, and each part was a gzipped csv file ~4GB. ... data into smaller chunks (with spark dataframe)...
Read more >
