
Slow S3 write for csv.gz file

See original GitHub issue

Problem description

Be sure your description clearly answers the following questions:

  • What are you trying to achieve? We are trying to move data (in memory) from a database to S3 (in csv.gz format) using smart_open.
  • What is the expected result? We read 10M rows at a time and write them back to S3. Each write should take no more than 5 minutes.
  • What are you seeing instead? Each write is taking more than 2 hours.

Steps/code to reproduce the problem

import csv
import datetime

import smart_open

params = {'session': session}  # boto3 session
with smart_open.open(f's3://{bucket}/{key}', 'w', encoding='utf-8', transport_params=params) as fout:
    writer = csv.writer(fout,
                        delimiter='\x01',
                        quoting=csv.QUOTE_ALL,
                        escapechar='\\',
                        quotechar='`')
    print(f"{datetime.datetime.now()}: Initial fetch started.")
    rows = cur.fetchmany(10000000)
    row_count = 0
    print(f"{datetime.datetime.now()}: Downloaded initial data from the database.")

    while rows:
        print(f"{datetime.datetime.now()}: CSV write starting for {len(rows)} rows.")
        writer.writerows(rows)
        row_count += len(rows)
        print(f"{datetime.datetime.now()}: {row_count} rows written to S3 as gzipped CSV.")
        rows = cur.fetchmany(10000000)
        print(f"{datetime.datetime.now()}: Downloaded data from the database.")
# fout is closed automatically when the `with` block exits;
# an explicit fout.close() afterwards is redundant.
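To separate compression/CSV cost from network cost, the same loop can be pointed at a local .csv.gz using only the standard library. This is a sketch, not the reporter's code: `write_rows_gzip` and `row_batches` are hypothetical names, with `row_batches` standing in for successive `cur.fetchmany()` results.

```python
import csv
import gzip
import time

def write_rows_gzip(path, row_batches):
    """Write batches of rows to a local .csv.gz, timing each batch.

    Uses the same csv dialect as the snippet above. If this runs fast,
    the bottleneck is the network/S3 path rather than gzip or csv.
    """
    total = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout,
                            delimiter="\x01",
                            quoting=csv.QUOTE_ALL,
                            escapechar="\\",
                            quotechar="`")
        for rows in row_batches:
            start = time.monotonic()
            writer.writerows(rows)
            total += len(rows)
            print(f"wrote {len(rows)} rows in {time.monotonic() - start:.2f}s")
    return total
```

With a live cursor you would pass a generator such as `iter(lambda: cur.fetchmany(10000000), [])` as `row_batches`.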

Versions

Linux-4.9.217-0.1.ac.205.84.332.metal1.x86_64-x86_64-with-redhat-5.3-Tikanga
Python 3.6.10 (default, Jul 12 2020, 20:42:51) [GCC 4.9.4]
smart_open 2.0.0

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:10 (5 by maintainers)

Top GitHub Comments

2 reactions
piskvorky commented, Jul 15, 2020

What happens if you drop the .gz and write only .csv? Is writing still slow? Or if you store the output in a local file (not on S3)?

I’m trying to localize what the issue may be: network connectivity, compression, DB…

Is the output file written correctly? When you download and unzip the generated .csv.gz from S3, does it open correctly?
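The last question, whether the downloaded .csv.gz opens correctly, can be checked mechanically with the standard library. A minimal sketch, assuming the file was written with the dialect from the issue's snippet (`check_csv_gz` is a hypothetical helper name):

```python
import csv
import gzip

def check_csv_gz(path):
    """Decompress a local .csv.gz and parse it with the writing dialect.

    Returns the row count; raises an exception (e.g. a bad-gzip or
    csv.Error) if the archive is truncated or the CSV is malformed.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fin:
        reader = csv.reader(fin,
                            delimiter="\x01",
                            quoting=csv.QUOTE_ALL,
                            escapechar="\\",
                            quotechar="`")
        for _ in reader:
            count += 1
    return count
```

Comparing the returned count against `row_count` from the upload loop confirms no rows were silently dropped.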

1 reaction
piskvorky commented, Jul 15, 2020

smart_open comes with all benefits already enabled, so there’s nothing you can do here (besides trying the options above to provide more clues).

Read more comments on GitHub >

Top Results From Across the Web

How to Efficiently Transform a CSV File and Upload it in ...
This will be used to store data of compressed (gzipped) object which will be uploaded to S3. Line # 10 to 11: We...
Read more >
Optimize uploads of large files to Amazon S3 - AWS
I want to upload large files (1 GB or larger) to Amazon Simple Storage Service (Amazon S3). How can I optimize the performance...
Read more >
Reading a compressed CSV file from S3 - Stack Overflow
I'm supposed to read thousands of *.CSV files from S3 using Spark. These files have Content-Encoding of gzip as metadata in their properties ......
Read more >
Stream GZ File FROM S3, Decompressed and Upload to S3
There's 2 possibility. - Probably the s3fs or goofys or whatever fuse mount you are using is slow for the access data pattern...
Read more >
Dealing with Large gzip Files in Spark - Medium
Each date had multiple parts, and each part was a gzipped csv file ~4GB. ... data into smaller chunks (with spark dataframe)...
Read more >
