s3.read_csv slow with chunksize

Describe the bug

I’m not sure that s3.read_csv really reads a CSV in chunks. For relatively large files, the following call takes an abnormally long time:

it = wr.s3.read_csv(uri, chunksize=chunksize)

I think the chunksize parameter is ignored.

To Reproduce

I’m running awswrangler==1.1.2 (installed with poetry) but I quickly tested 1.6.3 and it seems the issue is there too.

from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

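# S3 URI of the CSV file (left empty in this report)
uri = ""

# Number of lines to read per chunk
CHUNKSIZE = 100

def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Stream the object and pull only the first `chunksize` lines (header included)
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
        df = pd.read_csv(StringIO(chunk))

    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for an iterator of DataFrames and take only the first chunk
    it = wr.s3.read_csv(uri, chunksize=chunksize)

    df = next(it)

    return df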

I compared two ways of loading the first 100 lines of a “big” (1.2 GB) CSV file from S3:

  • streaming the file with the equivalent of open(file, "r") and parsing only the first lines as a CSV string
  • using s3.read_csv with chunksize=100.

Results:

In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms

In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s

In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s

The timings are more or less reproducible. Comparing the last two, I suspect that the chunksize parameter is ignored: loading the first 100 lines of the file takes about as long as reading the whole file.

Is it expected?
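
For reference, the manual approach above can be extended to yield every chunk lazily rather than only the first one. This is a minimal sketch using only the standard library, not how awswrangler implements chunksize. The helper name iter_csv_text_chunks is hypothetical, and it assumes no quoted field contains an embedded newline, since it splits on physical lines. Each yielded block repeats the header, so it could be passed to pd.read_csv(StringIO(block)) as in manual_chunking.

```python
from io import StringIO
from itertools import islice
from typing import Iterator, TextIO


def iter_csv_text_chunks(stream: TextIO, chunksize: int) -> Iterator[str]:
    """Yield CSV text blocks holding the header plus at most `chunksize` data lines.

    Hypothetical helper for illustration. Lines are pulled from `stream`
    only when the next block is requested, so nothing is read ahead.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")

    header = stream.readline()
    if not header:
        return  # empty input: nothing to yield

    while True:
        # islice consumes exactly `chunksize` lines (or fewer at the end)
        lines = list(islice(stream, chunksize))
        if not lines:
            return
        yield header + "".join(lines)


# Small in-memory stand-in for a file opened from S3
demo = StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = iter_csv_text_chunks(demo, chunksize=2)
first_block = next(chunks)  # reads the header and two data lines only
```

With a streaming file object such as the one smart_open returns, requesting only the first block should transfer only the beginning of the object, which is consistent with the sub-second manual timing above.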

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
igorborgest commented, Jul 30, 2020

Released in 1.7.0!

1 reaction
igorborgest commented, Jul 21, 2020

I read that it “will override the regular default arguments configured in the function signature.” though?

Yes, that was the only original behavior, but I will now update it to mention that it can also set internal (not exposed) configurations. Does that make sense?
