s3.read_csv slow with chunksize
Describe the bug
I’m not sure the s3.read_csv function really reads a CSV in chunks. I noticed that for relatively big dataframes, running the following instruction takes an abnormally long time:
it = wr.s3.read_csv(uri, chunksize=chunksize)
I think the chunksize parameter is ignored.
To Reproduce
I’m running awswrangler==1.1.2 (installed with Poetry), but I quickly tested 1.6.3 and the issue seems to be present there too.
from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

uri = ""
CHUNKSIZE = 100


def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
    df = pd.read_csv(StringIO(chunk))
    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    it = wr.s3.read_csv(uri, chunksize=chunksize)
    df = next(it)
    return df
I compared two different ways to load the first 100 lines of a “big” (1.2 GB) dataframe from S3:
- with the equivalent of open(file, "r") and then lazily parsing the lines as a CSV string
- using s3.read_csv with chunksize=100.
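To reproduce the comparison outside IPython, a plain timing harness along these lines can be used (a minimal sketch, reusing the functions and uri defined above):

import time


def time_call(fn, uri: str) -> None:
    # Rough wall-clock timing of a single call, analogous to IPython's %time.
    start = time.perf_counter()
    fn(uri)
    print(f"{fn.__name__}: {time.perf_counter() - start:.1f} s")


time_call(manual_chunking, uri)
time_call(s3_chunking, uri)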
Results:
In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms
In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s
In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s
The timings are more or less reproducible. After comparing the last two timings, I suspect that the chunksize parameter is ignored: loading 100 lines of the file takes more or less the same amount of time as reading the full file. Is this expected?
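For comparison, a truly lazy chunked read can be approximated by streaming the object with boto3 and handing the stream to pandas directly. This is only a sketch, assuming uri is an s3://bucket/key path; parse_s3_uri is a hypothetical helper, not part of awswrangler:

import boto3
import pandas as pd


def parse_s3_uri(uri: str):
    # Hypothetical helper: split "s3://bucket/key" into (bucket, key).
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def boto3_chunking(uri: str, chunksize: int = 100) -> pd.DataFrame:
    bucket, key = parse_s3_uri(uri)
    # get_object returns a streaming body; pandas pulls bytes from it on
    # demand, so only the data needed for the first chunk is downloaded.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    reader = pd.read_csv(body, chunksize=chunksize)
    return next(reader)

In principle this should behave like the smart_open variant, since both stop downloading shortly after the first 100 rows.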
Released in 1.7.0!
Yes, that was the only original behavior, but now I will update it and mention that it can also set internal/not-exposed configurations. Does that make sense?
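With the fix released in 1.7.0, the iterator returned by wr.s3.read_csv with chunksize should yield DataFrames of at most chunksize rows without reading the whole object up front, e.g.:

import awswrangler as wr

# Iterate over the CSV in 100-row chunks (awswrangler >= 1.7.0).
for chunk in wr.s3.read_csv(uri, chunksize=100):
    print(len(chunk))  # each chunk is a regular pandas DataFrame
    break              # stop after the first chunk for a quick check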