Support chunksize in I/O functionalities
See original GitHub issue

Describe the problem

chunksize currently falls back to the pandas implementation in all cases. Ideally, we would support it natively. The functionality should not require deep analysis: we can read the full result with chunksize=None and then mask (slice) that result into chunks.

Relevant conversation in #909.
Source code / logs
Proposed solution for read_sql:
import inspect

# DataFrame and BaseFactory are Modin internals, imported as in the
# surrounding module.


def read_sql(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    columns=None,
    chunksize=None,
):
    # Capture all arguments as a dict so they can be forwarded as-is.
    _, _, _, kwargs = inspect.getargvalues(inspect.currentframe())
    # Read the full result without chunking, then slice it afterwards.
    kwargs["chunksize"] = None
    df = DataFrame(query_compiler=BaseFactory.read_sql(**kwargs))
    if chunksize is not None:
        # Ceiling division: the last chunk may hold fewer than chunksize rows.
        num_chunks = (len(df) + chunksize - 1) // chunksize
        return (
            df.iloc[i * chunksize : (i + 1) * chunksize]
            for i in range(num_chunks)
        )
    else:
        return df
Issue Analytics
- Created: 4 years ago
- Comments: 16 (7 by maintainers)
Top GitHub Comments
PRs are always welcomed @manesioz 😃
I believe the yield-based approach fits the chunking principle of pandas better. For instance, querying in small chunks should result in a quick response from the database during the first iteration. It is expected that each iteration starts with a delay while fetching its respective chunk; waiting a long time for the first iteration to start is not the expected behavior with a small chunksize. Moreover, the user may decide to stop iterating at some point, and that saves nothing if everything was already loaded into memory.

Why don't we try to implement it with fetchmany? It may not work, but it is worth trying. Otherwise, the right approach would be to use @eavidan's approach and throw a warning about the order.
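A fetchmany-based variant could look like the sketch below. It uses sqlite3 purely for illustration; read_sql_chunks is a hypothetical helper, and a real implementation would also need to handle index_col, parse_dates, SQLAlchemy connections, and so on:

```python
import sqlite3

import pandas as pd


def read_sql_chunks(sql, con, chunksize):
    """Lazily yield DataFrames of up to chunksize rows via DB-API fetchmany.

    Unlike reading everything up front, the first chunk is available as soon
    as the database returns the first chunksize rows.
    """
    cur = con.execute(sql)
    cols = [d[0] for d in cur.description]  # column names from the cursor
    while True:
        rows = cur.fetchmany(chunksize)
        if not rows:
            break
        yield pd.DataFrame(rows, columns=cols)


# Tiny in-memory table to demonstrate the chunk boundaries.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])

sizes = [len(chunk) for chunk in read_sql_chunks("SELECT x FROM t", con, 3)]
print(sizes)  # → [3, 3, 1]
```

Because the generator pulls rows on demand, stopping iteration early also stops fetching from the database, which addresses the memory concern raised above.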