
Support chunksize in I/O functionalities

See original GitHub issue

Describe the problem

Passing chunksize currently falls back to the pandas implementation in all cases. Ideally, we will be able to support it natively. The functionality should not require deep analysis: we can simply mask the result of reading with chunksize=None.

Relevant conversation in #909.
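
For reference, this is the pandas behavior being masked: when chunksize is set, pandas readers return an iterator of DataFrames rather than a single frame (my_table, con, and process are placeholders here):

import pandas as pd

# pandas.read_sql with chunksize yields one DataFrame per chunk,
# fetched lazily as the iterator is consumed.
for chunk in pd.read_sql("SELECT * FROM my_table", con, chunksize=1000):
    process(chunk)  # hypothetical per-chunk handler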

Source code / logs

Proposed solution for read_sql:

import inspect

def read_sql(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    columns=None,
    chunksize=None,
):
    # Capture all of the arguments as a kwargs dict (the frame's locals).
    _, _, _, kwargs = inspect.getargvalues(inspect.currentframe())
    # Always read the full result eagerly; chunking is applied afterwards.
    kwargs["chunksize"] = None
    # DataFrame and BaseFactory are Modin internals (the I/O dispatcher).
    df = DataFrame(query_compiler=BaseFactory.read_sql(**kwargs))
    if chunksize is not None:
        # Ceiling division: the last chunk may hold fewer than chunksize rows.
        num_chunks = (
            len(df) // chunksize
            if len(df) % chunksize == 0
            else len(df) // chunksize + 1
        )
        return (
            df.iloc[i * chunksize : (i + 1) * chunksize]
            for i in range(num_chunks)
        )
    else:
        return df
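
With this masking approach the caller keeps the familiar pandas-style iterator, even though the whole result was materialized up front (my_table and con are placeholders):

for chunk in read_sql("SELECT * FROM my_table", con, chunksize=1000):
    print(len(chunk))  # every chunk except possibly the last has chunksize rows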

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

2 reactions
eavidan commented, Dec 31, 2019

PRs are always welcome, @manesioz 😃

I believe the yield-based approach fits the chunking principle of pandas better. For instance, querying in small chunks should result in a quick response from the database on the first iteration. It is expected that each iteration starts with a delay while its respective chunk is fetched, but waiting a long time for the first iteration to start is not expected behavior with a small chunksize. Moreover, the user may decide to stop iterating at some point, which saves nothing if everything was already loaded into memory.
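
A minimal sketch of the yield-based approach, delegating the chunked fetching to pandas so the first chunk arrives quickly; the wrap parameter stands in for the pandas-to-Modin conversion and is an identity placeholder here, not the actual implementation:

import pandas

def read_sql_chunked(sql, con, chunksize, wrap=lambda df: df, **kwargs):
    # Each pandas chunk is fetched lazily from the database as the
    # generator is consumed; `wrap` would turn it into a Modin DataFrame.
    for chunk in pandas.read_sql(sql, con, chunksize=chunksize, **kwargs):
        yield wrap(chunk)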

0 reactions
devin-petersohn commented, Jan 9, 2020

Why don’t we try to implement it with fetchmany? It may not work, but it will be worth trying. Otherwise, the right approach would be to use @eavidan’s approach and throw a warning about the ordering.
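
A rough sketch of the fetchmany idea against a raw DBAPI connection; it bypasses pandas’ SQL layer entirely, which is why it may not work for every kind of con (e.g. SQLAlchemy engines or connection strings):

import pandas

def read_sql_fetchmany(sql, con, chunksize):
    # Assumes a DBAPI-style connection exposing cursor().
    cursor = con.cursor()
    cursor.execute(sql)
    # Column names come from the cursor metadata.
    columns = [desc[0] for desc in cursor.description]
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield pandas.DataFrame(rows, columns=columns)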

Read more comments on GitHub.

Top Results From Across the Web

Efficient Pandas: Using Chunksize for Large Datasets
But for this article, we shall use the pandas chunksize attribute or get_chunk() function. Imagine for a second that you’re working on a...

Optimal chunksize parameter in pandas.DataFrame.to_sql
According to the observations in this article (acepor.github.io/2017/08/03/using-chunksize), setting the chunksize to 10000 seems to be optimal.

Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once, using Pandas’ chunksize option.

How to Load a Massive File as small chunks in Pandas?
chunksize: int, optional. Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

IO tools (text, CSV, HDF5, …) — pandas 1.5.2 documentation
The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that ... New in version 1.5.0: Support for...
