Support chunksize in I/O functionalities
See original GitHub issue

Describe the problem

chunksize currently falls back to the pandas implementation in all cases. Ideally, we would support it natively. The functionality should not require deep analysis: we can read the full result with chunksize=None and then mask (slice) that result into chunks.

Relevant conversation in #909.
Source code / logs
Proposed solution for read_sql:
import inspect

# DataFrame and BaseFactory are Modin internals, imported as in the
# surrounding module.


def read_sql(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    columns=None,
    chunksize=None,
):
    # Capture all arguments as a dict so they can be forwarded as-is.
    _, _, _, kwargs = inspect.getargvalues(inspect.currentframe())
    # Read the full result without chunking, then slice it afterwards.
    kwargs["chunksize"] = None
    df = DataFrame(query_compiler=BaseFactory.read_sql(**kwargs))
    if chunksize is not None:
        # Ceiling division: the last chunk may hold fewer than chunksize rows.
        num_chunks = (len(df) + chunksize - 1) // chunksize
        return (
            df.iloc[i * chunksize : (i + 1) * chunksize]
            for i in range(num_chunks)
        )
    else:
        return df
Issue Analytics
- Created: 4 years ago
- Comments: 16 (7 by maintainers)
Top GitHub Comments
PRs are always welcomed @manesioz 😃
I believe the yield-based approach fits the chunking principle of pandas better. For instance, querying in small chunks should result in a quick response from the database during the first iteration. It is expected that each iteration starts with a delay while fetching its respective chunk; waiting a long time for the first iteration to start is not the expected behavior with a small chunksize. Moreover, the user may decide to stop iterating at some point, and that saves nothing if everything was already loaded into memory.

Why don't we try to implement it with fetchmany? It may not work, but it is worth trying. Otherwise, the right approach would be to use @eavidan's approach and throw a warning about the order.
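A fetchmany-based variant could look like the sketch below. It uses sqlite3 purely for illustration; read_sql_chunks is a hypothetical helper, and a real implementation would also need to handle index_col, parse_dates, SQLAlchemy connections, and so on:

```python
import sqlite3

import pandas as pd


def read_sql_chunks(sql, con, chunksize):
    """Lazily yield DataFrames of up to chunksize rows via DB-API fetchmany.

    Unlike reading everything up front, the first chunk is available as soon
    as the database returns the first chunksize rows.
    """
    cur = con.execute(sql)
    cols = [d[0] for d in cur.description]  # column names from the cursor
    while True:
        rows = cur.fetchmany(chunksize)
        if not rows:
            break
        yield pd.DataFrame(rows, columns=cols)


# Tiny in-memory table to demonstrate the chunk boundaries.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])

sizes = [len(chunk) for chunk in read_sql_chunks("SELECT x FROM t", con, 3)]
print(sizes)  # → [3, 3, 1]
```

Because the generator pulls rows on demand, stopping iteration early also stops fetching from the database, which addresses the memory concern raised above.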