
[BUG] `select from table limit` reads the full dataset and persists in memory.

See original GitHub issue

What happened: When performing a `SELECT * FROM table LIMIT 10` on a table read in via Parquet, I notice the full dataset being read and persisted at query execution.

What you expected to happen: Nothing should happen at query execution, and when the user does decide to persist/compute the result, only the relevant subset of the data should be read in.

Minimal Complete Verifiable Example:

from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cudf
import dask_cudf
from dask_sql import Context
import dask

write_data = False

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    c = Context()
    
    if write_data:
        dask.datasets.timeseries(start="2022-01-01", end="2024-01-01").to_parquet("test_data.parquet")


    ddf = dask_cudf.read_parquet("test_data.parquet")
    c.create_table("test", ddf, persist=False)

    # This persists the whole dataset in memory: even though `len(res) == 10`, all of the data stays resident
    res = c.sql("SELECT * from test LIMIT 10")
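    # For comparison (illustrative, not part of the original report): the
    # plain dask_cudf path stays lazy, and head() only reads from the
    # leading partition(s) when finally computed.
    lazy_head = ddf.head(10, compute=False)  # no persist, nothing read yet
    subset = lazy_head.compute()             # pulls just the rows it needs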

Anything else we need to know?:

Environment:

  • dask-sql version: 2022.1.0
  • Python version: 3.8
  • Operating System: ubuntu 18.04
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
ayushdg commented, Jan 28, 2022

Thanks for that context. I think we should be good to close this issue, but we should raise another issue in dask-sql to explore operations where we could leverage `iloc` and/or just precompute `partition_sizes` instead of persisting, once that PR lands.
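As a rough sketch of that idea (this is not dask-sql's actual implementation, and since Dask frames don't support row-wise `iloc`, partition slicing plus `head` stands in for it here), a LIMIT could be served by counting rows per partition and then reading only the leading partitions:

import itertools

import dask.datasets


def limit(ddf, n, sizes=None):
    # Per-partition row counts: much cheaper than persisting the data,
    # though it still costs one counting pass when not supplied.
    if sizes is None:
        sizes = tuple(ddf.map_partitions(len).compute())
    # How many leading partitions are needed to cover the first n rows.
    needed = next(
        (i + 1 for i, total in enumerate(itertools.accumulate(sizes)) if total >= n),
        ddf.npartitions,
    )
    # Slice off just those partitions and take the head lazily.
    return ddf.partitions[:needed].head(n, npartitions=-1, compute=False)


ddf = dask.datasets.timeseries(start="2022-01-01", end="2022-02-01")
res = limit(ddf, 10)  # lazy; only the leading partition(s) are ever read

The same approach works unchanged on a dask_cudf frame, since it shares the Dask DataFrame API.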

1 reaction
charlesbluca commented, Jan 28, 2022

Yeah, I think even some of the smaller changes being discussed there (such as the addition of a pre-computed `partition_sizes` attribute) could be used to reduce the number of cases where we would be forced to partially or fully persist a frame.
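To make that concrete (the `partition_sizes` attribute is hypothetical here; at the time it was only a proposal), cached counts would let the `limit` helper sketched above serve repeated LIMIT queries with no extra passes over the data:

# One counting pass, cached; this mimics the proposed partition_sizes
# attribute, which does not exist in released Dask.
sizes = tuple(ddf.map_partitions(len).compute())

res10 = limit(ddf, 10, sizes)      # no persist, no recount
res1000 = limit(ddf, 1000, sizes)  # reuses the cached counts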


