
[BUG] `select from table limit` reads the full dataset and persists in memory.

See original GitHub issue

What happened: When performing a `SELECT * FROM table LIMIT 10` on a table read in via Parquet, I notice the full dataset being read and persisted at query execution.

What you expected to happen: Nothing should happen at query execution, and when the user does decide to persist/compute the result, only the relevant subset of the data should be read in.

Minimal Complete Verifiable Example:

from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cudf
import dask_cudf
from dask_sql import Context
import dask

write_data = False

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    c = Context()
    
    if write_data:
        dask.datasets.timeseries(start="2022-01-01", end="2024-01-01").to_parquet("test_data.parquet")


    ddf = dask_cudf.read_parquet("test_data.parquet")
    c.create_table("test", ddf, persist=False)

    # This persists the whole dataset in memory: even though `len(res) == 10`, all of the data stays resident
    res = c.sql("SELECT * from test LIMIT 10")
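    # For comparison (illustrative, not part of the original report): the
    # plain dask_cudf path stays lazy, and head() only reads from the
    # leading partition(s) when finally computed.
    lazy_head = ddf.head(10, compute=False)  # no persist, nothing read yet
    subset = lazy_head.compute()             # pulls just the rows it needs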

Anything else we need to know?:

Environment:

  • dask-sql version: 2022.1.0
  • Python version: 3.8
  • Operating System: ubuntu 18.04
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
ayushdg commented, Jan 28, 2022

Thanks for that context. I think we should be good to close this issue, but we should raise another issue in dask-sql to explore operations where we could leverage `iloc` and/or just precompute `partition_sizes` instead of persisting, once that PR lands.
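As a rough sketch of that idea (this is not dask-sql's actual implementation, and since Dask frames don't support row-wise `iloc`, partition slicing plus `head` stands in for it here), a LIMIT could be served by counting rows per partition and then reading only the leading partitions:

import itertools

import dask.datasets


def limit(ddf, n, sizes=None):
    # Per-partition row counts: much cheaper than persisting the data,
    # though it still costs one counting pass when not supplied.
    if sizes is None:
        sizes = tuple(ddf.map_partitions(len).compute())
    # How many leading partitions are needed to cover the first n rows.
    needed = next(
        (i + 1 for i, total in enumerate(itertools.accumulate(sizes)) if total >= n),
        ddf.npartitions,
    )
    # Slice off just those partitions and take the head lazily.
    return ddf.partitions[:needed].head(n, npartitions=-1, compute=False)


ddf = dask.datasets.timeseries(start="2022-01-01", end="2022-02-01")
res = limit(ddf, 10)  # lazy; only the leading partition(s) are ever read

The same approach works unchanged on a dask_cudf frame, since it shares the Dask DataFrame API.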

1 reaction
charlesbluca commented, Jan 28, 2022

Yeah, I think even some of the smaller changes being discussed there (such as the addition of a pre-computed `partition_sizes` attribute) could be used to reduce the number of cases where we would be forced to partially or fully persist a frame.
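To make that concrete (the `partition_sizes` attribute is hypothetical here; at the time it was only a proposal), cached counts would let the `limit` helper sketched above serve repeated LIMIT queries with no extra passes over the data:

# One counting pass, cached; this mimics the proposed partition_sizes
# attribute, which does not exist in released Dask.
sizes = tuple(ddf.map_partitions(len).compute())

res10 = limit(ddf, 10, sizes)      # no persist, no recount
res1000 = limit(ddf, 1000, sizes)  # reuses the cached counts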


