[BUG] `select from table limit` reads the full dataset and persists it in memory.
What happened:
When running `SELECT * FROM table LIMIT 10` against a table read in from Parquet, I notice the full dataset being read and persisted at query execution.
What you expected to happen: Nothing should be read at query execution, and when the user does decide to persist/compute the result, only the relevant subset of the data should be read in.
Minimal Complete Verifiable Example:

```python
from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cudf
import dask_cudf
from dask_sql import Context
import dask

write_data = False

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    c = Context()

    if write_data:
        dask.datasets.timeseries(start="2022-01-01", end="2024-01-01").to_parquet("test_data.parquet")

    ddf = dask_cudf.read_parquet("test_data.parquet")
    c.create_table("test", ddf, persist=False)

    # This persists the whole dataset in memory: even though the query
    # has LIMIT 10, all of the data stays resident after execution
    res = c.sql("SELECT * from test LIMIT 10")
```
Anything else we need to know?:
Environment:
- dask-sql version: 2022.1.0
- Python version: 3.8
- Operating System: ubuntu 18.04
- Install method (conda, pip, source): conda
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7
Top GitHub Comments
Thanks for that context. I think we should be good to close this issue, but should raise another issue in dask-sql to explore operations where we could leverage `iloc` and/or just precompute `partition_sizes` instead of persisting, once that PR lands.

Yeah, I think even some smaller changes being discussed there (such as the addition of a pre-computed `partition_sizes` attribute) could be used to reduce the number of cases where we would be forced to partially/fully persist a frame.
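The idea behind a pre-computed `partition_sizes` attribute can be sketched in plain Python (the function name and shape here are hypothetical, not dask-sql API): once the row count of each partition is known up front, `LIMIT n` only needs to load the leading partitions whose cumulative size covers `n`, with no need to persist the whole frame.

```python
from itertools import accumulate

def partitions_needed_for_limit(partition_sizes, n):
    """Return how many leading partitions cover the first n rows.

    partition_sizes plays the role of a pre-computed per-partition
    row count; only the returned prefix of partitions would need
    to be read to answer `LIMIT n`.
    """
    for i, total in enumerate(accumulate(partition_sizes), start=1):
        if total >= n:
            return i
    # Fewer than n rows exist overall: every partition is needed.
    return len(partition_sizes)

# Example: 8 partitions of 250 rows each.
sizes = [250] * 8
print(partitions_needed_for_limit(sizes, 10))   # 1 partition for LIMIT 10
print(partitions_needed_for_limit(sizes, 600))  # 3 partitions for LIMIT 600
```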