[DF] select * limit 5 seems to do a full scan
I’m struggling to find a programmatic reproducer for this, but on the datafusion-sql-planner branch:
c.sql("SELECT * FROM large_table limit 5")
results in reading the entire dataset before filtering at the end, instead of reading from a single partition.
Less reproducible, but from the daily weather data:
res = c.sql("select * from weather limit 5")
# The first layer in the high-level graph is the IO (read) layer
io_layer = list(res.dask.layers.keys())[0]
# The number of keys in that layer is the number of partitions it reads
partitions = len(list(res.dask.layers[io_layer].keys()))
partitions
445
If I understand layers correctly, my select w/ limit statement is reading all partitions in the dataset.
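For comparison, a minimal sketch of the expected graph shape, using dask's DataFrame.head with its npartitions argument (dask.datasets.timeseries here is just a stand-in dataset, not the weather data above):

import dask.datasets

ddf = dask.datasets.timeseries()  # demo frame spanning many partitions
# head() with npartitions=1 builds a graph over a single input partition,
# which is the shape a LIMIT 5 plan would ideally have
small = ddf.head(5, npartitions=1, compute=False)
small.npartitions
1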
Top GitHub Comments
That’s a great point that I hadn’t considered! Knowing that, I think we can probably update our current method for computing LIMITs on main, which, as @ayushdg noted, has a chance of forcing 2 full graph computes:
https://github.com/dask-contrib/dask-sql/blob/3ac2f6df6f806e2c2fc5a9ff4b4bfbaa84d931cb/dask_sql/physical/rel/logical/limit.py#L53-L58
If we replace the assignment of first_partition_length with a call to compute partition_borders, then we will have all the partition lengths and can guarantee that we compute the limit with df.head(...). I currently have a small WIP that seems to be passing; I can open a PR to continue discussion there 🙂
Sounds good! I’m also down for continuing the discussion there - I’m not sure why the failures were happening on my end, so it will be helpful to compare.
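For reference, a minimal sketch of the partition-borders approach described in the comment above, assuming a hypothetical limit_frame helper (illustrative only; the actual dask-sql WIP may differ):

import numpy as np

def limit_frame(df, n):
    # Compute every partition's length in one pass, rather than
    # materializing only the first partition's length
    lengths = df.map_partitions(len).compute()
    partition_borders = np.cumsum(lengths)
    # How many leading partitions are needed to cover n rows
    needed = int(np.searchsorted(partition_borders, n, side="left")) + 1
    # Restricting head() to those partitions avoids a second full graph compute
    return df.head(n, npartitions=min(needed, df.npartitions), compute=False)

# e.g. limit_frame(c.sql("select * from weather"), 5)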