
[DF] select * limit 5 seems to do a full scan

See original GitHub issue

I’m struggling to find a programmatic reproducer for this, but on the datafusion-sql-planner branch:

c.sql("SELECT * FROM large_table limit 5")

results in reading the entire dataset before filtering at the end, instead of reading from a single partition.

Less reproducible, but from the daily weather data:

res = c.sql("select * from weather limit 5")
# grab the I/O layer from the result's task graph
io_layer = list(res.dask.layers.keys())[0]
# count how many partitions that layer reads
partitions = len(list(res.dask.layers[io_layer].keys()))
partitions
445

If I understand layers correctly, my select w/ limit statement is reading all partitions in the dataset.
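For comparison, a hypothetical sketch (not from the issue): with a plain Dask DataFrame, taking a few rows only needs the first partition by default, which is the behavior one would hope the SQL LIMIT maps to:

import dask.dataframe as dd
import pandas as pd

# Hypothetical toy dataset, only to illustrate the expected behavior.
ddf = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=10)

# head() reads a single partition by default (npartitions=1),
# whereas the "select * from weather limit 5" above appears to
# materialize all 445 partitions before trimming to 5 rows.
ddf.head(5)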

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7

Top GitHub Comments

1 reaction
charlesbluca commented, Aug 16, 2022

although we can also get the length of each partition with partition_borders = df.map_partitions(lambda x: len(x)).divisions

That’s a great point that I hadn’t considered! Knowing that, I think we can probably update our current method for computing LIMITs on main, which, as @ayushdg noted, has a chance of forcing 2 full graph computes:

https://github.com/dask-contrib/dask-sql/blob/3ac2f6df6f806e2c2fc5a9ff4b4bfbaa84d931cb/dask_sql/physical/rel/logical/limit.py#L53-L58

If we replace the assignment of first_partition_length with a call to compute partition_borders, then we will have all the partition lengths and guarantee that we can compute the limit with df.head(...).
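A rough sketch of that idea (illustrative only, with a hypothetical apply_limit helper; not the actual dask-sql code in limit.py) might look like this, assuming a Dask DataFrame df and an integer limit:

import numpy as np

def apply_limit(df, limit):
    # One compute to get every partition's length
    # (the "partition_borders" idea from the quoted comment).
    partition_lengths = df.map_partitions(len).compute()
    cumulative = np.cumsum(partition_lengths)
    # Smallest number of leading partitions whose combined length covers the limit.
    npartitions = min(int(np.searchsorted(cumulative, limit)) + 1, df.npartitions)
    # head() then only has to touch those partitions.
    return df.head(limit, npartitions=npartitions, compute=False)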

Update: never mind, it looks like that change causes some tests in test_join.py and test_select.py to fail.

I currently have a small WIP that seems to be passing; I can open a PR to continue the discussion there 🙂

0 reactions
sarahyurick commented, Aug 16, 2022

Sounds good! I’m also down for continuing the discussion there - I’m not sure why the failures were happening on my end, so it will be helpful to compare.


