
[DF] select * limit 5 seems to do a full scan

See original GitHub issue

I’m struggling to find a programmatic reproducer for this, but on the datafusion-sql-planner branch:

c.sql("SELECT * FROM large_table limit 5")

results in reading the entire dataset before filtering at the end, instead of reading from a single partition.

Less reproducible, but from the daily weather data:

res = c.sql("select * from weather limit 5")
# grab the I/O layer from the result's task graph
io_layer = list(res.dask.layers.keys())[0]
# count how many partitions that layer reads
partitions = len(list(res.dask.layers[io_layer].keys()))
partitions
445

If I understand layers correctly, my select w/ limit statement is reading all partitions in the dataset.
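For comparison, a hypothetical sketch (not from the issue): with a plain Dask DataFrame, taking a few rows only needs the first partition by default, which is the behavior one would hope the SQL LIMIT maps to:

import dask.dataframe as dd
import pandas as pd

# Hypothetical toy dataset, only to illustrate the expected behavior.
ddf = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=10)

# head() reads a single partition by default (npartitions=1),
# whereas the "select * from weather limit 5" above appears to
# materialize all 445 partitions before trimming to 5 rows.
ddf.head(5)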

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7

Top GitHub Comments

1 reaction
charlesbluca commented, Aug 16, 2022

although we can also get the length of each partition with partition_borders = df.map_partitions(lambda x: len(x)).divisions

That’s a great point that I hadn’t considered! Knowing that, I think we can probably update our current method for computing LIMITs on main, which, as @ayushdg noted, has a chance of forcing 2 full graph computes:

https://github.com/dask-contrib/dask-sql/blob/3ac2f6df6f806e2c2fc5a9ff4b4bfbaa84d931cb/dask_sql/physical/rel/logical/limit.py#L53-L58

If we replace the assignment of first_partition_length with a call to compute partition_borders, then we will have all the partition lengths and guarantee that we can compute the limit with df.head(...).
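A rough sketch of that idea (illustrative only, with a hypothetical apply_limit helper; not the actual dask-sql code in limit.py) might look like this, assuming a Dask DataFrame df and an integer limit:

import numpy as np

def apply_limit(df, limit):
    # One compute to get every partition's length
    # (the "partition_borders" idea from the quoted comment).
    partition_lengths = df.map_partitions(len).compute()
    cumulative = np.cumsum(partition_lengths)
    # Smallest number of leading partitions whose combined length covers the limit.
    npartitions = min(int(np.searchsorted(cumulative, limit)) + 1, df.npartitions)
    # head() then only has to touch those partitions.
    return df.head(limit, npartitions=npartitions, compute=False)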

Update: never mind, it looks like that change causes some tests in test_join.py and test_select.py to fail.

I currently have a small WIP that seems to be passing; I can open a PR to continue the discussion there 🙂

0 reactions
sarahyurick commented, Aug 16, 2022

Sounds good! I’m also down for continuing the discussion there - I’m not sure why the failures were happening on my end, so it will be helpful to compare.


