Does Koalas respect partitions when filtering on partitioned columns?
I'd posted earlier about how .head() seemed slow for a large dataset compared to a Spark DataFrame's .head() (Prev Issue). That was solved by switching to the distributed index type.
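For reference, here is a minimal sketch of that earlier fix, assuming a recent Koalas where the default index is controlled by the compute.default_index_type option:
import databricks.koalas as ks
# The 'distributed' default index avoids the global ordering pass that
# the default sequence index needs, which is what made .head() slow.
# Note that it is generated with a non-deterministic expression
# (monotonically_increasing_id), which turns out to matter below.
ks.set_option('compute.default_index_type', 'distributed')
kdf = ks.read_table('table_foo')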
However, the distributed index type didn't seem to speed up filters on partition columns. For context, I have a partitioned Parquet table (117M rows, ~1000 partitions) stored on S3, with the Hive metastore pointing to a table name, say table_foo. Let's say it's partitioned by columns p1 and p2.
I also tried querying just the partition columns to see if that would speed it up, but it was still just as slow. This makes me think the partitions are not actually being pruned in the Koalas DataFrame when filtering like below. Is that accurate?
kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')].head()
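One way to check is to look at the physical plan of the filtered frame; a minimal sketch, assuming the kdf above (to_spark() returns the underlying Spark DataFrame, and explain(True) prints the extended plan):
# If pruning works, the FileScan node should list PartitionFilters on
# p1 and p2. If the filter instead sits above a non-deterministic
# Project (the attached default index), Spark cannot push it down and
# the whole table is scanned.
filtered = kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')]
filtered.to_spark().explain(True)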
If I read the table as a Spark DataFrame, it works as expected:
sdf = spark.read.table('table_foo')
I can run something like:
sdf.filter("p1 = 'hello' and p2 = 'world'").head(20)
and that is really fast.
When I filter on partition columns through Koalas, it seems to scan the whole table again. Any tips on whether I'm doing something wrong here?
Top GitHub Comments
Seems like the other default indexes also use non-deterministic expressions, and Spark cannot push filters down past a non-deterministic projection, so the partition filters never reach the scan. We should use index_col as much as possible. I am closing this ticket, as the workaround has been provided for now.
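For reference, a minimal sketch of that workaround, assuming the table has some stable column to serve as the index (the column name id below is hypothetical):
import databricks.koalas as ks
# Supplying index_col avoids attaching the non-deterministic default
# index, so the partition filters can be pushed down to the file scan.
kdf = ks.read_table('table_foo', index_col='id')  # 'id' is hypothetical
# This should now scan only the p1='hello' / p2='world' partitions.
kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')].head()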