Does Koalas respect partitions when filtering on partitioned columns?
I'd posted earlier about how .head() seemed slow for a large dataset compared to a Spark DataFrame's .head() (Prev Issue). That was solved by switching to the distributed index type.
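For reference, here is a minimal sketch of that earlier fix, assuming a recent Koalas where the default index is controlled by the compute.default_index_type option:
import databricks.koalas as ks
# The 'distributed' default index avoids the global ordering pass that
# the default sequence index needs, which is what made .head() slow.
# Note that it is generated with a non-deterministic expression
# (monotonically_increasing_id), which turns out to matter below.
ks.set_option('compute.default_index_type', 'distributed')
kdf = ks.read_table('table_foo')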
However, the distributed index type didn't seem to speed up filters on partition columns. For context, I have a partitioned Parquet table (117M rows, ~1000 partitions) stored on S3, with the Hive metastore pointing to a table name, say table_foo. Let's say it's partitioned by columns p1 and p2.
I also tried querying just the partition columns to see if that would speed it up, but it was still just as slow. This makes me think the partitions are not actually being pruned in the Koalas DataFrame when filtering like below. Is that accurate?
kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')].head()
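One way to check is to look at the physical plan of the filtered frame; a minimal sketch, assuming the kdf above (to_spark() returns the underlying Spark DataFrame, and explain(True) prints the extended plan):
# If pruning works, the FileScan node should list PartitionFilters on
# p1 and p2. If the filter instead sits above a non-deterministic
# Project (the attached default index), Spark cannot push it down and
# the whole table is scanned.
filtered = kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')]
filtered.to_spark().explain(True)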
If I read the table as a Spark DataFrame, it works as expected:
sdf = spark.read.table('table_foo')
I can run something like:
sdf.filter("p1 = 'hello' and p2 = 'world'").head(20)
and that is really fast.
When I filter on partition columns through Koalas, it seems to scan the whole table again. Any tips on whether I'm doing something wrong here?
Top GitHub Comments
Seems like the other default indexes also use non-deterministic expressions, and Spark cannot push filters down past a non-deterministic projection, so the partition filters never reach the scan. We should use index_col as much as possible. I am closing this ticket, as the workaround has been provided for now.
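For reference, a minimal sketch of that workaround, assuming the table has some stable column to serve as the index (the column name id below is hypothetical):
import databricks.koalas as ks
# Supplying index_col avoids attaching the non-deterministic default
# index, so the partition filters can be pushed down to the file scan.
kdf = ks.read_table('table_foo', index_col='id')  # 'id' is hypothetical
# This should now scan only the p1='hello' / p2='world' partitions.
kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')].head()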