Does Koalas respect partitions when filtering on partitioned columns?

I posted earlier about .head() being slow on a large dataset compared with a Spark DataFrame's .head() (previous issue). That was solved by switching to the distributed default index type.

However, the distributed index type did not speed up filters on partitioned columns. For context, I have a partitioned Parquet table (117M rows, ~1000 partitions) stored on S3, with the Hive metastore pointing to a table name, say table_foo. Let's say it is partitioned by columns p1 and p2.

I also tried querying just the partitions to see whether that was faster, but it was just as slow. This makes me think the partitions are not actually being respected in the Koalas DataFrame when using filters like the one below. Is that accurate?

kdf[(kdf['p1']=='hello') & (kdf['p2']=='world')].head()

If I read the table as a Spark DataFrame, it works as expected:

sdf = spark.read.table('table_foo')

I can run something like:

sdf.filter("p1 = 'hello' and p2 = 'world'").head(20)

and that is really fast.

When I filter on partition columns in Koalas, it seems to scan the whole table again. Am I doing something wrong here?
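
One way to check whether a filter is being applied at the partition level is to look at the physical plan: Spark's file scan node lists a PartitionFilters entry, and an empty list there usually means every partition is read. The following is a minimal sketch, not something from the original issue. The helper names explain_text and partition_filters are hypothetical, the two plan strings are made-up samples for illustration, and capturing the output of explain() this way assumes PySpark prints the plan from the Python side.

```python
import contextlib
import io
import re


def explain_text(sdf):
    """Capture whatever sdf.explain() prints and return it as a string.

    Assumption: PySpark's DataFrame.explain() prints via Python's stdout.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        sdf.explain()
    return buf.getvalue()


def partition_filters(plan):
    """Return the contents of each non-empty 'PartitionFilters: [...]' entry."""
    found = re.findall(r"PartitionFilters: \[([^\]]*)\]", plan)
    return [entry for entry in found if entry.strip()]


# Hypothetical plan fragments, written by hand for illustration only.
pruned_plan = (
    "FileScan parquet default.table_foo "
    "PartitionFilters: [isnotnull(p1#10), (p1#10 = hello)], PushedFilters: []"
)
full_scan_plan = (
    "FileScan parquet default.table_foo "
    "PartitionFilters: [], PushedFilters: []"
)

print(partition_filters(pruned_plan))
print(partition_filters(full_scan_plan))
```

With a live Spark session, the same check could be run as partition_filters(explain_text(sdf.filter("p1 = 'hello' and p2 = 'world'"))). An empty result would suggest that the partition columns are not being used to prune the scan.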

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

1 reaction
ueshin commented, Nov 11, 2019

Seems like other default indexes also use non-deterministic expressions. We should use index_col as long as possible.
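
As an illustration of the index_col suggestion above (an editorial sketch, not code from the thread): passing index_col when reading the table tells Koalas to use an existing column as the index instead of attaching a default one. The function name filtered_head is hypothetical, and the index column is assumed to be an ordinary, non-partition column of table_foo, since the issue does not name one. Whether this restores partition pruning may depend on the Koalas and Spark versions in use, so it is worth confirming against the physical plan.

```python
def filtered_head(table, index_col, p1_value, p2_value, n=5):
    """Read a table with an explicit index column, then filter on p1 and p2.

    index_col is assumed to be an existing non-partition column, so that
    p1 and p2 remain regular columns that can be filtered on.
    """
    # Imported lazily so the module loads even where Koalas is not installed.
    import databricks.koalas as ks

    # With index_col set, Koalas does not need to attach a default index.
    kdf = ks.read_table(table, index_col=index_col)
    return kdf[(kdf["p1"] == p1_value) & (kdf["p2"] == p2_value)].head(n)
```

A call might look like filtered_head("table_foo", "id", "hello", "world"), where "id" stands in for whichever column of the table can serve as an index.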

0 reactions
HyukjinKwon commented, Mar 12, 2020

I am closing this ticket as the workaround is provided at this moment.
