Fix ORC Bloom Filter

ORC Bloom filter support is broken in the latest Presto release. Either we should also fix this in the corresponding earlier release branches, or we should state in the release notes that queries on ORC tables that have Bloom filters will not take advantage of them. The support has been broken since Presto release 0.214. After the StreamId changes in the readBloomFilterIndexes method of the StripeReader class, the Bloom filter no longer skips ORC row groups that cannot satisfy the predicate, because of a coding bug: the line below always returns null.

StripeReader.java:

    List<HiveBloomFilter> bloomFilters = bloomFilterIndexes.get(entry.getKey());

@kevinwilfong @dain Please have a look; this has an impact on Presto ORC performance.
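For readers unfamiliar with this failure mode, here is a minimal sketch of how a map lookup can silently return null for every key. It is not Presto's code: `Key`, `Kind`, and the string "filter" values are hypothetical stand-ins, and the assumption that the map is filled with one stream kind but queried with another is only one possible cause, not a confirmed diagnosis of this issue.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class BloomFilterLookupSketch {
    // Hypothetical stand-in for a stream kind enum; not Presto's actual type.
    enum Kind { ROW_INDEX, BLOOM_FILTER }

    // Hypothetical stand-in for StreamId: equality covers column, sequence and kind.
    static final class Key {
        final int column;
        final int sequence;
        final Kind kind;

        Key(int column, int sequence, Kind kind) {
            this.column = column;
            this.sequence = sequence;
            this.kind = kind;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return column == other.column && sequence == other.sequence && kind == other.kind;
        }

        @Override
        public int hashCode() {
            return Objects.hash(column, sequence, kind);
        }
    }

    // Build a map whose keys all carry the BLOOM_FILTER kind.
    static Map<Key, List<String>> bloomFilters() {
        Map<Key, List<String>> map = new HashMap<>();
        map.put(new Key(1, 0, Kind.BLOOM_FILTER), Collections.singletonList("filter-for-column-1"));
        return map;
    }

    // Same column and sequence, but a different kind: the key is not equal, so get() returns null.
    static List<String> lookupWithRowIndexKey() {
        return bloomFilters().get(new Key(1, 0, Kind.ROW_INDEX));
    }

    // A key that matches on all three fields finds the entry.
    static List<String> lookupWithBloomFilterKey() {
        return bloomFilters().get(new Key(1, 0, Kind.BLOOM_FILTER));
    }

    public static void main(String[] args) {
        System.out.println(lookupWithRowIndexKey());
        System.out.println(lookupWithBloomFilterKey());
    }
}
```

In this sketch `equals` and `hashCode` are both correct, yet the lookup still misses, because the key used for `get` differs from the stored key in one field. A null result here would mean "no Bloom filter for this column", so no row groups would be skipped and no error would be raised.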

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

kevinwilfong commented, Jun 4, 2019

Echoing Maria’s comment on the PR and Wenlei’s comment here, could you add a test that demonstrates the problem? It’s not immediately obvious to me why that line would always return null. I’m also concerned this fix wouldn’t work correctly for flat maps.

wenleix commented, Jun 4, 2019

@dilipkasana

> After changes of StreamId in readBloomFilterIndexes method of StripeReader class the Bloom filter does not skip unsatisfied Row Group of ORC due to coding bug as the below line return always null.

I am curious why that is. StreamId just contains column, sequence (always 0 for ORC) and streamKind (should be the same for the same column), right?

It might be that something is wrong with StreamId that keeps it from working as a HashMap key, although I didn’t see anything obviously wrong with its hashCode and equals methods.

Ignoring sequence would cause the Bloom filter to not work correctly for DWRF flat maps.
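The concern about sequence can be illustrated with a small sketch. It assumes, for illustration only, that a flat map stores several streams under the same column and tells them apart by sequence number. The column and sequence values below are made up, and the maps are plain Java collections, not Presto's data structures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequenceKeySketch {
    // Hypothetical streams, each given as {column, sequence}: one column, three sequences.
    static final int[][] STREAMS = {{5, 1}, {5, 2}, {5, 3}};

    // Keying by column alone: later entries overwrite earlier ones.
    static int distinctKeysIgnoringSequence() {
        Map<Integer, String> byColumn = new HashMap<>();
        for (int[] stream : STREAMS) {
            byColumn.put(stream[0], "bloom filter for sequence " + stream[1]);
        }
        return byColumn.size();
    }

    // Keying by (column, sequence): each stream keeps its own entry.
    static int distinctKeysWithSequence() {
        Map<List<Integer>, String> byColumnAndSequence = new HashMap<>();
        for (int[] stream : STREAMS) {
            byColumnAndSequence.put(Arrays.asList(stream[0], stream[1]), "bloom filter for sequence " + stream[1]);
        }
        return byColumnAndSequence.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeysIgnoringSequence());
        System.out.println(distinctKeysWithSequence());
    }
}
```

With the column-only key, only the last Bloom filter written for column 5 survives. Under the stated assumption, a predicate could then be checked against the wrong stream's filter.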
