Fix ORC Bloom Filter
ORC Bloom filter support is broken in the latest Presto release. We should either fix this in the corresponding previous branches as well, or state in the release notes that queries against ORC tables with Bloom filters will not take advantage of them.
The support has been broken since Presto release 0.214.
After the StreamId changes in the readBloomFilterIndexes method of the StripeReader class, the Bloom filter no longer skips ORC row groups that cannot satisfy the predicate. This is due to a coding bug: the line below always returns null.
StripeReader.java
List<HiveBloomFilter> bloomFilters = bloomFilterIndexes.get(entry.getKey());
@kevinwilfong @dain Please have a look. This has an impact on Presto ORC performance.
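To make the symptom concrete, here is a minimal sketch of one way a map lookup like the one above can miss on every call. It is not Presto's code: the Key class, the Kind enum and the sample data are hypothetical stand-ins, and the assumption that the map is filled under one stream kind but queried with a key of another kind is a guess that this issue does not confirm.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class BloomFilterLookupSketch {
    // Hypothetical stream kinds; only the two needed for the illustration.
    enum Kind { ROW_INDEX, BLOOM_FILTER }

    // Hypothetical, simplified composite key (not Presto's actual StreamId).
    static final class Key {
        final int column;
        final int sequence;
        final Kind kind;

        Key(int column, int sequence, Kind kind) {
            this.column = column;
            this.sequence = sequence;
            this.kind = kind;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return column == other.column && sequence == other.sequence && kind == other.kind;
        }

        @Override
        public int hashCode() {
            return Objects.hash(column, sequence, kind);
        }
    }

    // Assumption: bloom filters for column 1 are stored under a BLOOM_FILTER key.
    static Map<Key, List<String>> sampleBloomFilters() {
        Map<Key, List<String>> map = new HashMap<>();
        map.put(new Key(1, 0, Kind.BLOOM_FILTER),
                Arrays.asList("filter for row group 0", "filter for row group 1"));
        return map;
    }

    // Looks up with a ROW_INDEX key: same column and sequence, different kind.
    static List<String> lookupWithRowIndexKey(int column) {
        return sampleBloomFilters().get(new Key(column, 0, Kind.ROW_INDEX));
    }

    // Rebuilds the key with the kind that was used when the map was filled.
    static List<String> lookupWithBloomFilterKey(int column) {
        return sampleBloomFilters().get(new Key(column, 0, Kind.BLOOM_FILTER));
    }

    public static void main(String[] args) {
        System.out.println(lookupWithRowIndexKey(1));           // prints "null"
        System.out.println(lookupWithBloomFilterKey(1).size()); // prints "2"
    }
}
```

In this sketch equals and hashCode are correct, yet the first lookup still misses, because a composite key matches only when every field matches. A null result here means "no bloom filters for this column", so row groups are simply never skipped and the query stays correct but slower.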
Issue Analytics
- Created 4 years ago
- Comments: 8 (8 by maintainers)
Top GitHub Comments
Echoing Maria’s comment on the PR and Wenlei’s comment here, could you add a test that demonstrates the problem? It’s not immediately obvious to me why that line would always return null. I’m also concerned this fix wouldn’t work correctly for flat maps.
@dilipkasana I am curious why that is, since StreamId just contains column, sequence (always 0 for ORC) and streamKind (which should be the same for the same column), right? Something might be incorrect in StreamId that keeps it from working in a HashMap, although I didn’t see anything obviously wrong with its hashCode and equals methods.
Ignoring sequence would cause the bloom filter to not work correctly for DWRF flat maps.
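The concern about ignoring sequence can be illustrated with a small sketch. It assumes, as the comment suggests, that the streams of one DWRF flat map column differ only in their sequence value. Both key classes below are hypothetical and are not Presto's StreamId.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class SequenceKeySketch {
    // Hypothetical key that compares both column and sequence.
    static final class Key {
        final int column;
        final int sequence;

        Key(int column, int sequence) {
            this.column = column;
            this.sequence = sequence;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return column == other.column && sequence == other.sequence;
        }

        @Override
        public int hashCode() {
            return Objects.hash(column, sequence);
        }
    }

    // Hypothetical key that carries a sequence but ignores it when comparing.
    static final class KeyIgnoringSequence {
        final int column;
        final int sequence;

        KeyIgnoringSequence(int column, int sequence) {
            this.column = column;
            this.sequence = sequence;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof KeyIgnoringSequence
                    && column == ((KeyIgnoringSequence) o).column;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(column);
        }
    }

    // One map entry per sequence survives when sequence is part of the key.
    static int entriesWithSequence(int column, int sequenceCount) {
        Map<Key, String> map = new HashMap<>();
        for (int s = 0; s < sequenceCount; s++) {
            map.put(new Key(column, s), "bloom filter for sequence " + s);
        }
        return map.size();
    }

    // Later puts overwrite earlier ones when sequence is ignored.
    static int entriesIgnoringSequence(int column, int sequenceCount) {
        Map<KeyIgnoringSequence, String> map = new HashMap<>();
        for (int s = 0; s < sequenceCount; s++) {
            map.put(new KeyIgnoringSequence(column, s), "bloom filter for sequence " + s);
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesWithSequence(3, 3));     // prints "3"
        System.out.println(entriesIgnoringSequence(3, 3)); // prints "1"
    }
}
```

With sequence ignored, the three entries collapse into one, so a lookup for any sequence of that column would return whichever bloom filter was stored last. That matches the worry raised above about a fix that drops sequence from the comparison.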