Optimize Hive Connector's file path filtering mechanism
Currently, the Hive Connector has a mechanism to filter the file paths to be scanned:
private Iterator<HiveFileInfo> getLocatedFileStatusRemoteIterator(Path path, PathFilter pathFilter)
{
    try (TimeStat.BlockTimer ignored = namenodeStats.getListLocatedStatus().time()) {
        return Iterators.filter(new FileStatusIterator(path, listDirectoryOperation, namenodeStats), input -> pathFilter.accept(input.getPath()));
    }
}
The filtering uses PathFilter, which only receives the path and so drops the information that is already available in HiveFileInfo, e.g. HiveFileInfo#isDirectory(). I'd suggest we do the filtering using HiveFileInfo instead:
public interface HiveFileInfoFilter
{
    boolean accept(HiveFileInfo file);
}
If you are OK with this, I'd be happy to contribute a patch for it.
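To illustrate what a file-info-based filter could express that a path-only filter cannot, here is a minimal, self-contained sketch. It does not use Presto's classes: `FileInfo` and `FileInfoFilter` are hypothetical stand-ins for `HiveFileInfo` and the proposed `HiveFileInfoFilter`, and the ".orc" suffix rule and the sample paths are invented for the example. The filter keeps every directory and applies the suffix rule only to regular files, which relies on `isDirectory()` being available at filter time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class HiveFileInfoFilterSketch
{
    // Hypothetical stand-in for HiveFileInfo; only the two accessors used below.
    static final class FileInfo
    {
        private final String path;
        private final boolean directory;

        FileInfo(String path, boolean directory)
        {
            this.path = path;
            this.directory = directory;
        }

        String getPath()
        {
            return path;
        }

        boolean isDirectory()
        {
            return directory;
        }
    }

    // Same shape as the proposed HiveFileInfoFilter, over the stand-in type.
    interface FileInfoFilter
    {
        boolean accept(FileInfo file);
    }

    // Collects the paths of the entries the filter accepts, in listing order.
    static List<String> acceptedPaths(Iterator<FileInfo> files, FileInfoFilter filter)
    {
        List<String> accepted = new ArrayList<>();
        while (files.hasNext()) {
            FileInfo file = files.next();
            if (filter.accept(file)) {
                accepted.add(file.getPath());
            }
        }
        return accepted;
    }

    // Runs the example filter over a small, made-up directory listing.
    public static List<String> demo()
    {
        List<FileInfo> listing = Arrays.asList(
                new FileInfo("/warehouse/t/part-0000.orc", false),
                new FileInfo("/warehouse/t/ds=1", true),
                new FileInfo("/warehouse/t/_SUCCESS", false));

        // Directories always pass; the suffix rule applies to regular files only.
        FileInfoFilter filter = file -> file.isDirectory() || file.getPath().endsWith(".orc");
        return acceptedPaths(listing.iterator(), filter);
    }

    public static void main(String[] args)
    {
        System.out.println(demo());
    }
}
```

With a path-only filter, the same rule would need an extra file system call per entry to find out whether the path is a directory.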
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
@mbasmanova OK, I will submit a PR soon. After the filtering mechanism is optimized, our private fork of Presto will plug a piece of code into BackgroundSplitLoader, which will construct a HiveFileInfoFilter using the following logic:
And this pattern will be passed along into HiveFileIterator.
@xumingming Got it. Thanks for explaining. Looking forward to a PR.