Optimize Hive Connector's file path filtering mechanism


Currently the Hive Connector has a mechanism to filter the file paths to be scanned:

    private Iterator<HiveFileInfo> getLocatedFileStatusRemoteIterator(Path path, PathFilter pathFilter)
    {
        try (TimeStat.BlockTimer ignored = namenodeStats.getListLocatedStatus().time()) {
            return Iterators.filter(new FileStatusIterator(path, listDirectoryOperation, namenodeStats), input -> pathFilter.accept(input.getPath()));
        }
    }

The filtering uses PathFilter, which only receives the path and so drops the information already available in HiveFileInfo, e.g. HiveFileInfo#isDirectory(). I'd suggest we do the filtering on HiveFileInfo instead:

    public interface HiveFileInfoFilter
    {
        boolean accept(HiveFileInfo file);
    }
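
To illustrate what the extra information buys, here is a minimal, self-contained sketch. It does not use Presto's real classes: `SimpleHiveFileInfo` is a hypothetical stand-in that models only a path and a directory flag, and `VISIBLE_FILES_ONLY` and `filter` are names invented for this example. The hidden-file rule (names starting with `_` or `.`) is an assumption borrowed from the common Hadoop convention, not something stated in this issue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class HiveFileInfoFilterSketch
{
    // Hypothetical stand-in for HiveFileInfo: only the path and the directory flag are modeled.
    static final class SimpleHiveFileInfo
    {
        private final String path;
        private final boolean directory;

        SimpleHiveFileInfo(String path, boolean directory)
        {
            this.path = path;
            this.directory = directory;
        }

        String getPath()
        {
            return path;
        }

        boolean isDirectory()
        {
            return directory;
        }
    }

    // Same shape as the interface proposed above, but over the stand-in type.
    interface HiveFileInfoFilter
    {
        boolean accept(SimpleHiveFileInfo file);
    }

    // Example filter that needs isDirectory(), which a path-only PathFilter cannot see.
    static final HiveFileInfoFilter VISIBLE_FILES_ONLY =
            file -> !file.isDirectory() && !isHidden(file.getPath());

    // Assumption: a file is "hidden" if its last path segment starts with '_' or '.'.
    static boolean isHidden(String path)
    {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_") || name.startsWith(".");
    }

    // Collects the entries accepted by the filter, preserving iteration order.
    static List<SimpleHiveFileInfo> filter(Iterator<SimpleHiveFileInfo> files, HiveFileInfoFilter filter)
    {
        List<SimpleHiveFileInfo> accepted = new ArrayList<>();
        while (files.hasNext()) {
            SimpleHiveFileInfo file = files.next();
            if (filter.accept(file)) {
                accepted.add(file);
            }
        }
        return accepted;
    }
}
```

In the real connector the filter would presumably be handed to `Iterators.filter` in place of the `pathFilter.accept(input.getPath())` lambda; the eager list here only keeps the sketch free of dependencies.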

If you guys are ok with this, I’d be happy to contribute a patch to optimize this.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

xumingming commented, Jul 1, 2020

@mbasmanova OK, I will submit a PR soon. After the filtering mechanism is optimized, our private fork of Presto will plug a piece of code into BackgroundSplitLoader, which will construct a HiveFileInfoFilter using the following logic:

    List<String> patterns = <Get the filename pattern from Hive Metastore>
    hiveFileInfoFilter = new FilePatternHiveFileInfoFilter(patterns);

And this pattern will be passed along into HiveFileIterator.
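
The comment does not say how FilePatternHiveFileInfoFilter matches files, so the following is only a guess at its shape. It assumes each pattern is a Java regular expression that must match the whole file name, and that a file is accepted when any pattern matches. `SimpleHiveFileInfo` and its `getFileName()` helper are hypothetical stand-ins, not Presto's HiveFileInfo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FilePatternFilterSketch
{
    // Hypothetical stand-in for HiveFileInfo: only the path is modeled.
    static final class SimpleHiveFileInfo
    {
        private final String path;

        SimpleHiveFileInfo(String path)
        {
            this.path = path;
        }

        // Returns the last path segment, i.e. the file name.
        String getFileName()
        {
            return path.substring(path.lastIndexOf('/') + 1);
        }
    }

    interface HiveFileInfoFilter
    {
        boolean accept(SimpleHiveFileInfo file);
    }

    // Assumed semantics: accept a file if its name fully matches at least one regex.
    static final class FilePatternHiveFileInfoFilter
            implements HiveFileInfoFilter
    {
        private final List<Pattern> patterns = new ArrayList<>();

        FilePatternHiveFileInfoFilter(List<String> patterns)
        {
            // Compile once up front so accept() does no regex compilation per file.
            for (String pattern : patterns) {
                this.patterns.add(Pattern.compile(pattern));
            }
        }

        @Override
        public boolean accept(SimpleHiveFileInfo file)
        {
            String name = file.getFileName();
            for (Pattern pattern : patterns) {
                if (pattern.matcher(name).matches()) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

Under these assumptions an empty pattern list rejects every file; a real implementation might instead treat "no patterns" as "accept all".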

mbasmanova commented, Jul 1, 2020

@xumingming Got it. Thanks for explaining. Looking forward to a PR.
