
Aria Scan Optimizations


Query performance is often dominated by table scan, so improving scan efficiency is important. We propose to change the scan implementation to avoid doing extra work and producing extra data: https://code.fb.com/data-infrastructure/aria-presto/

  • Evaluate simple filters (TupleDomain) directly on encoded (ORC) data without first producing a Block with all the values.
    • SELECT a FROM t WHERE b > 3 ← never produce a Block for ‘b’
  • Monitor the performance of filters (simple and complex) and dynamically change the order in which columns are read so that the most efficient filters run first (e.g. filters that drop the most rows in the fewest cycles); a small sketch of this adaptation follows the list.
    • SELECT * FROM t WHERE a LIKE '%opera%' AND b = 5 ← evaluate b = 5 first (it is a lot cheaper than LIKE) and read a only for rows that passed b = 5 filter
  • Prune complex data types (arrays, maps, structs) as early as possible and avoid producing values that are not used in the query.
    • SELECT a[2] FROM t ← only read the value at index 2 from each row and produce an array with at most 2 elements where element 1 is null
    • SELECT a["op"] FROM t ← only read map values for key "op" and produce a map with at most 1 entry (the original map may have thousands of keys, e.g. a feature map)
  • Put a cap on the amount of data read into a single page by monitoring filter selectivity and average row size and adjusting the number of rows to read into the next page.
  • Read as many row groups as necessary to fill in a decent size page. (Dynamically adapt to a particular query.)
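
A minimal sketch of the adaptive filter reordering idea, outside of any Presto API: each filter keeps running statistics on how many rows it drops and how long it takes, and the scan periodically re-sorts the filters so the cheapest, most selective ones run first on the next batch. The names here (AdaptiveFilterOrdering, FilterStats) are illustrative only.

import java.util.Comparator;
import java.util.List;

// Illustrative only: tracks how effective each filter is (rows dropped per unit
// of time) and reorders the filter list so the best ones are evaluated first.
final class AdaptiveFilterOrdering
{
    static final class FilterStats
    {
        final String column;    // column the filter applies to
        long inputRows;         // rows the filter has seen so far
        long droppedRows;       // rows it eliminated
        long nanos;             // time spent evaluating it

        FilterStats(String column)
        {
            this.column = column;
        }

        void record(long input, long dropped, long elapsedNanos)
        {
            inputRows += input;
            droppedRows += dropped;
            nanos += elapsedNanos;
        }

        // Fraction of input rows that the filter eliminates.
        double dropRate()
        {
            return inputRows == 0 ? 0 : (double) droppedRows / inputRows;
        }

        // Rows dropped per nanosecond of work; higher means run it earlier.
        double efficiency()
        {
            return nanos == 0 ? 0 : (double) droppedRows / nanos;
        }
    }

    // Re-sort so the most efficient filters run first on the next batch.
    static void reorder(List<FilterStats> filters)
    {
        filters.sort(Comparator.<FilterStats>comparingDouble(FilterStats::efficiency).reversed());
    }
}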

The implementation consists of three major parts:

  • Add an optimizer rule to determine subscripts/subfields used in the query and store this information in ColumnHandle - #12926
public interface ColumnHandle
{
    /**
     * Applies to columns of complex types: arrays, maps and structs. When a query
     * uses only some of the subfields, the engine provides the complete list of
     * required subfields and the connector is free to prune the rest.
     * <p>
     * Examples:
     *  - SELECT a[1], b['x'], x.y.z FROM t
     *  - SELECT a FROM t WHERE b['y'] > 10
     * <p>
     * Pruning must preserve the type of the values and support unmodified access.
     * <p>
     * - Pruning a struct means populating some of the members with null values.
     * - Pruning a map means dropping keys not listed in the required subfields.
     * - Pruning arrays means dropping values with indices larger than maximum
     * required index and filling in remaining non-required indices with nulls.
     */
    @Experimental
    default ColumnHandle withRequiredSubfields(List<Subfield> subfields)
    {
        return this;
    }
}
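
For illustration, a connector-side implementation of the interface above might simply carry the required subfields on its column handle so the page source can prune nested data later. The ExampleColumnHandle class below is a hypothetical sketch, not the actual Hive connector code (equals/hashCode and other connector-specific state are omitted).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a column handle that remembers which subfields the
// engine says are required, so the reader can prune maps/arrays/structs.
final class ExampleColumnHandle
        implements ColumnHandle
{
    private final String name;
    private final List<Subfield> requiredSubfields;

    ExampleColumnHandle(String name, List<Subfield> requiredSubfields)
    {
        this.name = name;
        this.requiredSubfields = Collections.unmodifiableList(new ArrayList<>(requiredSubfields));
    }

    @Override
    public ColumnHandle withRequiredSubfields(List<Subfield> subfields)
    {
        // Return a copy that carries the pruning information; the page source
        // consults getRequiredSubfields() when materializing nested values.
        return new ExampleColumnHandle(name, subfields);
    }

    public List<Subfield> getRequiredSubfields()
    {
        return requiredSubfields;
    }
}
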
  • Introduce a pushdownFilter metadata API to push the filter sitting on top of the table scan down into the connector - #12875
    /**
     * Experimental: if true, the engine will invoke pushdownFilter instead of getTableLayouts.
     *
     * This interface can be replaced with a connector optimizer rule once the engine supports these (#12546).
     */
    @Experimental
    default boolean isPushdownFilterSupported(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        return false;
    }

    /**
     * Experimental: returns table layout that encapsulates the given filter.
     *
     * This interface can be replaced with a connector optimizer rule once the engine supports these (#12546).
     */
    @Experimental
    default ConnectorPushdownFilterResult pushdownFilter(ConnectorSession session, ConnectorTableHandle tableHandle, RowExpression filter, Optional<ConnectorTableLayoutHandle> currentLayoutHandle)
    {
        throw new UnsupportedOperationException();
    }
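
For context, a connector that opts in might override these methods roughly as below. This is an illustrative fragment, not the Hive connector's actual code: extractTupleDomain, stripTupleDomainConjuncts, and buildLayoutForFilter are hypothetical helpers, the session property name is assumed, and the shape of the ConnectorPushdownFilterResult construction is assumed as well.

    // Illustrative fragment only; these would live in the connector's
    // ConnectorMetadata implementation. Helper names are hypothetical.
    @Override
    public boolean isPushdownFilterSupported(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        // Opt in, e.g. behind a connector session property (name assumed).
        return session.getProperty("pushdown_filter_enabled", Boolean.class);
    }

    @Override
    public ConnectorPushdownFilterResult pushdownFilter(ConnectorSession session, ConnectorTableHandle tableHandle, RowExpression filter, Optional<ConnectorTableLayoutHandle> currentLayoutHandle)
    {
        // Split the filter into a simple range part (TupleDomain) that readers
        // can evaluate directly on encoded data, plus a remaining expression that
        // runs as a filter function; remember both in the connector's layout.
        TupleDomain<ColumnHandle> domain = extractTupleDomain(filter);                            // hypothetical helper
        RowExpression remainingFilter = stripTupleDomainConjuncts(filter);                        // hypothetical helper
        ConnectorTableLayout layout = buildLayoutForFilter(tableHandle, domain, remainingFilter); // hypothetical helper
        return new ConnectorPushdownFilterResult(layout, remainingFilter);                        // constructor shape assumed
    }
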
  • Implement new scan logic in the Hive connector
    • Add support for filter functions - #13066
    • Implement filter reordering
    • Implement batch size adaptation
    • Implement SelectiveStreamReaders for the supported column types (a minimal sketch of the reader shape follows this list)
    • Implement simple filters on subfields
    • Implement subfield pruning
    • Implement efficient skipping
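
To make the SelectiveStreamReader idea concrete, here is a sketch of the shape such a reader might take. This is not the actual Presto interface, just an illustration: the reader decodes only the requested positions, applies its pushed-down filter while decoding, and reports which positions survived so the next column reads even less.

// Illustrative sketch only, not the actual Presto SelectiveStreamReader API.
interface SelectiveColumnReader
{
    // Decode the given positions (row offsets within the current row group),
    // evaluate the pushed-down filter on the fly, and return how many survived.
    int read(int[] positions, int positionCount);

    // Positions that passed the filter; input to the next column's read().
    int[] getOutputPositions();

    // Values for surviving positions only; never materialized for dropped rows.
    Block getBlock(int[] positions, int positionCount);

    // Skip ahead without decoding, e.g. when statistics or earlier filters
    // eliminate a whole row group.
    void skip(long rowCount);
}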

A prototype of the above functionality is currently available in the aria-scan-research branch.

CC: @oerling @tdcmeehan @elonazoulay @yingsu00 @nezihyigitbasi @arhimondr @zhenxiao @bhhari @sayhar

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
oerling commented, Apr 26, 2019

The point of Aria scan is to do everything right in the space of accessing columnar data sources.

One could liken this to BigQuery, except that we are a little more sophisticated here with filtering inside repeated structures and with adaptivity. We can think of this as a compilation of best practices from the column store world.

Filtering at the column level and maintaining a qualifying set of surviving rows is standard. The fact that baseline Presto does not do this is a culture shock to anybody coming from the database world. Reordering simple filters is also common. Running filter expressions as soon as their arguments are available is another self-evident matter.
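
A small sketch of the qualifying-set idea, independent of Presto internals: the scan keeps an array of row positions that have survived all filters applied so far, each column filter reads only those positions, and the set shrinks as filters are applied.

import java.util.function.IntPredicate;

// Illustrative only: the set of row positions that survived all filters so far.
// Each column filter reads just these positions and narrows the set further.
final class QualifyingSet
{
    private final int[] positions;
    private int size;

    QualifyingSet(int rowCount)
    {
        // Initially every row in the batch qualifies.
        positions = new int[rowCount];
        for (int i = 0; i < rowCount; i++) {
            positions[i] = i;
        }
        size = rowCount;
    }

    // Apply a per-row filter to the surviving positions only, compacting in place.
    void filter(IntPredicate keepRow)
    {
        int out = 0;
        for (int i = 0; i < size; i++) {
            if (keepRow.test(positions[i])) {
                positions[out++] = positions[i];
            }
        }
        size = out;
    }

    int size()
    {
        return size;
    }

    int[] getPositions()
    {
        return positions;
    }
}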

There is no intrinsic merit to Presto bytecode generation in filters. This is just a loop over rows with ifs, virtual functions and LazyBlock checks inside.

Breaking functions into sets that depend on discrete sets of columns and running these early and adaptively is just common sense. This does not particularly affect bytecode generation or its use. It just replaces an arbitrary static order with a dynamic one and makes error behavior consistent and order independent, i.e. not dependent on the order in which the application writing the queries happens to place the filters.

The generated bytecode is not particularly good, but this is not a pressing issue at this point because the major inefficiencies are in the algorithmic structure of the scan in general, not in the bytecode specifically.

The connector interface is a minimal interface that corresponds to something like a key value store. The implicit assumption is that the thing one connects to can be out of process.

This is not good for federation, and it is also not good for potentially smart in-process data sources like the columnar readers. Degrading is not the word here: further degrading the functionality would make this entirely unusable. Of course, here degrading is used to mean adapting to outside reality; normally this would be called upgrading.

Indeed, doing what makes sense through the connector interface is one of the top asks out there. And of course people do what they must on an ad hoc basis in any case, e.g. Uber, Alibaba.

Could one do something smart with columnar storage on the engine side, as opposed to the connector side? If the connector interface is something that could conceivably go out of process, the answer is no. If the connector interface produced column readers that had an explicitly in-process interface for seeking to positions and calling a callable for selected values, then the answer is maybe.

The natural interface for federation is a query language plus array parameters. The natural interface for scanning columnar data is a function that loops over positions and applies an operation to the values found there. These are at opposite ends of the stack. If a connector did both of these, then these might just as well be two different concepts.

One could conceivably consider a single column scan as an operator. The positions for selected values would be foreign keys of a sort, consumed by the operator that scans the next column. A qualifying set would be an intermediate result column. Such schemes have been considered. MonetDB comes close to doing something like this in its intermediate language.

There are however coordination opportunities between columns that are unlike anything you find in other operators. For example, the notion of prefiltering on row group statistics. This would cause the interface to be quite a bit wider than is normal between operators. This is why I am not aware of anybody doing this. A column reader is its own kind of entity, not an operator.

Having said this, it is conceivably possible to have a column scan engine that scans and filters different formats. As long as a column exposes a seek and a scan of selected offsets, any format will do. This does not go very far, though, because as soon as we have nested data there are differences between, say, ORC and Parquet that make anything dealing with nested data diverge.

LazyBlock is just a mistake meant to cover up another mistake. The original mistake is not filtering by the column. The follow-up mistake is covering for this by skipping IO or materialization for columns that are very sparsely accessed. Both originate in the misguided idea that one can have a table-wide boundary between reading and selecting.

Suppose one attached a set of actually needed rows to LazyBlock. The place that triggers loading is the filter. This is a row-wise loop that accesses block after block until some condition is false. This model cannot propagate selectivity from the first filter to the last, because to do that one would have to run the filters a column at a time. This is precisely what is accomplished in Aria by breaking the code-generated filter into distinct loops that depend on distinct inputs. And of course we run simple filters that compare columns to constants first, before materializing anything. So the place where a LazyBlock could be qualified by selected rows would be after the filters, where the blocks are loaded anyway, so there would really be no point in laziness in any of these situations.

Some people have speculated on returning LazyBlocks from the scan operator. Again, if the next operator were a very selective hash join, for example, there could be some point to this. But it is much easier to consider a selective hash join as just a filter expression and run it in the mix of filter expressions. If the hash table is unique, as it is in fact-dimension joins, it is truly a filter. If it is not unique but is still selective, one can still do a prefilter like a Bloom filter or such. This is a time-honored tradition, ever since the invisible join in Daniel Abadi's foundational thesis, or even before then.
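
As a toy illustration of that last point (not Presto code): when the build side of a fact-dimension join is known, probing an approximate membership structure over the build keys is just another cheap row filter that can run in the same column-at-a-time filter mix as ordinary predicates. The class below is a simple single-hash bitmap filter; the names are illustrative only.

import java.util.BitSet;

// Toy illustration only: a selective join treated as a row filter. A compact
// membership test over the build-side keys runs alongside ordinary predicates.
final class JoinKeyPrefilter
{
    private final BitSet bits;
    private final int mask;

    JoinKeyPrefilter(long[] buildSideKeys, int sizePowerOfTwo)
    {
        bits = new BitSet(sizePowerOfTwo);
        mask = sizePowerOfTwo - 1;
        for (long key : buildSideKeys) {
            bits.set(hash(key) & mask);
        }
    }

    // May return true for keys not on the build side (false positives), but
    // never false for keys that are, so it is safe to use as a prefilter.
    boolean mightMatch(long probeKey)
    {
        return bits.get(hash(probeKey) & mask);
    }

    private static int hash(long key)
    {
        long h = key * 0x9E3779B97F4A7C15L;  // cheap multiplicative mixing
        return (int) (h ^ (h >>> 32));
    }
}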

From: Andrii Rosa. Sent: Tuesday, April 9, 2019, 7:18 AM. Subject: Re: [prestodb/presto] Aria Scan Optimizations (#12585)

(1) optimizer rule to extract subfield paths used in the query;

https://user-images.githubusercontent.com/5570988/55806640-a9bc2580-5aae-11e9-9eb7-bc183e927763.png

(2) SPI changes to pushdown filters and referenced subfield paths;

https://user-images.githubusercontent.com/5570988/55806640-a9bc2580-5aae-11e9-9eb7-bc183e927763.png

(3) a set of helper functions to compile filters and extract TupleDomain conjuncts.

That requires some discussion.

Currently Presto has a clear level of abstraction between the engine and the data source. The responsibility of the data source is only to provide the data to the engine. The responsibility of the engine is to process the data (apply filters, aggregations, joins). Moving filter evaluation to the connector degrades this clear separation, as it effectively moves part of the execution into the connector itself.

Also, the compiler framework Presto has is a very thoroughly designed and engineered piece of software. We should think very hard before deciding to go and re-implement that functionality in the Hive connector. We don't want to lose our current optimizations in favour of new ones. Run-time filter reordering is a very promising optimization, but while implementing it we don't want to lose the very efficient compilation of complicated expression trees that we have right now.

a set of helper functions

If this can be done with a set of helper functions, it implies some level of abstraction. Instead of inventing a new level of abstraction, does it make sense to first identify what is wrong with the abstraction we have so far? If there is something wrong, is there a way to fix it?

Re: cap - LazyBlock doesn't provide any cap on the total memory needed to load it. In the presence of very large rows, loading a single LazyBlock worth of 10K rows may exceed memory limits. The new logic will adapt the number of rows being read based on observed row size and filter selectivity. Each column reader will then be given the number of rows to read and the maximum memory to use. If the memory limit is reached before the requested number of rows have been read, a retry with a dramatically reduced number of rows will occur. The number of rows will increase gradually afterwards (if the data allows).
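
A sketch of that adaptation logic, using hypothetical names (BatchSizeController is not a real Presto class): grow the batch gradually while the memory budget allows, and shrink it sharply on an overrun before retrying.

// Illustrative only: picks how many rows to read into the next page based on
// observed average row size, a memory budget, and whether the previous read
// hit the budget.
final class BatchSizeController
{
    private static final int MIN_ROWS = 1;
    private static final int MAX_ROWS = 10_000;

    private final long maxBytesPerBatch;
    private int rowsToRead = 1_000;

    BatchSizeController(long maxBytesPerBatch)
    {
        this.maxBytesPerBatch = maxBytesPerBatch;
    }

    int nextBatchSize()
    {
        return rowsToRead;
    }

    // Called after each read with what actually happened.
    void update(long bytesProduced, int rowsProduced, boolean memoryLimitHit)
    {
        if (memoryLimitHit) {
            // Retry with a dramatically reduced batch.
            rowsToRead = Math.max(MIN_ROWS, rowsToRead / 10);
            return;
        }
        if (rowsProduced == 0) {
            return;
        }
        long averageRowSize = Math.max(1, bytesProduced / rowsProduced);
        // Aim for the memory budget, but grow at most 2x per step.
        long target = maxBytesPerBatch / averageRowSize;
        long next = Math.min(Math.min(target, (long) rowsToRead * 2), MAX_ROWS);
        rowsToRead = (int) Math.max(MIN_ROWS, next);
    }
}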

If we implemented the ReallyLazyBlock there would be no need to load 10K rows at a time. Is there a clear explanation why something like ReallyLazyBlock is not an option?


1 reaction
mbasmanova commented, Dec 11, 2019

The new scan is largely complete. Use session properties to enable it:

set session pushdown_subfields_enabled=true;
set session hive.pushdown_filter_enabled=true;