Proposal on Hash Join and Memory Management Efficiency Improvements
The following is a proposal from @oerling, posted below with only minor formatting edits. It is presented here for the community to comment on and provide feedback related to the proposal.
Explorations in Hash Join and Memory Management
Orri Erling, Oct 26, 2018
Introduction
This document describes a series of experiments around memory management and memory bandwidth. The test scenario is building and probing a hash table as in a hash join. We demonstrate a CPU friendly hash join implementation based on generating multiple concurrent, data independent cache misses. We use this against different hash table sizes to quantify cache effects. Then we compare Java heap and off-heap memory. We further quantify the cost of always allocating new memory for inter-operator results as opposed to reusing the same memory for consecutive batches.
We measure GC pressure from the creation of short lived Blocks and Pages and from long lived hash table components. In order to accentuate the impact of long lived memory, we build the hash tables and probe each entry once. This is atypical for a real hash join, where the probe is expected to be much larger than the input and often contain probe keys without corresponding build keys.
The hash table is structured as follows:
- Array of 8 byte status words. Each contains eight 7-bit extracts of the hash numbers for the entries at the corresponding positions in the hash table.
- Array of pointers to build side rows. These correspond position-wise to the status bytes in the previous array.
- Build side rows, structured as a set of 128K memory slabs that hold the actual build side rows back to back.
We could in principle do away with the array of pointers if we had extra space in the array of build side rows. This would however drive up memory consumption. Now we have only 9 bytes * (1 – load factor) overhead. The alternative would be (1 + fixed length part of build side) * (1 - load factor).
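To make the layout concrete, a rough sketch of the structure in Java follows. The field names and the use of the status byte's high bit as an empty marker are assumptions for illustration, not the actual Presto/Aria code.

```java
import java.util.ArrayList;
import java.util.List;

// Rough shape of the hash table described above. Field names and the
// empty slot marker are assumptions for this sketch.
final class AriaHashTableLayout
{
    static final int SLAB_SIZE = 128 * 1024;

    // One 8-byte status word per group of 8 slots; each byte holds a 7-bit
    // extract of the hash number of the entry in the corresponding slot
    // (high bit assumed to mark an empty slot).
    long[] statusWords;

    // One pointer per slot (encoded as slab index and offset), aligned
    // position-wise with the status bytes above.
    long[] rowPointers;

    // Build side rows stored back to back in 128K slabs.
    final List<byte[]> slabs = new ArrayList<>();
}
```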
The probe picks an 8 byte status word based on the hash number. It transforms the status word into a bit set indicating matches of the hash number extracts. It then looks at the first match.
The processing of the status word goes as follows:
- Make a word filled with the 7 bit hash number extract: mask = extract; mask |= mask << 8; mask |= mask << 16; mask |= mask << 32;
- Xor the mask with the status word; a hit will be a byte of zeros. Subtract 0x010101…; each zero byte now has its high bit set. AND with 0x808080….
- Xor away the positions where the high bit was already set (empty positions).
- Get the number of trailing zeros. This gives the position of the match. Load the pointer to the row at this position and compare the row with the probe row.
- Clear the lowest bit of the matches bit mask: mask &= mask - 1;
- If there was no hit for the first try, loop over all the matched bytes of the status word. If none hits and the status word has no empty status bytes, repeat with the next status word.
The same can be implemented more efficiently with SSE instructions over a 16 byte status word but Java does not expose the needed intrinsics.
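As a concrete illustration, here is a minimal Java sketch of the 8-byte status word matching following the steps above. It assumes an empty slot is marked by a status byte with its high bit set; the names are illustrative rather than the actual implementation.

```java
// SWAR matching of a 7-bit hash extract against an 8-byte status word.
final class StatusWordMatch
{
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Make a word filled with the 7-bit hash number extract.
    static long makeMask(int extract)
    {
        long mask = extract & 0x7f;
        mask |= mask << 8;
        mask |= mask << 16;
        mask |= mask << 32;
        return mask;
    }

    // Returns a word with the high bit set in each byte whose occupied slot
    // matches the extract. False positives are resolved by comparing the
    // actual build side row.
    static long hits(long statusWord, long mask)
    {
        long x = statusWord ^ mask;          // a matching occupied byte becomes zero
        long zero = (x - ONES) & HIGH_BITS;  // zero bytes get their high bit set
        return zero ^ (x & HIGH_BITS);       // xor away empty slots (high bit already set)
    }
}
```

Candidates are then consumed exactly as in the list above: Long.numberOfTrailingZeros(hits) >> 3 gives the slot index of the first match, and hits &= hits - 1 clears it before trying the next one.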
In most cases there is a fixed length code path for either hit or miss. The false positive chance for the hash number extract is about 6 / 128 for a hash table with 6 of 8 slots in use. The miss is detected in a fixed number of instructions for the case where there was no match of the 7 bit extract and there was at least one empty in the status word.
The probe is divided into the following stages:
- Pre-probe: Calculate the mask with all bytes set to the 7 bit extract of the hash number and fetch the status word.
- First probe: Make the hits bitset. If this is non-empty, compare the build side row to the probe values. This loads the pointer to the build side row, fetches the keys, and compares them with the probe side.
- Full probe: This checks the result from the first probe and if this is a match adds the result to the output. This is the most frequent case. The next most likely case is a miss. The least likely case is a loop over the remaining matches in the status word and possibly next status words.
The loop is unrolled so that if we have at least 4 probes, we have 4 pre-probes, resulting in 4 data independent loads. Then we have 4 first probes of which each depends on the corresponding pre-probe. These store the comparison result in a variable but do not branch on that. E.g. the comparison is like: flag = build == probe[row]. Then we have 4 full probes, each tests the result of the corresponding first probe.
In this way chances are that the load has completed before the result needs to be branched on. The general idea is to issue multiple loads before branching on a result. The first probe has two consecutive misses: It first fetches a pointer to the build side, then dereferences this, then sets a flag based on the result. Testing the flag is deferred so that there is a chance for the data dependent loads to have completed by the time the result is accessed.
In practice we see a 15% speedup from unrolling the loop. The loop is short enough that even without unrolling there will be prefetching for the next iteration from speculative execution.
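A sketch of what the 4-way unrolled probe loop might look like, reusing the StatusWordMatch helpers from the sketch above. To stay short it uses a single long key in place of the multi-column build row, derives the 7-bit extract from the high bits of the hash, and omits the rare paths (further candidates in the status word, next status word, remainder rows); all names are hypothetical.

```java
// Illustrative 4-way unrolled probe: 4 data independent status word loads,
// then 4 comparisons whose results are stored before being branched on.
final class UnrolledProbe
{
    private final long[] statusWords; // one status word per group of 8 slots
    private final long[] buildKeys;   // stand-in for pointers to build side rows, one per slot
    private final int bucketMask;     // statusWords.length is assumed to be a power of two

    UnrolledProbe(long[] statusWords, long[] buildKeys)
    {
        this.statusWords = statusWords;
        this.buildKeys = buildKeys;
        this.bucketMask = statusWords.length - 1;
    }

    // hashes[i] is the hash number of probeKeys[i]; returns the number of hits.
    int probe(long[] probeKeys, long[] hashes)
    {
        int hitCount = 0;
        for (int i = 0; i + 4 <= probeKeys.length; i += 4) {
            // Pre-probe: issue 4 data independent status word loads.
            int b0 = (int) hashes[i] & bucketMask;
            int b1 = (int) hashes[i + 1] & bucketMask;
            int b2 = (int) hashes[i + 2] & bucketMask;
            int b3 = (int) hashes[i + 3] & bucketMask;
            long s0 = statusWords[b0];
            long s1 = statusWords[b1];
            long s2 = statusWords[b2];
            long s3 = statusWords[b3];

            // First probe: compute hit bitsets and compare the first candidate
            // rows, storing the results instead of branching right away.
            long h0 = StatusWordMatch.hits(s0, StatusWordMatch.makeMask((int) (hashes[i] >>> 57)));
            long h1 = StatusWordMatch.hits(s1, StatusWordMatch.makeMask((int) (hashes[i + 1] >>> 57)));
            long h2 = StatusWordMatch.hits(s2, StatusWordMatch.makeMask((int) (hashes[i + 2] >>> 57)));
            long h3 = StatusWordMatch.hits(s3, StatusWordMatch.makeMask((int) (hashes[i + 3] >>> 57)));
            boolean eq0 = h0 != 0 && buildKeys[firstSlot(b0, h0)] == probeKeys[i];
            boolean eq1 = h1 != 0 && buildKeys[firstSlot(b1, h1)] == probeKeys[i + 1];
            boolean eq2 = h2 != 0 && buildKeys[firstSlot(b2, h2)] == probeKeys[i + 2];
            boolean eq3 = h3 != 0 && buildKeys[firstSlot(b3, h3)] == probeKeys[i + 3];

            // Full probe: only now branch on the comparison results. The rare
            // cases (more candidates, next status word) are omitted here.
            if (eq0) hitCount++;
            if (eq1) hitCount++;
            if (eq2) hitCount++;
            if (eq3) hitCount++;
        }
        // A scalar remainder loop for the last probeKeys.length % 4 rows would go here.
        return hitCount;
    }

    private static int firstSlot(int bucket, long hits)
    {
        return bucket * 8 + (Long.numberOfTrailingZeros(hits) >> 3);
    }
}
```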
The build side is stored row-wise. The probe side comes in Blocks. The base array is extracted for each Block and an extra array of row number mapping is used for the cases where this is needed: build == blockBaseArray[rowNumberMap[row]].
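For clarity, the comparison with an optional row number mapping might look like this (names hypothetical):

```java
// Compare a build side key against a probe Block, using a row number mapping
// when the probe positions are remapped (e.g., dictionary or filtered input).
static boolean keyMatches(long buildKey, long[] blockBaseArray, int[] rowNumberMap, int row)
{
    int position = rowNumberMap == null ? row : rowNumberMap[row];
    return buildKey == blockBaseArray[position];
}
```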
There can be a Bloom filter for pre-selecting the hash probes. However, since we already detect most misses by just loading a single word, the gain from the Bloom filter is modest and in some cases this even slows down the probe.
The experiment setup has a hash table with 2 keys and one payload column, all of which are longs. The hash table has provisions for repeating keys, but in the experiments all the build keys are unique. The hash table structure is constant across all experiments. We vary the size, parallelism, and memory allocation for the hash table itself as well as for the build and probe inputs.
This corresponds, for example, to the query select supplycost from lineitem l, partsupp ps where l.partkey = ps.partkey and l.suppkey = ps.suppkey.
The build side rows are either stored as 128K Slices on the Java heap or as 128K chunks of offheap memory. The status word table and the table of pointers to build side rows are either long[] on the Java heap or arrays of longs off-heap. For the case of Slices, we either allocate the Slices and leave them as garbage or keep a pool of reusable Slices for future hash joins.
For inter-operator results, we either reuse one Page with its Blocks for consecutive build/probe/join result Pages or allocate a new Page each time.
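A minimal sketch of the kind of slab pool implied by the reusable-Slices configuration, assuming airlift's Slice; the actual experiment code may differ.

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

import java.util.ArrayDeque;

// Pool of 128K Slices that can be handed back after a hash join instead of
// being left for the garbage collector.
final class SlicePool
{
    private static final int SLAB_SIZE = 128 * 1024;
    private final ArrayDeque<Slice> free = new ArrayDeque<>();

    Slice acquire()
    {
        Slice slice = free.poll();
        return slice != null ? slice : Slices.allocate(SLAB_SIZE);
    }

    void release(Slice slice)
    {
        free.push(slice);
    }
}
```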
Comparison With Current Presto
We compare the slowest case of the Aria table (Slices and always new Pages) with the current Presto hash build and probe operators. We notice that Presto has a fixed build overhead of about 100ms, presumably arising from compiling code; this happens even when several hash tables have exactly the same hash strategy class. Therefore we ignore the result with the smallest size.
Presto new Pages
Build 8388608: 1363 ms
Probe 8388608: 6498 ms 8388608 hits
Aria new Pages
Build 8388608: 1291 ms
Probe 8388608: 2866 ms 8388608 hits
Presto new Pages
Build 33554432: 5337 ms
Probe 33554432: 28803 ms 33554432 hits
Aria new Pages
Build 33554432: 5583 ms
Probe 33554432: 8226 ms 33554432 hits
The above shows build and probe times for 8M and 32M entry table sizes with the current Presto operators and the slowest configuration of the Aria operators, i.e. the case where Pages and Blocks are not reused and where the Slices and arrays that make up the hash table are not reused.
We note that the build times are very similar. This is because Presto just takes ownership of the Pages from the build source whereas Aria copies these into Slices; the Aria hash table creation is otherwise more efficient, hence it completes in about the same time despite the extra copy.
The Aria probe is 2.3x faster at the 8M size and 3.5x faster at the 32M size. The larger table shows more gain presumably because of more cache misses and the wins from having more of these in progress at a time. We note that the Aria layout makes 2.5 misses per probe whereas the Presto layout is at least 4.5 misses. This is because the two build side keys and the payload column are all in different Blocks and must be fetched separately whereas Aria has these contiguously.
Aria Table Results
The results are as follows:
Current Presto
=== Presto Large table 16 threads new pages build 33554432 probe 33554432 threads 693365 ms
Since the build and probe sizes are the same, this gives Presto a better score than it would have in practice, where probe is usually much larger. We recall that Presto and Aria are even at build speed and Aria is notably faster at probe.
Java Heap, New Pages, new build Slices
=== Small table1 threads new pages build 16384 probe 16384 threads 41332 ms
=== Small table16 threads new pages build 16384 probe 16384 threads 64527 ms
=== Large table 16 threads new pages build 33554432 probe 33554432 threads 332751 ms
Java Heap, Pages reused, Build Slices reused
=== Small table1 threads page reuse build 16384 probe 16384 threads 25455 ms
=== Small table16 threads page reuse build 16384 probe 16384 threads 40577 ms
=== Large table 16 threads page reuse build 33554432 probe 33554432 threads 252045 ms
Offheap, New Pages
=== Small table1 threads new pages build 16384 probe 16384 threads 23889 ms
=== Small table16 threads new pages build 16384 probe 16384 threads 41334 ms
=== Large table 16 threads new pages build 33554432 probe 33554432 threads 250983 ms
Offheap, Pages reused
=== Small table1 threads page reuse build 16384 probe 16384 threads 17308 ms
=== Small table16 threads page reuse build 16384 probe 16384 threads 27821 ms
=== Large table 16 threads page reuse build 33554432 probe 33554432 threads 185185 ms
The small table build/probe pairs are run 20480 times per thread. The large table build/probe pairs are run 10 times per thread. In this way each thread does 32M x 10 rows of both build and probe. The difference between the small table and large table experiments is exclusively due to memory bandwidth. The small table (5 * 16K * 8) fits entirely in L3, the large table (5 * 32M * 8) misses all levels of cache more than twice per probe.
The single threaded run time of the 32M entry probe is 4s. This means 8M hits per second. This is very close to the 9.5M data dependent cache misses measured on the test system by traversing a linked list (almost no instructions but each load depends on the completion of the previous load). With an expected near 2.5 misses per probe key we get good utilization of memory bandwidth. The rationale for 2.5 misses is that we have a 32MB table of status words for a 16MB L3 cache. Then we have a 128MB array of pointers to build rows and 512MB of build rows.
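For reference, the back-of-envelope arithmetic behind these figures (the factor of 5 words per entry is taken from the text above):

```latex
\begin{align*}
  \text{small table} &: 5 \times 16\,\mathrm{K} \times 8\,\mathrm{B} \approx 640\,\mathrm{KB} \ll 16\,\mathrm{MB\ (L3)}\\
  \text{large table} &: 5 \times 32\,\mathrm{M} \times 8\,\mathrm{B} \approx 1.3\,\mathrm{GB}\\
  \text{probe rate}  &: 32\,\mathrm{M} \;/\; 4\,\mathrm{s} = 8\,\mathrm{M\ probes/s}
\end{align*}
```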
Considering the single threaded and 16 thread cases of the small table, we go from single core to full use of hyperthreading and from 17s to 27s of time. We get about 10 cores worth of work from 8 cores with hyperthreading. This is in the expected ballpark.
With the 32M entry hash tables, we go from a 4s single threaded probe to a 15s probe with 16 threads. 8 cores deliver about 4 cores of throughput because the execution is bound by memory bandwidth. This is again as expected. The test system has 2 channels of memory per socket.
The test cases with off-heap hash tables allow us to see the GC overhead coming from generating short lived garbage in the form of Pages and Blocks. The 16 thread test with the small hash table goes from 27s to 41s (+51%) as a result of frequent young GC. The test with large hash tables shows a smaller GC impact because the bulk of the time goes to waiting for memory (185s to 250s, +35%). Using off-heap memory is somewhat faster than Slices and Java arrays. The best comparison for this is given by the single threaded small table experiment with Pages and build Slices reused: there we have 17s vs 25s. The on-heap case is however not entirely garbage free because it still allocates the status words and pointer arrays on heap and thus generates GCs. Going to large tables gives a smaller difference because memory bandwidth is an equalizing factor (185s to 250s).
Conclusions
In Presto profiles the GC arising from allocating new Pages/Blocks is not very significant, under 5%, except in extreme cases like select sum (extendedprice) where suppkey = 111;. In the latter case, GC is around 20% because Blocks of suppkey values are generated, tested and immediately thrown away. We see a similar dynamic with the in principle more CPU intensive hash probe of a large table once the hash probe itself is optimized. We infer that once we have a near-optimal implementation of common query operators, e.g. scan, repartition, hash join, aggregation, there will be a 20-30% upside in reusing Pages and Blocks between operators. This is possible in cases where the downstream operators have fully processed a Page of output before consuming the next Page from a producer operator. This means that there will be no hanging references to the previous Page or its constituents. This is the case, for example, with a scan followed by partitioned output or for an exchange followed by hash probe followed by partitioned output.
Another conclusion is that Slices and Java arrays are not radically worse than unmanaged memory. The cost of these is a slowdown of 252s / 185s = 1.36x for large hash tables. There is at least equal optimization potential from generating less garbage. We expect the greatest gains of all to be obtained in optimizing scan by filtering earlier and copying less. This is independent of hash join.
Top GitHub Comments
AFAIK the C2 generated code doesn’t have profiling instructions (C1 should have done the profiling by then). @oerling Please make sure that the code is hot enough and compiled with the C2 compiler. You can confirm that by looking at the compiler logs (e.g.,
-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
).