Varying performance for group by queries
CrateDB version: 4.5.1
Environment description: macOS; started with cr8
Problem description:
Recently a performance problem was reported in the Crate.io Community. I downloaded the provided snapshot and could reproduce the same behaviour.
Further queries showed differences in query duration that I don't fully understand:
Unlimited group by
$> echo "select f from doc.test5kk_025 group by f;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
mean: 6962.284 ± 116.587
Limited group by
$> echo "select f from doc.test5kk_025 group by f limit 1000000;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
mean: 2905.901 ± 34.236
Unlimited group by with values doubled
$> echo "select f*2 as ff from doc.test5kk_025 group by ff;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
mean: 1188.969 ± 79.297
Limited group by with values doubled
$> echo "select f*2 as ff from doc.test5kk_025 group by ff limit 1000000;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
mean: 3831.521 ± 21.244
The table holds 5 million bigint records (values ranging from 18 to 5456946) with around 790k distinct numbers.
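The exact distribution of the original dataset is unknown; assuming a roughly uniform draw from a fixed pool of distinct keys, a stand-in dataset with the same shape can be sketched like this (sizes from the description above, everything else illustrative):

```python
import random

# Hypothetical stand-in for the reported table: 5 million bigints drawn
# from 18..5456946 so that a bit under 790k values end up distinct. The
# real distribution is unknown; uniform sampling from a fixed pool is an
# assumption made here for illustration only.
random.seed(42)
pool = random.sample(range(18, 5_456_947), 790_000)   # candidate distinct keys
rows = random.choices(pool, k=5_000_000)              # 5M rows with repeats

print(len(rows), len(set(rows)))
```

Such rows could then be loaded into `doc.test5kk_025` (e.g. via `COPY FROM` or batched `INSERT`s) to approximate the scenario without the original snapshot.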
Questions:
- Why is “Limited group by” twice as fast as “Unlimited group by”?
- Why is “Unlimited group by with values doubled” over 5 times as fast as “Unlimited group by”?
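For reference, the speedups behind the two questions, computed directly from the reported means (a trivial sketch; the numbers are copied from the runs above):

```python
# Mean runtimes in ms, as reported by cr8 timeit above.
means = {
    "unlimited": 6962.284,
    "limited": 2905.901,
    "unlimited_doubled": 1188.969,
    "limited_doubled": 3831.521,
}

# Speedup of the limited query over the unlimited one (~2.4x),
# and of the doubled-values query over the plain unlimited one (~5.9x).
print(round(means["unlimited"] / means["limited"], 2))            # 2.4
print(round(means["unlimited"] / means["unlimited_doubled"], 2))  # 5.86
```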
Steps to reproduce:
Create Table statement:
CREATE TABLE IF NOT EXISTS "doc"."test5kk_025" (
"f" BIGINT DEFAULT NULL
)
CLUSTERED INTO 35 SHARDS
WITH (
"allocation.max_retries" = 5,
"blocks.metadata" = false,
"blocks.read" = false,
"blocks.read_only" = false,
"blocks.read_only_allow_delete" = false,
"blocks.write" = false,
codec = 'default',
column_policy = 'strict',
"mapping.total_fields.limit" = 1000,
max_ngram_diff = 1,
max_shingle_diff = 3,
number_of_replicas = '0-1',
"routing.allocation.enable" = 'all',
"routing.allocation.total_shards_per_node" = -1,
"store.type" = 'fs',
"translog.durability" = 'REQUEST',
"translog.flush_threshold_size" = 536870912,
"translog.sync_interval" = 5000,
"unassigned.node_left.delayed_timeout" = 60000,
"write.wait_for_active_shards" = '1'
)
See the linked community post for a dataset to reproduce my results.
Issue Analytics
- Created: 2 years ago
- Comments: 14 (9 by maintainers)
Top GitHub Comments
@mfussenegger It would be great if we could pick the hash implementation. I compiled Crate 4.5.1, replacing the hashCode function in LongObjectHashMap with the xxHash implementation from the lz4 library.
The time dropped from 11.6s to 3.4s, which is still higher than the 2.2s I got on doubled keys. The hashing overhead is probably quite big. I also tried Long2ObjectOpenHashMap from fastutil and got results in 2.8s. Maybe the implementation from Netty is not the best one Crate could use ;) It would be great if you could evaluate other hash maps, e.g. fastutil or eclipse-collections, in terms of performance, memory footprint and collision handling.
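As an illustration of why the choice of hash function can matter here, the sketch below contrasts a cheap xor-fold long hash (in the spirit of `java.lang.Long.hashCode`) with an avalanche finalizer (murmur3's fmix64, comparable in spirit to xxHash's final mix). None of this is CrateDB's or Netty's actual code; it only shows that, with a power-of-two bucket count, the cheap fold maps the all-even "doubled" keys onto even slots only, while the avalanche mix also uses the odd slots:

```python
# Illustrative sketch only -- not CrateDB's or Netty's actual hashing code.

MASK64 = (1 << 64) - 1

def xor_fold(v: int) -> int:
    # Cheap fold, as java.lang.Long.hashCode does: v ^ (v >>> 32).
    return (v ^ (v >> 32)) & 0xFFFFFFFF

def fmix64(v: int) -> int:
    # murmur3's 64-bit finalizer; xxHash applies a comparable avalanche step.
    v ^= v >> 33
    v = (v * 0xFF51AFD7ED558CCD) & MASK64
    v ^= v >> 33
    v = (v * 0xC4CEB9FE1A85EC53) & MASK64
    v ^= v >> 33
    return v

def buckets(keys, mix, n_buckets=1 << 20):
    # Power-of-two table: bucket = mix(key) & (n_buckets - 1).
    return {mix(k) & (n_buckets - 1) for k in keys}

doubled = [2 * k for k in range(18, 100_018)]  # f*2-style keys, all even
fold_buckets = buckets(doubled, xor_fold)
mix_buckets = buckets(doubled, fmix64)

print(all(b % 2 == 0 for b in fold_buckets))  # True: odd slots never used
print(all(b % 2 == 0 for b in mix_buckets))   # False: odd slots used as well
```

Whether this bucket pattern actually explains the timings depends on probe sequence, load factor, resize policy and memory layout of the concrete map; it only illustrates the kind of difference a stronger mix makes.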
Hi @kamcio181, we evaluated several more map options; some performed better on your data set, but none performed consistently better than the current implementation across all test data sets.