Varying performance for group by queries

CrateDB version: 4.5.1

Environment description: macOS; started with cr8

Problem description:

Recently a performance problem was reported in the Crate.io Community. I downloaded the provided snapshot and was able to reproduce the same behaviour.

Further queries showed differences in query duration that I don't fully understand:

Unlimited group by

$> echo "select f from doc.test5kk_025 group by f;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
    mean:    6962.284 ± 116.587

Limited group by

$> echo "select f from doc.test5kk_025 group by f limit 1000000;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
    mean:    2905.901 ± 34.236

Unlimited group by with values doubled

$> echo "select f*2 as ff from doc.test5kk_025 group by ff;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
    mean:    1188.969 ± 79.297

Limited group by with values doubled

$> echo "select f*2 as ff from doc.test5kk_025 group by ff limit 1000000;" | cr8 timeit --host http://localhost:4200 -w 5 -r 10
Runtime (in ms):
    mean:    3831.521 ± 21.244

The table holds 5 million BIGINT records (values ranging from 18 to 5,456,946), with around 790k distinct numbers.

Questions:

  • Why is “Limited group by” twice as fast as “Unlimited group by”?
  • Why is “Unlimited group by with values doubled” over 5 times as fast as “Unlimited group by”?

Steps to reproduce:

Create Table statement:

CREATE TABLE IF NOT EXISTS "doc"."test5kk_025" (
   "f" BIGINT DEFAULT NULL
)
CLUSTERED INTO 35 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

See the linked community post for a dataset to reproduce my results.
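
For anyone without access to that post, the following Java sketch is one hypothetical way to generate a table of roughly the same shape (it is not part of the original report; the JDBC URL, credentials, sampling scheme, and batch size are all assumptions). It connects over CrateDB's PostgreSQL wire protocol using the stock PostgreSQL JDBC driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Random;

    public class LoadTestData {
        public static void main(String[] args) throws Exception {
            Random rnd = new Random(42);
            // Pre-draw a pool of ~790k values; sampling 5M rows from it keeps
            // the number of distinct values close to the reported ~790k.
            long[] pool = new long[790_000];
            for (int i = 0; i < pool.length; i++) {
                pool[i] = 18 + (long) (rnd.nextDouble() * 5_456_928);
            }
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost:5432/crate", "crate", "");
                 PreparedStatement stmt = conn.prepareStatement(
                         "INSERT INTO doc.test5kk_025 (f) VALUES (?)")) {
                for (int i = 1; i <= 5_000_000; i++) {
                    stmt.setLong(1, pool[rnd.nextInt(pool.length)]);
                    stmt.addBatch();
                    if (i % 10_000 == 0) {      // flush every 10k rows
                        stmt.executeBatch();
                    }
                }
                stmt.executeBatch();            // flush the remainder
            }
        }
    }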

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

1 reaction
kamcio181 commented, Jun 8, 2021

@mfussenegger It would be great if we could pick the hash implementation. I compiled Crate 4.5.1 with the hashCode function in LongObjectHashMap replaced by the xxHash implementation from the lz4 library:

    // Requires net.jpountz.xxhash.{XXHash32, XXHashFactory} from the lz4
    // library and com.google.common.primitives.Longs from Guava.
    private static final XXHash32 XX_HASH_32 = XXHashFactory.fastestInstance().hash32();
    private static final int SEED = 0x9747b28c;

    /**
     * Returns the hash code for the key.
     */
    private static int hashCode(long key) {
        return XX_HASH_32.hash(Longs.toByteArray(key), 0, Long.BYTES, SEED);
    }

The time dropped from 11.6s to 3.4s, which is still higher than the 2.2s I got with doubled keys, so the hashing overhead is probably significant. I also tried Long2ObjectOpenHashMap from fastutil and got results around 2.8s. Maybe the implementation from Netty is not the best one Crate could use ;) It would be great if you could evaluate other hash maps, e.g. fastutil or Eclipse Collections, for performance, memory footprint, and collision handling. (A standalone sketch for comparing hash behaviour follows below.)
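
For anyone who wants to explore this without patching Crate: below is a minimal, self-contained sketch of my own (not CrateDB or Netty code) that counts linear-probing collisions in a power-of-two open-addressing table for two hash functions. The weak hash mirrors Long.hashCode(), which appears to match the mixing in Netty's LongObjectHashMap; the splitmix64 finalizer stands in for xxHash so no lz4 dependency is needed. It makes it easy to compare how the plain and doubled key sets from this issue spread across the table:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.function.LongToIntFunction;

    public class HashProbeDemo {

        // Weak hash: the same mixing as Long.hashCode(); as far as I can
        // tell this matches Netty's LongObjectHashMap.
        static final LongToIntFunction WEAK = k -> (int) (k ^ (k >>> 32));

        // Stronger hash: the splitmix64 finalizer (a public-domain 64-bit
        // mixer), standing in here for xxHash.
        static final LongToIntFunction STRONG = k -> {
            long z = k;
            z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
            z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
            return (int) (z ^ (z >>> 31));
        };

        /**
         * Inserts all keys into an open-addressing table of the given
         * power-of-two capacity and returns the number of extra probes
         * caused by collisions under linear probing.
         */
        static long totalProbes(long[] keys, LongToIntFunction hash, int capacity) {
            int mask = capacity - 1;
            boolean[] used = new boolean[capacity];
            long probes = 0;
            for (long key : keys) {
                int idx = hash.applyAsInt(key) & mask;
                while (used[idx]) {          // collision: probe the next slot
                    idx = (idx + 1) & mask;
                    probes++;
                }
                used[idx] = true;
            }
            return probes;
        }

        public static void main(String[] args) {
            // Build ~790k distinct keys in 18..5,456,946, plus a doubled
            // copy, mimicking the two key sets from the queries above.
            int n = 790_000;
            long[] plain = new long[n];
            long[] doubled = new long[n];
            Random rnd = new Random(1);
            HashSet<Long> seen = new HashSet<>();
            for (int i = 0; i < n; i++) {
                long v;
                do {
                    v = 18 + (long) (rnd.nextDouble() * 5_456_928);
                } while (!seen.add(v));
                plain[i] = v;
                doubled[i] = 2 * v;
            }
            int capacity = 1 << 21;          // ~0.38 load factor for 790k keys
            System.out.println("weak/plain:     " + totalProbes(plain, WEAK, capacity));
            System.out.println("weak/doubled:   " + totalProbes(doubled, WEAK, capacity));
            System.out.println("strong/plain:   " + totalProbes(plain, STRONG, capacity));
            System.out.println("strong/doubled: " + totalProbes(doubled, STRONG, capacity));
        }
    }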

0 reactions
BaurzhanSakhariev commented, Mar 16, 2022

Hi @kamcio181, we evaluated several more map options; some performed better on your data set, but none performed consistently better than the current implementation across all test data sets.
