
Query performance

See original GitHub issue

CrateDB version: 1.0.5

JVM version:

java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

OS version / environment description:

Linux world-db-21 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

Problem description:

Running a query against a smaller subset of data that was ingested post-1.0.5 is 10x slower than running the same query against a subset that is roughly twice as large but was ingested pre-1.0.5.

Feature description:

First, the total row counts of the data in question. The issue pertains to these two ingestid datasets.

doc=> select count(*)
        from ourdata
       where ingestid = '3a9869fb2963cee218e296db79d8b612';
 count(*)  
-----------
 375092997
(1 row)

Time: 1.986 ms

doc=> select count(*)
        from ourdata
       where ingestid = '9b4cbbe4dd1e0d26debb375328c74354';
 count(*)  
-----------
 867072963
(1 row)

Time: 2.007 ms

Of those ☝️ two ingestids, 9b4... was ingested pre-1.0.5, while 3a9... was ingested post-1.0.5. And those are respectable query times (I picked the best of 5 consecutive queries).

The problem comes when I add to the WHERE clause:

doc=> select count(*)
        from ourdata
       where ingestid = '3a9869fb2963cee218e296db79d8b612'
         and data['t_deviceid'] = '1aab41025c6c62cc20320e65510a464cc9d85c057b38faa8e5c3409f2bcef673'
         and data['t_lastseen'] != '\N';
 count(*) 
----------
     2733
(1 row)

Time: 306.863 ms


doc=> select count(*)
        from ourdata
       where ingestid = '9b4cbbe4dd1e0d26debb375328c74354'
         and data['t_deviceid'] = '46fe779f2cb3d2926b6a4fee98afb7b5db719faab321181bc429f8f133d3d06a'
         and data['t_lastseen'] != '\N';
 count(*) 
----------
     5478
(1 row)

Time: 29.191 ms

The 3a9... ingestid is 10x slower than 9b4..., even though 9b4... has 2x more total rows.

doc=> select count(*)
        from ourdata
       where ingestid = '3a9869fb2963cee218e296db79d8b612'
         and data['t_lastseen'] != '\N';
 count(*)  
-----------
 230985658
(1 row)

Time: 1332.693 ms

doc=> select count(*)
        from ourdata
       where ingestid = '9b4cbbe4dd1e0d26debb375328c74354'
         and data['t_lastseen'] != '\N';
 count(*)  
-----------
 395407210
(1 row)

Time: 435.651 ms

OK, it seems related to the != condition. For 3a9... it returns 61% of the rows (231M of 375M) vs. 45% (395M of 867M) for 9b4.... Even so, querying 61% of 375M rows should be faster than querying 45% of 867M rows, right?
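
To spell out what that != has to do: a != comparison in SQL already excludes rows where the column is NULL, so the query below is a semantically equivalent rewrite (only an illustration, not necessarily the plan CrateDB actually executes) that makes the per-row null/field-presence check explicit.

-- Illustrative rewrite only: equivalent to data['t_lastseen'] != '\N';
-- the explicit IS NOT NULL is redundant in SQL terms but spells out the
-- null check the engine has to perform for every candidate row.
select count(*)
  from ourdata
 where ingestid = '3a9869fb2963cee218e296db79d8b612'
   and not (data['t_lastseen'] = '\N')
   and data['t_lastseen'] is not null;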

The ourdata table’s composite primary key includes an MD5, so the data is distributed relatively evenly across our five nodes and 32 shards:

ourdata 22 p STARTED 59542887 27.8gb 10.0.111.22 Roteck         
ourdata 30 p STARTED 59558618 27.8gb 10.0.111.21 Hohgant        
ourdata 21 p STARTED 59548595 27.8gb 10.0.111.23 Grande Rochère 
ourdata 17 p STARTED 59551730 27.8gb 10.0.111.22 Roteck         
ourdata 20 p STARTED 59572835 27.8gb 10.0.111.21 Hohgant        
ourdata 9  p STARTED 59557794 27.8gb 10.0.111.25 Setsas         
ourdata 1  p STARTED 59565419 27.8gb 10.0.111.23 Grande Rochère 
ourdata 29 p STARTED 59561467 27.8gb 10.0.111.25 Setsas         
ourdata 19 p STARTED 59560201 27.8gb 10.0.111.25 Setsas         
ourdata 3  p STARTED 59554320 27.8gb 10.0.111.24 Moléson        
ourdata 12 p STARTED 59540949 27.8gb 10.0.111.22 Roteck         
ourdata 16 p STARTED 59555583 27.9gb 10.0.111.23 Grande Rochère 
ourdata 5  p STARTED 59557970 27.8gb 10.0.111.21 Hohgant        
ourdata 31 p STARTED 59551067 27.8gb 10.0.111.23 Grande Rochère 
ourdata 2  p STARTED 59564977 27.8gb 10.0.111.22 Roteck         
ourdata 24 p STARTED 59552692 27.8gb 10.0.111.25 Setsas         
ourdata 6  p STARTED 59554328 27.9gb 10.0.111.23 Grande Rochère 
ourdata 27 p STARTED 59565296 27.8gb 10.0.111.22 Roteck         
ourdata 26 p STARTED 59552178 27.7gb 10.0.111.23 Grande Rochère 
ourdata 11 p STARTED 59551670 27.8gb 10.0.111.23 Grande Rochère 
ourdata 10 p STARTED 59561895 27.8gb 10.0.111.21 Hohgant        
ourdata 18 p STARTED 59551674 27.8gb 10.0.111.24 Moléson        
ourdata 4  p STARTED 59559826 27.8gb 10.0.111.25 Setsas         
ourdata 14 p STARTED 59551315 27.8gb 10.0.111.25 Setsas         
ourdata 8  p STARTED 59559626 27.8gb 10.0.111.24 Moléson        
ourdata 15 p STARTED 59558203 27.8gb 10.0.111.21 Hohgant        
ourdata 25 p STARTED 59564711 27.8gb 10.0.111.21 Hohgant        
ourdata 28 p STARTED 59562530 27.8gb 10.0.111.24 Moléson        
ourdata 13 p STARTED 59557420 27.8gb 10.0.111.24 Moléson        
ourdata 7  p STARTED 59559316 27.7gb 10.0.111.22 Roteck         
ourdata 23 p STARTED 59581922 27.8gb 10.0.111.24 Moléson        
ourdata 0  p STARTED 59562572 27.8gb 10.0.111.21 Hohgant        
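
(For reference, a per-shard listing like the one above can also be produced with a query against CrateDB's sys.shards table; the sketch below uses the documented columns, and the exact column set may vary by version.)

-- Hedged sketch: list per-shard document counts and sizes for the ourdata table.
select table_name, id, num_docs, size, state
  from sys.shards
 where table_name = 'ourdata'
 order by id;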

The table schema is, for the most part:

            SHOW CREATE TABLE doc.ourdata             
-----------------------------------------------------
 CREATE TABLE IF NOT EXISTS "doc"."ourdata" (       
    "bucket" STRING,                                
    "cell" STRING,                                  
    "cell_" STRING INDEX USING FULLTEXT WITH (      
       analyzer = 'simple'                          
    ),                                              
    "data" OBJECT (DYNAMIC) AS (                    
       "i_accuracy" LONG,                           
       "i_devicetime" LONG,                         
       "i_tzoffset" LONG,                           
       "t_deviceid" STRING,                         
       "t_filename_" STRING,                        
       "t_ip" STRING,                               
       "t_lastseen" STRING                          
    ),                                              
    "ingestid" STRING,                              
    "rowid" STRING,                                 
    "shape" GEO_SHAPE INDEX USING GEOHASH WITH (    
       distance_error_pct = 0.025,                  
       precision = '10.0m'                          
    ),                                              
    PRIMARY KEY ("cell", "bucket", "rowid")         
 )                                                  
 CLUSTERED INTO 32 SHARDS                           
 WITH (                                             
    "blocks.metadata" = false,                      
    "blocks.read" = false,                          
    "blocks.read_only" = false,                     
    "blocks.write" = false,                         
    column_policy = 'dynamic',                      
    number_of_replicas = '0',                       
    "recovery.initial_shards" = 'quorum',           
    refresh_interval = 0,                           
    "routing.allocation.enable" = 'all',            
    "routing.allocation.total_shards_per_node" = -1,
    "translog.disable_flush" = false,               
    "translog.flush_threshold_ops" = 2147483647,    
    "translog.flush_threshold_period" = 1800000,    
    "translog.flush_threshold_size" = 209715200,    
    "translog.interval" = 5000,                     
    "translog.sync_interval" = 5000,                
    "unassigned.node_left.delayed_timeout" = 60000, 
    "warmer.enabled" = true                         
 )

Re: the primary key, each ingestid is 1:1 with a bucket, and cell is empty. So, in this table, the primary key depends almost entirely on the rowid, which is an MD5 of lots of row-level data plus a nanosecond timestamp (so the MD5 is effectively a random key).

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
matriv commented, Apr 10, 2017

@nicerobot After investigation we found out that we can do some performance optimization for the != part of the query only if the column involved is created with the NOT NULL constraint. The PR that does it (https://github.com/crate/crate/pull/5289) is merged to master and will be available with the next feature release.

Thanks again for your detailed report!
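
For context, here is a minimal sketch of the kind of schema that optimization applies to, i.e. a column created with a NOT NULL constraint. The table and columns are illustrative only, and whether the constraint can be attached to a sub-column of an OBJECT such as data['t_lastseen'] would need to be checked against the CrateDB docs for the release in question.

-- Hedged sketch: the != optimization only applies to columns declared NOT NULL,
-- which has to happen when the column is created.
CREATE TABLE IF NOT EXISTS doc.example_events (
    "ingestid"   STRING,
    "rowid"      STRING,
    "t_lastseen" STRING NOT NULL,
    PRIMARY KEY ("rowid")
);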

1 reaction
mfussenegger commented, Apr 3, 2017

Okay, I had hoped it would be related to query caching or something, but it seems like != is the problem. As your first query shows, 9b4.. has 867072963 matches, more than twice as many as 3a9..

So for 9b4... the != part of the query has to do more work. We’ll take a closer look at it.
