Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

KIS tasks in 0.15.1 RC2 sometimes duplicate rows with the same dimension values

See original GitHub issue

Affected Version

0.15.1 RC2

Description

CSV of Results: query-943282d6-3b66-4881-8080-355b09600870.zip

Note that there are 2 rows for timestamp 2019-08-09T02:38:00.000Z, even though the dimensions have exactly the same values. In fact, for some reason, the rows are fully duplicated for that time period.

This shows up in charts as doubled values, and it happens fairly often. We noticed it when we saw random spikes in charts built on the aggregated data source; the spikes occur whenever Druid duplicates the rows.

[Screenshot: chart showing the duplicated rows as spikes]
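
To gauge how widespread the duplication is, a grouping query along these lines (a sketch added for illustration, not part of the original report) should list every minute that stores more than one row for this pod; with perfect rollup it would return nothing:

-- Count stored rows per minute for one pod; any count > 1 means rows were not rolled up.
-- Grouping by the full dimension list would be stricter; this is only a sketch.
SELECT "__time", COUNT(*) AS stored_rows
FROM "kube-metrics-minute"
WHERE accountid = '52c7e28eebb26a86fb1505b5'
  AND pod = 'alertproc-deployment-65f59b6b8-gc57m'
  AND "__time" >= TIMESTAMP '2019-08-09 00:00:00'
  AND "__time" < TIMESTAMP '2019-08-09 04:00:00'
GROUP BY "__time"
HAVING COUNT(*) > 1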

SQL

SELECT *
FROM "kube-metrics-minute"
WHERE accountid = '52c7e28eebb26a86fb1505b5'
  AND "__time" >= TIMESTAMP '2019-08-09 02:37:00'
  AND "__time" < TIMESTAMP '2019-08-09 02:40:00'
  AND cluster = 'ds' AND etype = 'container' AND namespace = 'default'
  AND name = 'app-alertproc' AND pod = 'alertproc-deployment-65f59b6b8-gc57m'

Generated query (shown in router web console)

{
  "queryType": "scan",
  "dataSource": {
    "type": "table",
    "name": "kube-metrics-minute"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2019-08-09T02:37:00.000Z/2019-08-09T02:40:00.000Z"
    ]
  },
  "virtualColumns": [],
  "resultFormat": "compactedList",
  "batchSize": 20480,
  "limit": 9223372036854776000,
  "order": "none",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "accountid",
        "value": "52c7e28eebb26a86fb1505b5",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "cluster",
        "value": "ds",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "etype",
        "value": "container",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "namespace",
        "value": "default",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "name",
        "value": "app-alertproc",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "pod",
        "value": "alertproc-deployment-65f59b6b8-gc57m",
        "extractionFn": null
      }
    ]
  },
  "columns": [
    "__time",
    "accountid",
    "alloc_cpu",
    "alloc_mem",
    "alloc_pods",
    "cap_cpu",
    "cap_mem",
    "cap_pods",
    "cluster",
    "count",
    "cpu",
    "etype",
    "image",
    "labels",
    "lim_cpu",
    "lim_mem",
    "mem_limit",
    "mem_used",
    "name",
    "namespace",
    "node",
    "node_pool",
    "owner",
    "owner_kind",
    "pod",
    "req_cpu",
    "req_mem",
    "rxBytes",
    "txBytes"
  ],
  "legacy": false,
  "context": {
    "sqlQueryId": "f39f0b78-bb97-4c81-b9dd-7e0e2abf0b46"
  },
  "descending": false,
  "granularity": {
    "type": "all"
  }
}

Supervisor task spec:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "kube-metrics-minute",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "accountid",
            "cluster",
            "name",
            "namespace",
            "etype",
            "labels",
            "owner",
            "owner_kind",
            "node",
            "image",
            "pod",
            "node_pool"
          ]
        }
      }
    },
    "metricsSpec": [
      {
        "name": "count",
        "type": "count"
      },
      {
        "name": "req_cpu",
        "fieldName": "req_cpu",
        "type": "floatMax"
      },
      {
        "name": "req_mem",
        "fieldName": "req_mem",
        "type": "longMax"
      },
      {
        "name": "lim_cpu",
        "fieldName": "lim_cpu",
        "type": "floatMax"
      },
      {
        "name": "lim_mem",
        "fieldName": "lim_mem",
        "type": "longMax"
      },
      {
        "name": "cap_cpu",
        "fieldName": "cap_cpu",
        "type": "floatMax"
      },
      {
        "name": "cap_mem",
        "fieldName": "cap_mem",
        "type": "longMax"
      },
      {
        "name": "cap_pods",
        "fieldName": "cap_pods",
        "type": "floatMax"
      },
      {
        "name": "alloc_cpu",
        "fieldName": "alloc_cpu",
        "type": "floatMax"
      },
      {
        "name": "alloc_mem",
        "fieldName": "alloc_mem",
        "type": "longMax"
      },
      {
        "name": "alloc_pods",
        "fieldName": "alloc_pods",
        "type": "floatMax"
      },
      {
        "name": "cpu",
        "fieldName": "cpu",
        "type": "doubleMax"
      },
      {
        "name": "rxBytes",
        "fieldName": "rxBytes",
        "type": "longMax"
      },
      {
        "name": "txBytes",
        "fieldName": "txBytes",
        "type": "longMax"
      },
      {
        "name": "mem_used",
        "fieldName": "mem_used",
        "type": "longMax"
      },
      {
        "name": "mem_limit",
        "fieldName": "mem_limit",
        "type": "longMax"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": {"type": "period", "period": "PT4H"},
      "queryGranularity": "MINUTE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "kube-metrics",
    "consumerProperties": {
      "bootstrap.servers": "xx"
    },
    "taskCount": 1,
    "replicas": 2,
    "taskDuration": "PT4H"
  }
}

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 22 (21 by maintainers)

Top GitHub Comments

1 reaction
jon-wei commented, Aug 9, 2019

With Kafka ingestion, it’s possible that rollup is not 100%, due to the incremental publishing, e.g.:

  • Kafka task gets a row like {"timestamp":"2019-08-09 14:00:00", "dim1":"hello", "metric":1}
  • Kafka task hits a memory/row/time limit and does an incremental publish
  • If after the incremental publish, another row {"timestamp":"2019-08-09 14:00:00", "dim1":"hello", "metric":2} arrives, it won’t be rolled up into the first already published segment, and a non-aggregating query would return both of those rows

If you run an exact aggregating query like:

SELECT accountid, SUM(cpu)
FROM "kube-metrics-minute"
WHERE accountid = '52c7e28eebb26a86fb1505b5'
  AND "__time" >= TIMESTAMP '2019-08-09 14:00:00'
  AND "__time" < TIMESTAMP '2019-08-09 16:30:00'
  AND cluster = 'ds' AND etype = 'container' AND namespace = 'default'
  AND name = 'app-alertproc' AND pod = 'alertproc-deployment-65f59b6b8-gc57m'
GROUP BY accountid

does the sum on CPU return the expected value?

You could compare the results of that query across the minute/second/30m datasources and see if they match, to rule out imperfect rollup from incremental publishes.
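
As a sketch of that comparison (only kube-metrics-minute is named in this issue, so any other datasource name is an assumption), the same sum can be bucketed per minute with TIME_FLOOR and run against each datasource; if the per-minute sums match, imperfect rollup from incremental publishes is only affecting physical row counts, not aggregated values:

-- Hypothetical comparison query: run it against "kube-metrics-minute" and against the
-- second-granularity datasource (name assumed), then diff the per-minute sums.
-- For a 30-minute datasource, bucket with TIME_FLOOR("__time", 'PT30M') instead.
SELECT TIME_FLOOR("__time", 'PT1M') AS minute_bucket, SUM(cpu) AS total_cpu
FROM "kube-metrics-minute"
WHERE accountid = '52c7e28eebb26a86fb1505b5'
  AND "__time" >= TIMESTAMP '2019-08-09 14:00:00'
  AND "__time" < TIMESTAMP '2019-08-09 16:30:00'
  AND cluster = 'ds' AND etype = 'container' AND namespace = 'default'
  AND name = 'app-alertproc' AND pod = 'alertproc-deployment-65f59b6b8-gc57m'
GROUP BY 1
ORDER BY 1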

0 reactions
repnop commented, Jul 29, 2020

Hi, I’m having the same kind of problem and came across this issue, except that in my case the data isn’t duplicated, only the timestamp, and (being very new to Druid) I’m not sure I can tell what the solution was here. I’m aggregating values with longMax over one-minute intervals, and the duplicate data is causing a pretty big issue since I’m trying to determine a rate and graph it. I attempted to set the intermediate persist interval higher to see if that would help, but it doesn’t seem to. I also tried turning on compaction, but I feel like I must have done that incorrectly since I don’t think the task has actually run once. And since the data is real time, I’m not even sure compaction helps in my case: we want the most up-to-date data possible, and having such large spikes that dwarf the rest of the graph makes it pretty unreadable (as pandas is adding the two values with the same timestamp together). The data is being consumed from a Kafka topic that currently receives entries roughly every 30 seconds for testing. Any help would be appreciated, thank you!

edit: it turns out I was accidentally using longSum instead of longMax in my query (it’s still longMax in the schema), which was combining the values, not pandas, so the returned values are OK now. I’d still like to avoid having duplicate data stored in the segments if possible. As for the compaction task, it looks like it has started running, but it hasn’t removed any of the duplicate values.
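
For reference, a manual compaction task of roughly this shape (a sketch, not taken from this thread; the interval is illustrative and the accepted fields vary somewhat by Druid version) can be submitted to the Overlord to rewrite the segments for an interval, which should roll rows with identical timestamps and dimension values back into one:

{
  "type": "compact",
  "dataSource": "kube-metrics-minute",
  "interval": "2019-08-09T00:00:00.000Z/2019-08-09T04:00:00.000Z"
}

Note that compaction only rewrites segments that have already been published, so it does not help with the interval a Kafka task is still actively writing to.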

Read more comments on GitHub >

Top Results From Across the Web

Removing Duplicate Data with LOD Calculations
Use a FIXED expression to remove the duplicate data. It is necessary to know which dimension in the data source is unique for...
Read more >
Multiple async tasks causing duplicates on SQL Server insert ...
The problem is occasionally this data is inserted at the exact same time which causes a duplicate row to get inserted.
Read more >
If you thought you had perfect rollups before, you might have ...
In Apache Druid, you can roll up duplicate rows into a single row ... A rollup includes rows with the same timestamp and...
Read more >
Fixing Duplicate Records in Tableau - OneNumber
Sometimes these duplicates can be solved by creating a join with multiple conditions (i.e. joining on Store ID and Date) but sometimes they...
Read more >
Finding Duplicate and Repeated Rows to Clean Data - SILOTA
A common mechanism for defending against duplicate rows in a database table is to put a unique index on the column. However, at...
Read more >
