Allow for union of query data sources
Description
There are two concepts of union in Druid. There is UNION ALL in SQL, which concatenates the results of two or more SQL queries. This exists only within Druid SQL. There is also the notion of a union data source within a native Druid query. In this instance, Druid will run a query over the raw data of the two or more data sources as if they are one. For this feature request, let us consider this second use case of union.
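For reference, a native query over a union data source might look like the sketch below. The data source names `example_usage_a` and `example_usage_b` are hypothetical placeholders, and the column names are borrowed from the proposal later in this issue; both data sources are assumed to share those columns.

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "union",
    "dataSources": ["example_usage_a", "example_usage_b"]
  },
  "granularity": "day",
  "dimensions": ["country"],
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
```

Here the members of the union are plain table names, which is the case Druid already supports.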
It is also possible to use a query as a data source. This allows the results of one query to be used as the data source for another query. Let us refer to this kind of data source as a query data source. This is used for nested groupBys and is currently supported only for groupBys.
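As an illustration, a nested groupBy that uses a query data source could be sketched as follows. The inner data source name `example_usage_a` is a hypothetical placeholder, and the dimensions and metrics are assumed for the example only.

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "example_usage_a",
      "granularity": "day",
      "dimensions": ["country", "device"],
      "aggregations": [
        { "type": "longSum", "name": "user_count", "fieldName": "user_count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
    }
  },
  "granularity": "day",
  "dimensions": ["country"],
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
```

The outer query aggregates over the rows produced by the inner query, so the outer `fieldName` values refer to the inner query's output names.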
This feature request is for the ability to union two or more query data sources. This is effectively combining items 2 and 3 on this page: Datasources.
As a suggestion for implementation, a query could be of the form:
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "union",
    "dataSources": [
      {
        "type": "query",
        "query": {
          "queryType": "groupBy",
          ...
        }
      },
      {
        "type": "query",
        "query": {
          "queryType": "groupBy",
          ...
        }
      }
    ]
  },
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}
Motivation
It is currently possible to union two or more data sources, and it is currently possible to use a query as a data source. However, it is not possible to union two or more query data sources, so supporting this is a logical next step.
Specifically, this would allow aggregations and post-aggregations over two distinct result sets that share the same features.
Issue Analytics
- Created 4 years ago
- Reactions: 11
- Comments: 5
This would be a very useful feature in certain use cases. +1
Can it be achieved by other native queries?