Missing DefaultDimensionSpec in Druid query when querying with multiple Lookup dimensions.
When we query Maha with multiple lookup dimensions in the select fields, some of the fields are returned as null in the results. The underlying Druid query that Maha issued did not list the affected dimensions in its default dimension specs.
To elaborate, the Maha query we issued has the following format.
Maha Query
{
"cube": "test_cube",
"rowsPerPage": 1000,
"selectFields": [
{
"field": "Adunit ID"
},
{
"field": "Adunit Name" //Lookup based on Adunit ID
},
{
"field": "Adgroup ID"
},
{
"field": "Adgroup Name" //Lookup based on Adgroup ID
},
{
"field": "Adserver Requests"
}
],
"filterExpressions": [
{
"field": "Publisher ID",
"operator": "=",
"value": "xxxxxxxxxAAAAAABBBBBBBBBB"
},
{
"field": "Day",
"operator": "Between",
"from": "2020-09-17",
"to": "2020-09-17"
}
]
}
Output
{
"header": {
"cube": "test_cube",
"fields": [
{
"fieldName": "Adunit ID",
"fieldType": "DIM"
},
{
"fieldName": "Adunit Name",
"fieldType": "DIM"
},
{
"fieldName": "Adgroup ID",
"fieldType": "DIM"
},
{
"fieldName": "Adgroup Name",
"fieldType": "DIM"
},
{
"fieldName": "Adserver Requests",
"fieldType": "FACT"
}
],
"maxRows": 1000,
"debug": {}
},
"rows": [
[
"dddddddddddAdunitI1dddddddddddddd",
"dddddddddddAdunitI1 Name dddddddddddddd",
null, // _Missing adgroup_id_
"dddddddddddAdgroup1 Name dddddddddddddd",
0
],
[
"dddddddddddAdunitId2ddddddddddddd",
"dddddddddddAdunitId2 Name ddddddddddddd",
null, // _Missing adgroup_id_
"dddddddddddAdgroup2 Name dddddddddddddd",
1
]
],
"curators": {}
}
Druid Query Created by Maha
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "test_cube"
},
"intervals": {
"type": "intervals",
"intervals": [
"2020-09-17T00:00:00.000Z/2020-09-18T00:00:00.000Z"
]
},
"virtualColumns": [],
"filter": {
"type": "and",
"fields": [
{
"type": "or",
"fields": [
{
"type": "selector",
"dimension": "__time",
"value": "2020-09-17",
"extractionFn": {
"type": "timeFormat",
"format": "YYYY-MM-dd",
"timeZone": "UTC",
"granularity": {
"type": "none"
},
"asMillis": false
}
}
]
},
{
"type": "selector",
"dimension": "pubId",
"value": "xxxxxxxxxAAAAAABBBBBBBBBB"
}
]
},
"granularity": {
"type": "all"
},
"dimensions": [ // No adgroup_id added here.
{
"type": "default",
"dimension": "adunitId",
"outputName": "Adunit ID",
"outputType": "STRING"
},
{
"type": "extraction",
"dimension": "adgroupId",
"outputName": "Adgroup Name",
"outputType": "STRING",
"extractionFn": {
"type": "registeredLookup",
"lookup": "adgroup_names",
"retainMissingValue": false,
"replaceMissingValueWith": "UNKNOWN"
}
},
{
"type": "extraction",
"dimension": "adunitId",
"outputName": "Adunit Name",
"outputType": "STRING",
"extractionFn": {
"type": "registeredLookup",
"lookup": "adunit_names",
"retainMissingValue": false,
"replaceMissingValueWith": "UNKNOWN"
}
}
],
"aggregations": [
{
"type": "longSum",
"name": "Adserver Requests",
"fieldName": "adserverRequests"
}
],
"postAggregations": [],
"limitSpec": {
"type": "default",
"columns": [],
"limit": 10000000
},
"context": {
"groupByStrategy": "v2",
"applyLimitPushDown": "false",
"implyUser": "internal_user",
"priority": 10,
"userId": "internal_user",
"uncoveredIntervalsLimit": 1,
"groupByIsSingleThreaded": true,
"timeout": 900000,
"queryId": "9292389f-2a7f-4e12-a39a-6f727097ab92"
},
"descending": false
}
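One way to spot this kind of gap without reading the whole generated query is to compare the requested field names against the output names the Druid query actually produces. The following is a minimal sketch, not part of Maha: the helper name `missing_outputs` is hypothetical, and the dictionaries are trimmed copies of the request and query shown above.

```python
def missing_outputs(select_fields, druid_query):
    """Return requested field names that no dimension spec or aggregation emits."""
    produced = {d["outputName"] for d in druid_query.get("dimensions", [])}
    produced |= {a["name"] for a in druid_query.get("aggregations", [])}
    produced |= {p["name"] for p in druid_query.get("postAggregations", [])}
    return [f for f in select_fields if f not in produced]


# Trimmed copy of the Maha request fields and the generated Druid query above.
select_fields = [
    "Adunit ID",
    "Adunit Name",
    "Adgroup ID",
    "Adgroup Name",
    "Adserver Requests",
]
druid_query = {
    "dimensions": [
        {"type": "default", "dimension": "adunitId", "outputName": "Adunit ID"},
        {"type": "extraction", "dimension": "adgroupId", "outputName": "Adgroup Name"},
        {"type": "extraction", "dimension": "adunitId", "outputName": "Adunit Name"},
    ],
    "aggregations": [
        {"type": "longSum", "name": "Adserver Requests", "fieldName": "adserverRequests"}
    ],
    "postAggregations": [],
}

print(missing_outputs(select_fields, druid_query))  # ['Adgroup ID']
```

For the query in this issue, the only requested field with no matching output name is "Adgroup ID", which is the column that comes back as null.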
While debugging, I came across the variable factRequestCols (a Set of Strings) at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L353. It is passed to the call at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381, where the getDimensions method creates the Druid query's dimension specs from the factRequestCols it receives.
I do not fully understand how the factRequestCols set is built, but the adgroup_id dimension is not included in it, so it is not added to the Druid query's dimension specs either.
I locally overrode the code to pass queryContext.factBestCandidate.requestCols to the getDimensions method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381. That fixed the issue: the dimension spec now appears in the Druid query, and the adgroup_id values are present in the output.
Output after the change
{
"header": {
"cube": "platform_performance_cube",
"fields": [
{
"fieldName": "Adunit ID",
"fieldType": "DIM"
},
{
"fieldName": "Adunit Name",
"fieldType": "DIM"
},
{
"fieldName": "Adgroup ID",
"fieldType": "DIM"
},
{
"fieldName": "Adgroup Name",
"fieldType": "DIM"
},
{
"fieldName": "Adserver Requests",
"fieldType": "FACT"
}
],
"maxRows": 1000,
"debug": {}
},
"rows": [
[
"dddddddddddAdunitI1dddddddddddddd",
"dddddddddddAdunitI1 Name dddddddddddddd",
"dddddddddddAdgroup1dddddddddddddd",
"dddddddddddAdgroup1 Name dddddddddddddd",
0
],
[
"dddddddddddAdunitId2ddddddddddddd",
"dddddddddddAdunitId2 Name ddddddddddddd",
"dddddddddddAdgroup2dddddddddddddd",
"dddddddddddAdgroup2 Name dddddddddddddd",
1
]
],
"curators": {}
}
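The effect of handing a narrower column set to the spec builder can be illustrated with a toy model. This is a sketch under stated assumptions, not Maha's Scala code: `COLUMNS` and `build_dimension_specs` are hypothetical names, the catalogue only mirrors the dimensions and lookups visible in the queries above, and the narrow set is taken from the observed query rather than from the real factRequestCols logic.

```python
# Hypothetical column catalogue mirroring the queries in this issue:
# alias -> (Druid dimension, lookup name, or None for a plain column).
COLUMNS = {
    "Adunit ID": ("adunitId", None),
    "Adunit Name": ("adunitId", "adunit_names"),
    "Adgroup ID": ("adgroupId", None),
    "Adgroup Name": ("adgroupId", "adgroup_names"),
}


def build_dimension_specs(request_cols):
    """Emit one dimension spec per requested alias found in COLUMNS."""
    specs = []
    for alias in sorted(request_cols):
        if alias not in COLUMNS:
            continue  # metrics such as "Adserver Requests" become aggregations
        dimension, lookup = COLUMNS[alias]
        spec = {"dimension": dimension, "outputName": alias, "outputType": "STRING"}
        if lookup is None:
            spec["type"] = "default"
        else:
            spec["type"] = "extraction"
            spec["extractionFn"] = {"type": "registeredLookup", "lookup": lookup}
        specs.append(spec)
    return specs


all_request_cols = {
    "Adunit ID",
    "Adunit Name",
    "Adgroup ID",
    "Adgroup Name",
    "Adserver Requests",
}
# The narrower set as observed in the generated query: "Adgroup ID" is absent.
narrow_cols = all_request_cols - {"Adgroup ID"}

narrow_names = [s["outputName"] for s in build_dimension_specs(narrow_cols)]
full_names = [s["outputName"] for s in build_dimension_specs(all_request_cols)]
print(narrow_names)  # ['Adgroup Name', 'Adunit ID', 'Adunit Name']
print(full_names)    # ['Adgroup ID', 'Adgroup Name', 'Adunit ID', 'Adunit Name']
```

In this model the builder is only as complete as the set it receives, which matches the behaviour described above: passing the full request columns yields a default spec for "Adgroup ID", while the narrower set silently omits it.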
Could you please explain the context behind how the factRequestCols set is built, and let me know whether that logic, or anything else, needs to change to include the missing dimensions?
Issue Analytics
- Created 3 years ago
- Comments:10
Top GitHub Comments
Got it. Thank you!
@upendrareddy Unless you have a dim-driven use case (for example, entity management in the UI, such as a Campaign Management view), the queries should all be fact-driven, so I don’t see any issues with doing it for all queries.