[SIP-15] Transparent and Consistent Time Intervals
Motivation
Some of the core tenets of any BI tool are consistency and accuracy; however, we have seen instances where the time interval produces misleading or potentially incorrect results due to nuances in how Superset currently i) uses different logic depending on the connector, and ii) mixes datetime resolutions.
The time-interval conundrum
A datetime or timestamp (time for short) interval is defined by a `start` and `end` time and by whether the limits are inclusive (`[`, `]`) or exclusive (`(`, `)`). This leads to four possible time interval definitions:
1. `[start, end]`
2. `[start, end)`
3. `(start, end)`
4. `(start, end]`
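As a concrete illustration (a sketch, not how Superset currently emits SQL, using a hypothetical timestamp column `ts`), the four definitions map to the following predicates for a nominal 24-hour window:

```sql
-- 1. [start, end]  inclusive start, inclusive end
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts <= TIMESTAMP '2018-01-02 00:00:00'
-- 2. [start, end)  inclusive start, exclusive end
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts <  TIMESTAMP '2018-01-02 00:00:00'
-- 3. (start, end)  exclusive start, exclusive end
WHERE ts >  TIMESTAMP '2018-01-01 00:00:00' AND ts <  TIMESTAMP '2018-01-02 00:00:00'
-- 4. (start, end]  exclusive start, inclusive end
WHERE ts >  TIMESTAMP '2018-01-01 00:00:00' AND ts <= TIMESTAMP '2018-01-02 00:00:00'
```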
The underlying issue is that when a user specifies a time range in the UI, it is not apparent which of these definitions Superset is invoking. Sadly the answer depends on the connector (SQLAlchemy or the Druid REST API) and, in the case of SQLAlchemy engines, on the underlying structure of the datasource. This leads to several issues:
- Inconsistent behavior between connectors.
- Incorrect aggregations for certain chart types.
The latter is especially concerning for chart types which obfuscate the temporal component, or for time grains which use aggregation and thus potentially provide incorrect results, i.e., a user may think they are aggregating a week's worth of data (seven days) whereas in reality they are only aggregating six or eight days due to the inclusion/exclusion logic.
Druid REST API
The Druid REST API uses the `[start, end)` definition (2) for time intervals. Although this is not explicitly mentioned in their documentation, it is defined here.
SQLAlchemy
The SQLAlchemy engines use the `[start, end]` definition (1), i.e., the time filter limits are defined via `>=` and `<=` conditions respectively. Note, however, that unbeknownst to the user the filter may behave like `(start, end]` (4). Why? The reason is the potential mixing of dates and timestamps combined with lexicographical ordering when evaluating clauses, i.e., assume that the time column `ds` is defined as a date string; then a filter of the form,
```sql
WHERE
  ds >= '2018-01-01 00:00:00' AND
  ds <= '2018-01-02 00:00:00'
```
for a set of `ds` (datestamp) values of [`2018-01-01`, `2018-01-02`] results in,
```
> SELECT
    '2018-01-01' >= '2018-01-01 00:00:00',
    '2018-01-01' <= '2018-01-02 00:00:00'

FALSE TRUE
```
and
```
> SELECT
    '2018-01-02' >= '2018-01-01 00:00:00',
    '2018-01-02' <= '2018-01-02 00:00:00'

TRUE TRUE
```
respectively. Due to the lexicographical ordering, the `[start` actually acts like `(start`, which is probably not what the user expected.
Note this is especially problematic for relative time periods such as `Last week` (which is relative to today) if your time column is a datestamp, as in most cases the window would only contain six (rather than seven) days of data. Why? Because the `[start, end]` interval behaves like `(start, end]` and the data associated with the `end` date doesn't exist yet. Additionally, making the `end` limit inclusive is actually misleading, as the times are supposed to be relative to today, which implies exclusive of today.
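To make this concrete, here is a sketch of what a `Last week` filter currently produces, assuming (purely for illustration) that today is 2018-01-08 and `ds` is the date-string column from the example above:

```sql
-- "Last week" relative to today (2018-01-08), current [start, end] behavior
WHERE
  ds >= '2018-01-01 00:00:00' AND  -- lexicographically excludes ds = '2018-01-01'
  ds <= '2018-01-08 00:00:00'      -- includes today, but today's data doesn't exist yet
-- net effect: only 2018-01-02 through 2018-01-07 match, i.e., six days of data
```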
Proposed Change
I propose there are two things we need to solve:
- Consistency: Ensure that all connectors and datasources use the same interval definition and that time columns are cast to a time type (timestamps are preferred to strings).
- Transparency: Explicitly call out in the UI what the interval definition is.
Consistency
Which of the four definitions makes the most sense? I propose that Druid's definition of `[start, end)` (2) makes the most sense, as it is guaranteed to capture the entire time interval regardless of the time resolution of the data, i.e., for SQL a 24-hour interval would be expressed as:
```sql
WHERE
  time >= TIMESTAMP '2018-01-01 00:00:00' AND
  time < TIMESTAMP '2018-01-02 00:00:00'
```
The reason not to opt for `[start, end]` (1) is that this 24-hour interval could potentially be expressed as:
```sql
WHERE
  time >= TIMESTAMP '2018-01-01 00:00:00' AND
  time <= TIMESTAMP '2018-01-01 23:59:59'
```
however this assumes that the finest granularity of the `time` column is seconds. In the case of milliseconds it wouldn't capture most of the last second in the 24-hour period. Also, the `[start, end)` definition ensures that adjacent time periods neither overlap nor leave gaps, i.e.,
```
[2018-01-01 00:00:00, 2018-01-02 00:00:00)
[2018-01-02 00:00:00, 2018-01-03 00:00:00)
```
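For example, a row landing exactly on the boundary is counted exactly once under the half-open definition, whereas closed intervals would count it twice (a sketch using a hypothetical timestamp column `ts`):

```sql
-- a row with ts = TIMESTAMP '2018-01-02 00:00:00' matches only the second window:
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts < TIMESTAMP '2018-01-02 00:00:00'  -- no match
WHERE ts >= TIMESTAMP '2018-01-02 00:00:00' AND ts < TIMESTAMP '2018-01-03 00:00:00'  -- match
-- with [start, end] both windows would match, double counting the boundary row
```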
Secondly, most relative time periods (`Last day`, `Last week`, etc.) are relative to today at 12:00:00 am (exceptions include things like `Last 24 hours`, which is relative to now). What's important is that these are implicitly exclusive of the reference time, i.e., we are looking at a previous period, and hence why `end)` really makes the most sense.
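Under the proposed semantics, a relative period such as `Last 7 days` would therefore translate to the following (a sketch assuming, for illustration only, that today is 2018-01-08 and the time column is a hypothetical timestamp `ts`):

```sql
WHERE
  ts >= TIMESTAMP '2018-01-01 00:00:00' AND  -- today minus seven days, inclusive
  ts <  TIMESTAMP '2018-01-08 00:00:00'      -- today (the reference time), exclusive
```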
Finally, how do we address the lexicographical issue caused by mixing date and timestamp strings? Given that we explicitly define the time as a time (rather than a date), we should enforce that all time columns are cast/converted to a timestamp, i.e., in the case of Presto either,
```sql
WHERE
  DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
  DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'
```
(preferred) or
```sql
WHERE
  CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) >= '2018-01-01 00:00:00' AND
  CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) < '2018-01-02 00:00:00'
```
work.
Transparency
I sense we need to improve the UI to explicitly call out:
- The interval definition.
- The relative time.
The interval definition
I sense a tooltip here would suffice, simply mentioning that the start is inclusive and the end is exclusive.
The relative time
Both defaults and custom time periods are relative to some time. The custom time mentions "Relative to today", however it isn't clear whether today means a time (now, 12:00:00 am, etc.) or a date. Furthermore, defaults have no mention of what the reference time is.
In most instances it's today at 12:00:00 am, and there seems to be merit in explicitly calling this out. Additionally there may be merit in having an asterisk (or similar) when a relative time period is chosen, i.e., `Last 7 days *`, to help ground the reference.
Note the mocks below are not correct when referring to relative periods where the time unit is less than a day, i.e., for any quantity of seconds, minutes, or hours (say 48 hours) the reference time is now, and thus the text should update according to the unit selected.
Concerns
I sense there are potentially three major concerns:
- Migrations.
- Performance.
- Dates vs. datetimes (timestamps).
Migrations
If you asked the producer of a chart which of the four time interval definitions was being adhered to, you would get the full gamut of responses, i.e., it's not evident to them exactly what the current logic is, and thus it's not evident to us how we would migrate time intervals which used an absolute time for either the start or the end. I sense the only solution here is to acknowledge that this is a breaking change, which, though challenging, provides more transparency and consistency in the future. An institution would probably want to inform their customers of such a change via a PSA or similar.
Performance
At Airbnb our primary SQLAlchemy engine is Presto, where the underlying tables are partitioned by datestamp (denoted as `ds`). One initial concern I had was that enforcing the time column to represent a timestamp, via a combination of Presto's date and time functions, would require a full table scan, i.e., the query planner would not be able to deduce which partitions to use, which would not be performant.
Running an EXPLAIN on the following query,
```sql
SELECT
  COUNT(1)
FROM
  <table>
WHERE
  DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
  DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'
```
results in a query plan consisting of a filter for only the `2018-01-01` `ds` partition,
```
ScanFilterProject[table = <table>, originalConstraint = (("date_parse"("ds", '%Y-%m-%d') >= "$literal$timestamp"(1514764800000)) AND ("date_parse"("ds", '%Y-%m-%d') < "$literal$t
    Cost: {rows: ?, bytes: ?}/{rows: ?, bytes: ?}/{rows: ?, bytes: ?}
    LAYOUT: <cluster>
    ds := HiveColumnHandle{clientId=<cluster>, name=ds, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY, comment=Optional.empty}
        :: [[2018-01-01]]
```
which means the Presto engine can correctly deduce which partitions to scan. Note I’m unsure if this holds true for all engines.
Dates vs. Datetimes (Timestamps)
Is there merit in differentiating between dates (`2018-01-01`) and datetimes (`2018-01-01 00:00:00`)? Dates are discrete whereas timestamps are continuous, and thus the perception of the interval may differ. Additionally, when we think about date intervals we normally think of `[start, end]` rather than `[start, end)`. For example Q1 is defined as 1 January – 31 March, which is inclusive of both the start and end date, i.e., `[01/01, 03/31]` rather than `[01/01, 04/01)`.
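Note that the two views can be reconciled: an inclusive date interval is equivalent to a half-open timestamp interval whose end is the day after the last date. A sketch, using hypothetical `ds` (date) and `ts` (timestamp) columns:

```sql
-- Q1 as an inclusive date interval: [2018-01-01, 2018-03-31]
WHERE ds >= DATE '2018-01-01' AND ds <= DATE '2018-03-31'
-- the equivalent half-open timestamp interval: [2018-01-01 00:00:00, 2018-04-01 00:00:00)
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts < TIMESTAMP '2018-04-01 00:00:00'
```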
One could argue that Druid is correct in using the `[start, end)` definition as it deals with timestamps, whereas SQLAlchemy datasources, which are probably weighted towards dates, are correct in using the `[start, end]` definition (excluding the issue with lexicographical ordering). There may be merit in adding explicit support for both dates and datetimes (Tableau supports both), which would require additional UI changes.
New or Changed Public Interfaces
Added clarity to the time range widget.
New dependencies
None.
Migration Plan and Compatibility
There are no planned migrations; however, this would be a breaking change.
Rejected Alternatives
None.
to: @betodealmeida @fabianmenges @graceguo-supercat @jeffreythewang @kristw @michellethomas @mistercrunch @timifasubaa @williaster
Top GitHub Comments
+1 on relative expressions showing what they evaluate to instantly in the control
About the inclusive `<=` right bound, I also believe it should be exclusive. One way to do proper change management on this would be to:
- show `<=` or `<` on the control itself
- use `<` for future/new charts
- keep `<=` for all existing charts

That way:
Here's one more example where the current behavior can be surprising and undesired: We built an aggregation pipeline where one day's data is aggregated to `date_trunc('day', timestamp)`. Since that `date_trunc` results in a timestamp where the time part is `00:00:00`, this results in getting an extra day's results in a `[start, end]` time range compared to querying the original data when using a date range. Changing to `[start, end)` would fix this as well.
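A sketch of the situation described in this comment, assuming a hypothetical daily aggregate table `daily_agg` keyed by `day_ts = date_trunc('day', ts)` with a `metric` column:

```sql
-- current inclusive end: also matches the aggregate row whose day_ts is exactly
-- 2018-01-08 00:00:00, returning eight days of aggregates instead of seven
SELECT SUM(metric)
FROM daily_agg
WHERE
  day_ts >= TIMESTAMP '2018-01-01 00:00:00' AND
  day_ts <= TIMESTAMP '2018-01-08 00:00:00';

-- proposed [start, end): the 2018-01-08 row is excluded, matching the raw data
SELECT SUM(metric)
FROM daily_agg
WHERE
  day_ts >= TIMESTAMP '2018-01-01 00:00:00' AND
  day_ts <  TIMESTAMP '2018-01-08 00:00:00';
```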