[SIP-15] Transparent and Consistent Time Intervals
Motivation
Some of the core tenets of any BI tool are consistency and accuracy; however, we have seen instances where the time interval produces misleading or potentially incorrect results due to nuances in how Superset currently i) uses different logic depending on the connector, and ii) mixes datetime resolutions.
The time-interval conundrum
A datetime or timestamp (time for short) interval is defined by a `start` and `end` time and by whether the limits are inclusive (`[`, `]`) or exclusive (`(`, `)`). This leads to four possible time interval definitions:
1. `[start, end]`
2. `[start, end)`
3. `(start, end)`
4. `(start, end]`
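As a concrete illustration (a sketch, not how Superset currently emits SQL, using a hypothetical timestamp column `ts`), the four definitions map to the following predicates for a nominal 24-hour window:

```sql
-- 1. [start, end]  inclusive start, inclusive end
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts <= TIMESTAMP '2018-01-02 00:00:00'
-- 2. [start, end)  inclusive start, exclusive end
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts <  TIMESTAMP '2018-01-02 00:00:00'
-- 3. (start, end)  exclusive start, exclusive end
WHERE ts >  TIMESTAMP '2018-01-01 00:00:00' AND ts <  TIMESTAMP '2018-01-02 00:00:00'
-- 4. (start, end]  exclusive start, inclusive end
WHERE ts >  TIMESTAMP '2018-01-01 00:00:00' AND ts <= TIMESTAMP '2018-01-02 00:00:00'
```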
The underlying issue is that when a user specifies a time range in the UI, it is not apparent which of these definitions Superset is invoking. Sadly the answer depends on the connector (SQLAlchemy or the Druid REST API) and, in the case of SQLAlchemy engines, on the underlying structure of the datasource. This leads to several issues:
- Inconsistent behavior between connectors.
- Incorrect aggregations for certain chart types.
The latter is especially concerning for chart types which obfuscate the temporal component, or for time grains which use aggregation and thus potentially provide incorrect results, i.e., a user may think they are aggregating a week's worth of data (seven days) whereas in reality they are only aggregating six or eight days due to the inclusion/exclusion logic.
Druid REST API
The Druid REST API uses the `[start, end)` definition (2) for time intervals. Although this is not explicitly mentioned in their documentation, it is defined here.
SQLAlchemy
The SQLAlchemy engines use the `[start, end]` definition (1), i.e., the time filter limits are defined via `>=` and `<=` conditions respectively. Note, however, that unbeknownst to the user the filter may behave like `(start, end]` (4). Why? The reason is the potential mixing of dates and timestamps combined with lexicographical ordering when evaluating clauses, i.e., assume that the time column `ds` is defined as a date string; then a filter of the form,
```sql
WHERE
  ds >= '2018-01-01 00:00:00' AND
  ds <= '2018-01-02 00:00:00'
```
for a set of `ds` (datestamp) values of [`2018-01-01`, `2018-01-02`] results in,
```
> SELECT
    '2018-01-01' >= '2018-01-01 00:00:00',
    '2018-01-01' <= '2018-01-02 00:00:00'

FALSE TRUE
```
and
```
> SELECT
    '2018-01-02' >= '2018-01-01 00:00:00',
    '2018-01-02' <= '2018-01-02 00:00:00'

TRUE TRUE
```
respectively. Due to the lexicographical ordering, the `[start` actually acts like `(start`, which is probably not what the user expected.
Note this is especially problematic for relative time periods such as `Last week` (which is relative to today) if your time column is a datestamp, as in most cases the window would only contain six (rather than seven) days of data. Why? Because the `[start, end]` interval behaves like `(start, end]` and the data associated with the `end` date doesn't exist yet. Additionally, making the `end` limit inclusive is actually misleading, as the times are supposed to be relative to today, which implies exclusive of today.
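To make this concrete, here is a sketch of what a `Last week` filter currently produces, assuming (purely for illustration) that today is 2018-01-08 and `ds` is the date-string column from the example above:

```sql
-- "Last week" relative to today (2018-01-08), current [start, end] behavior
WHERE
  ds >= '2018-01-01 00:00:00' AND  -- lexicographically excludes ds = '2018-01-01'
  ds <= '2018-01-08 00:00:00'      -- includes today, but today's data doesn't exist yet
-- net effect: only 2018-01-02 through 2018-01-07 match, i.e., six days of data
```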
Proposed Change
I propose there are two things we need to solve:
- Consistency: Ensure that all connectors and datasources use the same interval definition and that time columns are cast to a time type (timestamps are preferred to strings).
- Transparency: Explicitly call out in the UI what the interval definition is.
Consistency
Which of the four definitions makes the most sense? I propose that Druid's definition of `[start, end)` (2) makes the most sense, as it is guaranteed to capture the entire time interval regardless of the time resolution of the data, i.e., for SQL a 24-hour interval would be expressed as:
```sql
WHERE
  time >= TIMESTAMP '2018-01-01 00:00:00' AND
  time < TIMESTAMP '2018-01-02 00:00:00'
```
The reason not to opt for `[start, end]` (1) is that this 24-hour interval could potentially be expressed as:
```sql
WHERE
  time >= TIMESTAMP '2018-01-01 00:00:00' AND
  time <= TIMESTAMP '2018-01-01 23:59:59'
```
however this assumes that the finest granularity of the `time` column is seconds. In the case of milliseconds it wouldn't capture most of the last second in the 24-hour period. Also, the `[start, end)` definition ensures that adjacent time periods neither overlap nor leave gaps, i.e.,
```
[2018-01-01 00:00:00, 2018-01-02 00:00:00)
[2018-01-02 00:00:00, 2018-01-03 00:00:00)
```
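For example, a row landing exactly on the boundary is counted exactly once under the half-open definition, whereas closed intervals would count it twice (a sketch using a hypothetical timestamp column `ts`):

```sql
-- a row with ts = TIMESTAMP '2018-01-02 00:00:00' matches only the second window:
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts < TIMESTAMP '2018-01-02 00:00:00'  -- no match
WHERE ts >= TIMESTAMP '2018-01-02 00:00:00' AND ts < TIMESTAMP '2018-01-03 00:00:00'  -- match
-- with [start, end] both windows would match, double counting the boundary row
```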
Secondly, most relative time periods (`Last day`, `Last week`, etc.) are relative to today at 12:00:00 am (exceptions include things like `Last 24 hours`, which is relative to now). What's important is that these are implicitly exclusive of the reference time, i.e., we are looking at a previous period, and hence why `end)` really makes the most sense.
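Under the proposed semantics, a relative period such as `Last 7 days` would therefore translate to the following (a sketch assuming, for illustration only, that today is 2018-01-08 and the time column is a hypothetical timestamp `ts`):

```sql
WHERE
  ts >= TIMESTAMP '2018-01-01 00:00:00' AND  -- today minus seven days, inclusive
  ts <  TIMESTAMP '2018-01-08 00:00:00'      -- today (the reference time), exclusive
```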
Finally, how do we address the lexicographical issue caused by mixing date and timestamp strings? Given that we explicitly define the time as a time (rather than a date), we should enforce that all time columns are cast/converted to a timestamp, i.e., in the case of Presto either,
```sql
WHERE
  DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
  DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'
```
(preferred) or
```sql
WHERE
  CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) >= '2018-01-01 00:00:00' AND
  CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) < '2018-01-02 00:00:00'
```
work.
Transparency
I sense we need to improve the UI to explicitly call out:
- The interval definition.
- The relative time.
The interval definition
I sense a tooltip here would suffice, simply mentioning that the start is inclusive and the end is exclusive.
The relative time
Both defaults and custom time periods are relative to some time. The custom time mentions "Relative to today", however it isn't clear whether today means a time (now, 12:00:00 am, etc.) or a date. Furthermore, defaults have no mention of what the reference time is.
In most instances it's today at 12:00:00 am, and there seems to be merit in explicitly calling this out. Additionally there may be merit in having an asterisk (or similar) when a relative time period is chosen, i.e., `Last 7 days *`, to help ground the reference.
Note the mocks below are not correct when referring to relative periods where the time unit is less than a day, i.e., for any quantity of seconds, minutes, or hours (say 48 hours) the reference time is now, and thus the text should update according to the unit selected.
Concerns
I sense there are potentially three major concerns:
- Migrations.
- Performance.
- Dates vs. datetimes (timestamps).
Migrations
If you asked the producer of a chart which of the four time interval definitions was being adhered to, you would get the full gamut of responses, i.e., it's not evident to them exactly what the current logic is, and thus it's not evident to us how we would migrate time intervals which used an absolute time for either the start or the end. I sense the only solution here is to acknowledge that this is a breaking change, which, though challenging, provides more transparency and consistency in the future. An institution would probably want to inform their customers of such a change via a PSA or similar.
Performance
At Airbnb our primary SQLAlchemy engine is Presto, where the underlying tables are partitioned by datestamp (denoted as `ds`). One initial concern I had was that enforcing the time column to represent a timestamp, via a combination of Presto's date and time functions, would require a full table scan, i.e., the query planner would not be able to deduce which partitions to use, which would not be performant.
Running an EXPLAIN on the following query,
```sql
SELECT
  COUNT(1)
FROM
  <table>
WHERE
  DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
  DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'
```
results in a query plan consisting of a filter for only the `2018-01-01` `ds` partition,
```
ScanFilterProject[table = <table>, originalConstraint = (("date_parse"("ds", '%Y-%m-%d') >= "$literal$timestamp"(1514764800000)) AND ("date_parse"("ds", '%Y-%m-%d') < "$literal$t
    Cost: {rows: ?, bytes: ?}/{rows: ?, bytes: ?}/{rows: ?, bytes: ?}
    LAYOUT: <cluster>
    ds := HiveColumnHandle{clientId=<cluster>, name=ds, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY, comment=Optional.empty}
        :: [[2018-01-01]]
```
which means the Presto engine can correctly deduce which partitions to scan. Note I’m unsure if this holds true for all engines.
Dates vs. Datetimes (Timestamps)
Is there merit in differentiating between dates (`2018-01-01`) and datetimes (`2018-01-01 00:00:00`)? Dates are discrete whereas timestamps are continuous, and thus the perception of the interval may differ. Additionally, when we think about date intervals we normally think of `[start, end]` rather than `[start, end)`. For example Q1 is defined as 1 January – 31 March, which is inclusive of both the start and end date, i.e., `[01/01, 03/31]` rather than `[01/01, 04/01)`.
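Note that the two views can be reconciled: an inclusive date interval is equivalent to a half-open timestamp interval whose end is the day after the last date. A sketch, using hypothetical `ds` (date) and `ts` (timestamp) columns:

```sql
-- Q1 as an inclusive date interval: [2018-01-01, 2018-03-31]
WHERE ds >= DATE '2018-01-01' AND ds <= DATE '2018-03-31'
-- the equivalent half-open timestamp interval: [2018-01-01 00:00:00, 2018-04-01 00:00:00)
WHERE ts >= TIMESTAMP '2018-01-01 00:00:00' AND ts < TIMESTAMP '2018-04-01 00:00:00'
```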
One could argue that Druid is correct in using the `[start, end)` definition as it deals with timestamps, whereas SQLAlchemy datasources, which are probably weighted towards dates, are correct in using the `[start, end]` definition (excluding the issue with lexicographical ordering). There may be merit in adding explicit support for both dates and datetimes (Tableau supports both), which would require additional UI changes.
New or Changed Public Interfaces
Added clarity to the time range widget.
New dependencies
None.
Migration Plan and Compatibility
There are no planned migrations; however, this would be a breaking change.
Rejected Alternatives
None.
to: @betodealmeida @fabianmenges @graceguo-supercat @jeffreythewang @kristw @michellethomas @mistercrunch @timifasubaa @williaster
Top GitHub Comments
+1 on relative expressions showing what they evaluate to instantly in the control
About the inclusive `<=` right bound, I also believe it should be exclusive. One way to do proper change management on this would be to:
- show `<=` or `<` on the control itself
- use `<` for future/new charts
- keep `<=` for all existing charts

That way:
Here's one more example where the current behavior can be surprising and undesired: We built an aggregation pipeline where one day's data is aggregated to `date_trunc('day', timestamp)`. Since that `date_trunc` results in a timestamp where the time part is `00:00:00`, this results in getting an extra day's results in a `[start, end]` time range compared to querying the original data when using a date range. Changing to `[start, end)` would fix this as well.
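A sketch of the situation described in this comment, assuming a hypothetical daily aggregate table `daily_agg` keyed by `day_ts = date_trunc('day', ts)` with a `metric` column:

```sql
-- current inclusive end: also matches the aggregate row whose day_ts is exactly
-- 2018-01-08 00:00:00, returning eight days of aggregates instead of seven
SELECT SUM(metric)
FROM daily_agg
WHERE
  day_ts >= TIMESTAMP '2018-01-01 00:00:00' AND
  day_ts <= TIMESTAMP '2018-01-08 00:00:00';

-- proposed [start, end): the 2018-01-08 row is excluded, matching the raw data
SELECT SUM(metric)
FROM daily_agg
WHERE
  day_ts >= TIMESTAMP '2018-01-01 00:00:00' AND
  day_ts <  TIMESTAMP '2018-01-08 00:00:00';
```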