[SIP-78] Proposal for Advanced Types
See original GitHub issue[SIP-78] Proposal for Advanced Types
Motivation
Our users are leveraging Superset to explore and visualize network security logs. These logs often contain IP, CIDR range, port, hostname, and/or protocol fields, among others. While such fields might be stored as simple INTEGERS or VARCHARS in the backend database, these fields also have a “advanced” type which, if leveraged, can provide additional context and improve the user experience.
User-Friendly Display
A field’s advanced type can influence how that field is interacted with by the users. Consider, for example, an IP field. For performance reasons, we store IP values as INTEGERs in our DW e.g. the IP “127.0.0.1” would be stored as its integer value 2130706433. However, the string representation is substantially more human-readable and, as such, we wish to display IP fields in dot-decimal notation in visualizations and filters. This numeric format is fairly standard (link)and having this display functionality would be valuable in superset in case the underlying DB does not support this conversion functionality out of the box. As another example, a Port field stored as an integer (e.g., 443) could be displayed as a protocol name where such mapping is known (e.g., HTTPS).
Applicable Operators for filters
Certain operators (e.g., LIKE
, =
, <
, IS NULL
, IN
, etc.) may be valid for a database type, but may not be applicable to a advanced type. For example, an organization might decide that the LIKE
operator is not relevant to Port fields. Ideally, we want to be able to configure a list of applicable operators that will appear for a field of a certain advanced type.
Input Validation
Advanced type can be associated with (configurable) input validation and type conversion rules. For example, an input validator for a Port field could validate that the filter string is a parsable number between 0 and 65536, or that it contains known port names such as HTTPS, SSH, FTP.
Human Readable input values
As with displaying the human readable format of a value to the user (i.e the dot notation of an IP) users should be able to enter the more readable form of a value and have it be applied properly to a query. This would mean a user could create a filter such as {col: "ip_col", op: "IN", val: [“1.1.1.1”, “1.1.1.2”]}
and have the query be executed as ip_col IN [16842752, 16842753]
. The functionality for a user to simply enter {col: "ip_col", op: "IN", val: [16842752, 16842753]}
should also be preserved.
Behavior specified in the config
The code to specify the behavior of each advanced type should be defined outside of the superset codebase (i.e the config). This way those who wish to implement a advanced type may do so freely without having to make any code changes to superset.
Proof of concept
Figure 1: a video displaying a proof of concept for this proposal. a version of the modified filters popover is shown. The user inputs both the human readable and numeric values for ports and the conversion is shown. Also it is shown that the user cannot save an invalid value, abc
in this case. The user then saves and submits a valid filter and runs the query. the query is then displayed.
Proposed Changes
We propose that the concept of an “advanced type” be introduced to dataset columns in Superset and that this advanced type be leveraged to allow display formats, applicable operators, and validation rules to be added to fields of a particular advanced type. Advanced types would be a purely optional attribute which could be specified on columns; if a advanced type is not specified for a particular column, there would be no changes to how a column is displayed, what operators are available, or what validation is performed on filter inputs.
Adding the advanced type to dataset
Introduce the concept of a advanced type on dataset columns
- Add a advanced_type attribute to the BaseColumn model. This column only needs to be a nullable VARCHAR.
- Add a “Advanced Type” field to the “Edit Datasource” modal (Figure 1) where dataset creators/editors can set the advanced type for a column. This should be restricted to applicable advanced types
Figure 2: Edit Datasource modal with Advanced Type field
Allow applicable operator sets to be defined for each advanced type
The Explorer UI has an ad-hoc filter which lets the user enter a condition on a column, a value and an operator (>, =, LIKE, IS NULL, etc).
Currently all operators are available to the user. We propose returning a list of applicable operators in the response payload. The list of applicable operators should be configurable for each custom advanced type. For maximum flexibility, the applicable operators should be tied to the advanced type.
Figure 3: Current list of operators (left) vs customized list of operators (right)
Before | After |
---|---|
Allow input validation and value conversion on advanced type
Currently any input string can be entered into an ad-hoc filter. We propose exposing a server REST endpoint to validate input strings and also return the “normalized” representation. The normalized representation should be accessible to the user so they can verify the conversion. This could be displayed via a tooltip in order to minimize UI changes to the control.
Example inputs
- Time range: “Last day” -> “2021-03-16T21:57:33.000Z - 2021-03-17T21:57:33.000Z”
- Numeric IP: “192.168.0.1” -> 3232235521
- Numeric IP: 3232235521 -> 3232235521
- Port: “HTTPS” -> “443”
- Port: “8080” -> “8080”
- Numeric CIDR: “1.1.1.1/16” -> {16842752 - 16908287}
An input validator for a PORT could validate that the string is a parsable number between 0 and 65536 or that it contains known port names such as HTTPS, SSH, FTP. An input validator for a DOMAIN could validate that the string is parsable as a domain and contains only alphanumeric characters, periods (.), and dashes (-).
It should be possible to add additional input validators via a configuration like config.py.
Figure 3: Error message for a filter value that fails validation for a field of type Port
Allow custom behavior for each operator on a advanced type
Ad-hoc filters currently work as would be expected on the backend: a filter is translated directly into an SQL statement that matches the filter, so the filter {col: "ip_col", op: "IN", val: "1.1.1.1/16"}
is translated to the statement ip_col IN '1.1.1.1/16'
. Based on the idea of advanced types, we want the value '1.1.1.1/16'
to first be converted to a decimal range (16842752 - 16908287) as covered above, but just doing the conversion is not enough for this range type as we would get something like this in the best case: ip_col IN [16842752, 16842753, ..., 16908287]
. We would more likely want this: ip_col >= 16842752 AND ip_col <= 16908287
. In order to achieve this, we propose making it possible to specify custom behavior for each defined operator associated with an advanced type. This would allow a lot of flexibility for each type.
Allow for custom display values
Columns with advanced types should (optionally) be displayed in their human readable format (i.e dot decimal notation for IPs even though they are stored as integers). In order to accomplish this, we should generate a list of user-friendly values based on the target column and append it to the result set or replace the original column with these new values. This could be accomplished in the filter
in superset/views/core.py
. We should optimally clean up some technical debt in the backend code by porting it to the v1 API, and while doing so we can introduce the additional functionality needed for this feature (two birds with one stone).
New or Changed Public Interfaces
Changes to the config
In order to implement advanced logic for advanced types we propose adding a collection of AdvancedType objects. These will define the available advanced types and their functionalities. The proposed class will be as follows are as follows:
description:
- a description of the usage of the advanced type
verbose_name:
- name used for displaying purposes, e.g. “IP and CIDR (v4)”
Applicatable_types
- a list of
GenericDataType
types that this advanced type applies to. Using the earlier example of IPs stored as ints, that advanced type would only function on integer value columns
type_translation:
- a function that accepts unparsed values and return: a status indicating whether the provided value was valid, the parsed value(s) based on advanced type, valid operators, and formatted value(s)
filter_translation:
- a function that accepts a SqlaColumn, FilterOperator, and an unparsed value (or values). The function calls the advanced_type_translations function (for appropriate advanced type), and returns a valid [SqlaExpression].
- The purpose of these functions is to take in a valid filter operator and a column and return a valid sqla expression that can be used to form the query.
Model changes
BaseColumn model
- Add advanced type field (Nullable VARCHAR)
Change SqlaTable’s get_sql_query function inside the filter loop so that if a advanced type is present on a column, the operator logic is overridden by an advanced type’s operator logic (via callback defined in advanced_type_filter_translations). This callback’s response will then be appended to the where_and list to produce the query.
Addition of the v1/advanced_types/ endpoints
In order to use any of this logic in the front end we will need to define a set of API endpoints for advanced types. Under this section we would define the following endpoints:
[GET] v1/advanced_type/convert
- Calls the appropriate advanced_type_translation function as defined in the config for the given advanced type.
- Parameters:
- advanced type
- a set of value(s) to be validated/converted
- Returns:
- The output of the function retrieved from advanced_type_translation
[GET] v1/advanced_type/types
- Get the list of implemented advanced types and their descriptions. This list would be defined by the intersection of the set of keys for the advanced_type_translations and advanced_type_filter_translations dictionaries. This would be used in order to populate lists to select advanced types for a given column.
- Parameters:
- None
- Returns:
- A list of implemented advanced types and their descriptions
UI changes
The AdhocFilterEditPopoverSimpleTabContent will need to be modified. If an advanced type is present on a selected column, we use the v1/advanced_type/convert
endpoint to retrieve: an input’s valid/invalid status, the list of operators to display, and the currently selected comparator’s converted value. The valid_operators
list will need to override the current operators list defined by the isOperatorRevelant
function. A tooltip component will also need to be added to the component to display the returned converted value. No behavioral changes should be present if a column without a advanced type is selected.
Migration Plan and Compatibility
A database migration is required. This will only effect the BaseColumn model. We will need to add a varchar field to the columns table.
in order to maintain compatibility we will insure the advanced type column in nullable as well as hiding changes behind the ENABLE_ADVANCED_TYPES feature flag.
Rejected Alternatives
no alternatives have been strongly considered at this time
Note
Please note that the API endpoints are subject to change during the development process. This was the first logical attempt
Issue Analytics
- State:
- Created 2 years ago
- Reactions:14
- Comments:7 (7 by maintainers)
Top GitHub Comments
Thanks for the feedback @betodealmeida @rusackas I really appreciate the interest!
So first addressing the idea of having a plugin system, we are 100% on board and had that in mind as well. We were just unsure if introducing this idea might be a separate SIP, extra feature or something long those lines. This could also address the idea of having some of these validators / types / etc be pre-canned and but also extensible as you both have mentioned, I think this is certainly a good idea.
in in addition
v1/business_type/types
will be used to populate a select control to populate a drop-down to select the business type. just so users don’t get into weird situations selecting types that don’t exist and such.I was unaware about the changes to
BaseColumn
so implementing this on the newColumn
model does make sense to me.at this time I do not have a ton of insight into how this fits into known/proposed/speculated plans for Superset’s semantic layer but am willing to take on any feedback in regards to this.
Thanks again for the feedback!
This is great! A couple comments:
BaseColumn
is going away (https://github.com/apache/superset/issues/14909 and https://github.com/apache/superset/pull/17543) in the next few weeks, so I wonder if we should wait a little and implement this directly in the newColumn
model.