should we normalize the scheme and hostname of a dataset URI to lowercase?
datasets are defined by their URI.
we don’t really spell out what exactly our URI standard is, apart from reserving the airflow scheme and requiring ASCII.
for websites, the scheme and hostname are case insensitive, while everything else is not.
should we normalize scheme and hostname or allow case sensitive differentiation?
my inclination is, we should allow everything to be case sensitive, except possibly the scheme. and the reason is, we don’t know exactly what “hostname” will mean for a dataset. if, for example, it’s a database object, it could be case sensitive.
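to make the two options concrete, here is a minimal sketch using only the Python standard library. the function name normalize_uri and its lower_host flag are made up for illustration and are not part of Airflow. with the default, only the scheme is lowercased (the option preferred above); with lower_host=True the host is lowercased as well, which is what URL-style normalization would do.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_uri(uri: str, *, lower_host: bool = False) -> str:
    """Lowercase the scheme of a URI and, optionally, its host.

    Hypothetical helper for illustration only. Path, query and
    fragment are always left untouched.
    """
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    netloc = parts.netloc
    if lower_host:
        # Keep any userinfo ("user:pass@") as-is; lowercase only host[:port].
        userinfo, at, hostport = netloc.rpartition("@")
        netloc = f"{userinfo}{at}{hostport.lower()}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_uri("S3://My-Bucket/Key.CSV"))                   # s3://My-Bucket/Key.CSV
print(normalize_uri("S3://My-Bucket/Key.CSV", lower_host=True))  # s3://my-bucket/Key.CSV
```

note that with lower_host=True, two datasets whose “hostname” is really a case-sensitive object name would collapse into one, which is exactly the concern raised above.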
if we do implement some normalization, we introduce somewhat of a problem when we do sqlalchemy lookups by URI, because we can’t just do Dataset.uri == uri; we’d have to normalize the incoming URI value first. one possibility is to split out the scheme into a different column in the db that has case-insensitive (CI) collation, which would avoid this issue, though it would force you to decompose the URI. but this is messy too.
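a rough sketch of the split-column idea, using the standard-library sqlite3 module instead of SQLAlchemy so that it runs on its own. the table layout, the column names (scheme, rest) and the helper functions are assumptions for illustration, not Airflow’s actual schema. SQLite’s built-in NOCASE collation stands in for whatever CI collation the real database offers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset (
        id     INTEGER PRIMARY KEY,
        scheme TEXT NOT NULL COLLATE NOCASE,  -- compared case-insensitively
        rest   TEXT NOT NULL,                 -- compared case-sensitively
        UNIQUE (scheme, rest)
    )
    """
)


def split_uri(uri: str) -> tuple[str, str]:
    """Decompose a URI into (scheme, everything after the first colon)."""
    scheme, sep, rest = uri.partition(":")
    if not sep:
        raise ValueError(f"URI has no scheme: {uri!r}")
    return scheme, rest


def add_dataset(db: sqlite3.Connection, uri: str) -> int:
    """Insert a dataset and return its row id."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO dataset (scheme, rest) VALUES (?, ?)", split_uri(uri)
        )
    return cur.lastrowid


def find_dataset(db: sqlite3.Connection, uri: str):
    """Return the row id for a URI, or None; no normalization of the input."""
    row = db.execute(
        "SELECT id FROM dataset WHERE scheme = ? AND rest = ?", split_uri(uri)
    ).fetchone()
    return row[0] if row else None


add_dataset(conn, "s3://Bucket/Key")
print(find_dataset(conn, "S3://Bucket/Key"))  # matches: only the scheme case differs
print(find_dataset(conn, "s3://bucket/Key"))  # None: the rest stays case sensitive
```

the lookup no longer needs to normalize the incoming value, but every caller now has to decompose the URI first, which is the messiness mentioned above.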
Issue Analytics
- Created a year ago
- Comments:24 (24 by maintainers)
Yep. It’s a mess. But I think if we want to be compliant with OpenLineage and others, being compliant with the RFC is best. The worst we can do is invent yet another “our” interpretation of the standard:
Haha nice