Composite and partial unique constraints
_See previous discussion on https://github.com/frictionlessdata/goodtables-py/pull/252#issuecomment-369336642_
The use case is where you want to validate that multiple fields are unique as a whole. Given the following table:
| id_1 | id_2 |
|---|---|
| 1 | 1 |
| 1 | |
| | 1 |
Both `id_1` and `id_2` are nullable, but we want `(id_1, id_2)` to be unique. None of the following rows could be added to this table:
| id_1 | id_2 |
|---|---|
| 1 | 1 |
| 1 | |
| | 1 |
But these would be valid:
| id_1 | id_2 |
|---|---|
| 1 | 2 |
| 2 | 1 |
| 2 | |
| | 2 |
If this were SQL, I would do this by creating the following indexes (each index needs its own name):
CREATE UNIQUE INDEX my_table_unique_id_1_id_2 ON my_table (id_1, id_2);
CREATE UNIQUE INDEX my_table_unique_id_1 ON my_table (id_1) WHERE id_2 IS NULL;
CREATE UNIQUE INDEX my_table_unique_id_2 ON my_table (id_2) WHERE id_1 IS NULL;
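The behaviour of these indexes can be tried out with Python's built-in `sqlite3` module, since SQLite supports partial unique indexes. This is only a minimal sketch: the index names (`uq_both`, `uq_id_1_only`, `uq_id_2_only`) and the helper `try_insert` are made up for illustration, and the starting rows assume that the blank cells in the first table stand for `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id_1 INTEGER, id_2 INTEGER);
    CREATE UNIQUE INDEX uq_both ON my_table (id_1, id_2);
    CREATE UNIQUE INDEX uq_id_1_only ON my_table (id_1) WHERE id_2 IS NULL;
    CREATE UNIQUE INDEX uq_id_2_only ON my_table (id_2) WHERE id_1 IS NULL;
    INSERT INTO my_table VALUES (1, 1), (1, NULL), (NULL, 1);
""")


def try_insert(row):
    """Return True if the row is accepted, False if a unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO my_table VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Rows that repeat an existing key, including keys with a NULL component.
rejected = [try_insert(r) for r in [(1, 1), (1, None), (None, 1)]]
# Rows whose keys are new.
accepted = [try_insert(r) for r in [(1, 2), (2, 1), (2, None), (None, 2)]]
print(rejected, accepted)
```

Under these assumptions the first three inserts are rejected and the last four are accepted, which matches the two tables above.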
Currently there’s no way of doing this with Table Schema. For the simple case where we just want `(id_1, id_2)` to be unique, we could follow a pattern similar to `primaryKey`, like:
"schema": {
"unique": ["id_1", "id_2"]
}
But it wouldn’t work for partial indexes. In that case, we also need a WHERE clause. Maybe something like:
"schema": {
"unique": [
{ "fields": ["id_1", "id_2"] },
{ "fields": ["id_1"], "where": "id_2 IS NULL" },
{ "fields": ["id_2"], "where": "id_1 IS NULL" }
]
}
But this can grow the implementation complexity pretty quickly, as now we have to handle the `where` clause somehow.
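To give a feel for that complexity, here is a rough sketch of a validator for the proposed `unique` list. Everything in it is hypothetical: `parse_where` and `find_violations` are invented names, and the `where` support is deliberately limited to `<field> IS NULL` and `<field> IS NOT NULL`; anything richer would need a real expression parser. Keys that contain a null are skipped, mirroring the usual SQL behaviour.

```python
import re

# Hypothetical restriction: only "<field> IS NULL" / "<field> IS NOT NULL".
_WHERE = re.compile(r"^\s*(\w+)\s+IS\s+(NOT\s+)?NULL\s*$", re.IGNORECASE)


def parse_where(clause):
    """Turn a restricted 'where' string into a predicate over a row dict."""
    if clause is None:
        return lambda row: True
    match = _WHERE.match(clause)
    if match is None:
        raise ValueError(f"unsupported where clause: {clause!r}")
    field = match.group(1)
    negated = match.group(2) is not None
    return lambda row: (row.get(field) is not None) == negated


def find_violations(rows, constraints):
    """Return (row_index, constraint_index) pairs for rows that repeat a key."""
    violations = []
    for c_idx, constraint in enumerate(constraints):
        applies = parse_where(constraint.get("where"))
        seen = set()
        for r_idx, row in enumerate(rows):
            if not applies(row):
                continue
            key = tuple(row.get(name) for name in constraint["fields"])
            if any(value is None for value in key):
                continue  # a key with a null never conflicts
            if key in seen:
                violations.append((r_idx, c_idx))
            seen.add(key)
    return violations


constraints = [
    {"fields": ["id_1", "id_2"]},
    {"fields": ["id_1"], "where": "id_2 IS NULL"},
    {"fields": ["id_2"], "where": "id_1 IS NULL"},
]
rows = [
    {"id_1": 1, "id_2": 1},
    {"id_1": 1, "id_2": None},
    {"id_1": None, "id_2": 1},
    {"id_1": 1, "id_2": None},  # repeats row 1 under the second constraint
    {"id_1": 2, "id_2": None},
]
violations = find_violations(rows, constraints)
print(violations)
```

Even this restricted form needs a small parser and a per-constraint pass over the data, which illustrates why a free-form `where` string is costly to specify and implement.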
Maybe a good middle ground is just configuring what happens when some of the unique fields are null? The default behaviour of a composite unique index in SQL is that if any field is null, the row isn’t checked against the index, so these two rows would be valid:
| id_1 | id_2 |
|---|---|
| 2 | |
| | 2 |
Maybe this behaviour could be controlled by a boolean flag. Something like:
"schema": {
"unique": [
{ "fields": ["id_1", "id_2"], "ignoreNullValues": true }
]
}
The naming can be improved.
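A minimal sketch of how such a flag might behave, assuming that `ignoreNullValues: true` means "skip any row whose key contains a null" (the SQL-like default) and `false` means "treat null as a regular value". The function name `duplicate_rows` and its signature are illustrative only and not part of any existing library.

```python
def duplicate_rows(rows, fields, ignore_null_values=True):
    """Return the indexes of rows whose key over `fields` repeats an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row.get(name) for name in fields)
        if ignore_null_values and any(value is None for value in key):
            continue  # rows with a null component are not checked at all
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"id_1": 2, "id_2": None},
    {"id_1": 2, "id_2": None},
]
lenient = duplicate_rows(rows, ["id_1", "id_2"], ignore_null_values=True)
strict = duplicate_rows(rows, ["id_1", "id_2"], ignore_null_values=False)
print(lenient, strict)
```

With the flag on, the repeated `(2, null)` key passes; with it off, the second row is reported as a duplicate.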
Issue Analytics
- Created 6 years ago
- Reactions:1
- Comments:26 (25 by maintainers)
Top GitHub Comments
What follows is a bit long, but hopefully clarifies the conversation.
What is being asked here and in https://github.com/frictionlessdata/goodtables-py/pull/252 is a row uniqueness constraint that treats `null` as a regular value. In that sense, it is a unique key that can also serve as a primary key, despite the presence of `null`, and can apply to both single and multiple fields.
For a single field, the requested constraint is met by the following:
| x |
|---|
| null |
but not met by:
| x |
|---|
| null |
| null |
Yet this `x` with repeating `null` is considered unique in PostgreSQL (and MySQL, Oracle, Firebird, SQLite): https://dbfiddle.uk/?rdbms=postgres_11&fiddle=21d04f1b1d151f5d0180a7f753544f95
But not in Microsoft SQL Server: https://dbfiddle.uk/?rdbms=sqlserver_2019l&fiddle=21d04f1b1d151f5d0180a7f753544f95
For multiple fields, the requested constraint is met by the following:
| x | y |
|---|---|
| 2 | null |
but not met by:
| x | y |
|---|---|
| 2 | null |
| 2 | null |
Yet this `x, y` with repeating `2, null` is considered unique by PostgreSQL (and MySQL, Oracle, Firebird, SQLite): https://dbfiddle.uk/?rdbms=postgres_11&fiddle=5845f4945ba2fcd20ab710530b2348de
But not by Microsoft SQL Server: https://dbfiddle.uk/?rdbms=sqlserver_2019l&fiddle=5845f4945ba2fcd20ab710530b2348de
So I believe Table Schema needs to be amended in two ways:
- `primaryKey` cannot contain `null` (i.e. an implicit `required: true`).
- […] `null` in `primaryKey` fields (with either explicit `required: false` in field constraints or a boolean switch on `primaryKey`) and clarify that `null` should be treated as a regular value (per Microsoft SQL Server).
- […] `null` as a regular value either by default or with a boolean switch.
Hello all! o/
I agree with @roll that primary keys shouldn’t be nullable. It seems to me that something like:
that treats `null` as any other value could solve this. This is different from SQL, where unique constraints aren’t enforced when some of their components are null, but I don’t see a problem in deviating from that. WDYT?
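The SQL behaviour discussed in these comments can be checked with SQLite, one of the databases listed above as accepting repeated nulls, through Python's built-in `sqlite3` module. This is just a quick sketch with a throwaway table `t`; it says nothing about how other engines behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER, UNIQUE (x, y))")

# Two identical keys containing NULL: both inserts succeed in SQLite,
# because NULLs are treated as distinct inside a UNIQUE constraint.
conn.execute("INSERT INTO t VALUES (2, NULL)")
conn.execute("INSERT INTO t VALUES (2, NULL)")
null_rows = conn.execute("SELECT COUNT(*) FROM t WHERE y IS NULL").fetchone()[0]

# Two identical keys without NULL: the second insert is rejected.
conn.execute("INSERT INTO t VALUES (2, 1)")
try:
    conn.execute("INSERT INTO t VALUES (2, 1)")
    repeated_non_null_accepted = True
except sqlite3.IntegrityError:
    repeated_non_null_accepted = False

print(null_rows, repeated_non_null_accepted)
```

A Table Schema constraint that treats `null` as any other value would instead reject the second `(2, NULL)` row, which is the deviation from SQL being proposed here.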