Evaluation should consider a dataset's privacy data
Splitting up issues from #88.
The current implementation does not look at the dataset_references field in privacy declarations during evaluation. It only looks at data categories, data use, data subjects, and a data qualifier.
The dataset format is hierarchical, so we want to make sure that we define the evaluation behavior well. For example:
dataset:
  - fides_key: demo_users_dataset
    name: Demo Users Dataset
    data_categories: ["user.provided.identifiable"]
    data_qualifiers: ["aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"]
    description: Data collected about users for our analytics system.
    collections:
      - name: users
        description: User information
        data_qualifier: "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
        data_categories:
          - user.provided.identifiable
        fields:
          - name: first_name
            description: User's first name
            data_qualifier: "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
            data_categories:
              - user.provided.identifiable.name
dataset, dataset_collection, and dataset_collection_field can each carry data qualifier(s) and data categories, which makes the evaluation a little tricky. Here are the things we need to be clear on:
- What specific resource does the user want to evaluate? We discussed this in #88, and with the current hierarchy it's not clear exactly which resource should be evaluated. Fields at each level could yield different evaluation results, so I think we should evaluate each level, with each following some sort of hierarchy.
- How does inheritance work? It makes sense that each resource should inherit from its closest parent when a field is not defined. What I'm not 100% sure about is whether it should also inherit qualifiers or categories from the privacy declaration. Basically, we need to define whether the other fields in the privacy declaration should have any impact on evaluations of the dataset.
- Are implicit defaults problematic in evaluations? If the evaluation model follows some sort of inheritance, then implicit defaults that are not obvious to a user might be problematic. In our code we default qualifiers to aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified, but what if you wanted to define a qualifier at the collection level that should apply to all fields?
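To make the inheritance question concrete, here is a minimal sketch of "inherit from the closest parent" resolution. It is not the project's implementation: the dict layout, the singular `data_qualifier` key at every level, the `email` field, and the helper names `inherit` and `resolve_fields` are all assumptions made for illustration. Missing values are deliberately left as `None` rather than replaced by an implicit default, so a collection-level qualifier can flow down to its fields.

```python
# Hypothetical, simplified stand-in for the dataset YAML in this issue.
# The key names and the singular data_qualifier at every level are assumptions.
dataset = {
    "fides_key": "demo_users_dataset",
    "data_qualifier": "aggregated.anonymized",
    "data_categories": ["user.provided.identifiable"],
    "collections": [
        {
            "name": "users",
            # No data_qualifier here: it should come from the dataset.
            "data_categories": ["user.provided.identifiable"],
            "fields": [
                {
                    "name": "first_name",
                    "data_categories": ["user.provided.identifiable.name"],
                },
                {
                    "name": "email",
                    "data_qualifier": "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified",
                },
            ],
        }
    ],
}

INHERITED_KEYS = ("data_qualifier", "data_categories")


def inherit(child: dict, parent: dict) -> dict:
    """Return a copy of child with each missing inherited key filled from parent."""
    resolved = dict(child)
    for key in INHERITED_KEYS:
        if resolved.get(key) is None and parent.get(key) is not None:
            resolved[key] = parent[key]
    return resolved


def resolve_fields(dataset: dict) -> list:
    """Flatten a dataset into (path, qualifier, categories) rows, one per field."""
    rows = []
    for collection in dataset.get("collections", []):
        # The collection inherits from the dataset first ...
        resolved_collection = inherit(collection, dataset)
        for field in collection.get("fields", []):
            # ... and each field then inherits from the resolved collection.
            resolved_field = inherit(field, resolved_collection)
            path = f'{dataset["fides_key"]}.{collection["name"]}.{field["name"]}'
            rows.append(
                (
                    path,
                    resolved_field.get("data_qualifier"),
                    resolved_field.get("data_categories"),
                )
            )
    return rows


for row in resolve_fields(dataset):
    print(row)
```

In this sketch `first_name` picks up the dataset-level qualifier through the collection, while `email` keeps its own qualifier and inherits the collection's categories. Whether the privacy declaration should act as one more parent in this chain is exactly the open question above.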
Top GitHub Comments
FWIW, in the nuance of your comment you're also suggesting to make it so data_qualifier is always singular, which I'm generally pretty OK with too. This would make it the same for System, Dataset, DatasetCollection, and DatasetField.
I suppose there's value in allowing this declaration:
But it’s less clear here, right? Is that saying the collection has red circles and blue squares? Or is it saying that the collection has circles and squares that may be red or blue? Both are reasonable interpretations and we have to pick a winner to decide if the policy says you aren’t allowed to have blue squares!
You avoid this issue by forcing the qualifier to be singular like this:
This disallows the annotation (which, in fairness, seems fair to allow) but makes it much clearer what we're doing. That edge case feels like something you could support in a different way, and then the singular qualifier works for the 90% of cases and avoids the potential footgun ambiguity.
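The ambiguity described in this comment can be sketched in a few lines. This is only an illustration using the comment's shapes-and-colors analogy: the function names `paired`, `crossed`, and `singular` are hypothetical, and the lists stand in for plural data categories and data qualifiers on one collection.

```python
from itertools import product

# Illustrative stand-ins: shapes play the role of data categories,
# colors play the role of data qualifiers.
categories = ["circle", "square"]
qualifiers = ["red", "blue"]


def paired(categories, qualifiers):
    """Reading 1: the i-th qualifier describes the i-th category."""
    return set(zip(qualifiers, categories))


def crossed(categories, qualifiers):
    """Reading 2: any category may occur with any qualifier."""
    return set(product(qualifiers, categories))


def singular(categories, qualifier):
    """With exactly one qualifier there is only one possible reading."""
    return {(qualifier, category) for category in categories}


# A policy rule that forbids one specific combination.
forbidden = ("blue", "circle")

print(forbidden in paired(categories, qualifiers))   # False: only red circles, blue squares
print(forbidden in crossed(categories, qualifiers))  # True: blue circles are possible
print(sorted(singular(categories, "red")))           # [('red', 'circle'), ('red', 'square')]
```

The same annotation passes or fails the rule depending on which reading the evaluator picks, which is the footgun. A singular qualifier leaves nothing to interpret, at the cost of not being able to express the paired case on a single resource.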
I like it!