Evaluation should consider a dataset's privacy data

Splitting up issues from #88

The current code implementation does not look at the dataset_references field in privacy declarations during evaluations; it only looks at the data categories, data use, data subjects, and data qualifier.
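For context, here is roughly the shape of that check. This is a hedged sketch with made-up names, not the actual fidesctl models or code:

# Illustrative only -- hypothetical models, not the real implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrivacyDeclaration:
    data_categories: List[str]
    data_use: str
    data_subjects: List[str]
    data_qualifier: str
    dataset_references: List[str] = field(default_factory=list)

@dataclass
class PolicyRule:
    data_categories: List[str]
    data_uses: List[str]
    data_subjects: List[str]
    data_qualifier: str

def rule_matches(rule: PolicyRule, decl: PrivacyDeclaration) -> bool:
    # Only these four fields participate in the check today; the
    # dataset_references field (and the datasets behind it) never does.
    return (
        any(c in rule.data_categories for c in decl.data_categories)
        and decl.data_use in rule.data_uses
        and any(s in rule.data_subjects for s in decl.data_subjects)
        and decl.data_qualifier == rule.data_qualifier
    )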

The dataset resource has an interesting hierarchical format, so we want to make sure that we define the evaluation behavior well.

dataset:
  - fides_key: demo_users_dataset
    name: Demo Users Dataset
    data_categories: ["user.provided.identifiable"]
    data_qualifiers: [ "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"]
    description: Data collected about users for our analytics system.
    collections:
      - name: users
        description: User information
        data_qualifier: "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
        data_categories:
          - user.provided.identifiable
        fields:
          - name: first_name
            description: User's first name
            data_qualifier: "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
            data_categories:
              - user.provided.identifiable.name
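
Modeled as types, the hierarchy above looks roughly like this (a hypothetical sketch, not the actual fidesctl models). Note the asymmetry: the dataset level takes plural data_qualifiers, while collection and field each take a single data_qualifier:

# Hypothetical models mirroring the YAML above -- illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetField:
    name: str
    data_categories: List[str] = field(default_factory=list)
    data_qualifier: Optional[str] = None

@dataclass
class DatasetCollection:
    name: str
    fields: List[DatasetField] = field(default_factory=list)
    data_categories: List[str] = field(default_factory=list)
    data_qualifier: Optional[str] = None

@dataclass
class Dataset:
    fides_key: str
    collections: List[DatasetCollection] = field(default_factory=list)
    data_categories: List[str] = field(default_factory=list)
    data_qualifiers: List[str] = field(default_factory=list)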

The dataset, dataset_collection, and dataset_collection_field levels each carry possible data qualifier(s) and data categories, which makes the evaluation a little tricky. Here are the things we need to be clear on:

  1. What specific resource does the user want to evaluate? We discussed this in #88, and with the current hierarchy it does not feel clear which resource exactly should be evaluated. The fields at each level could yield different evaluation results, so I think we should evaluate each level, with each following some sort of hierarchy.

  2. How does inheritance work? It makes sense that each resource should inherit from its closest parent when a field is not defined. What I’m not 100% sure of is whether it should inherit qualifiers or categories from the privacy declaration. Basically, we need to define whether the other fields in the privacy declaration should have any impact on evaluations of the dataset (one possible scheme is sketched after this list).

  3. Are implicit defaults problematic in evaluations? If the evaluation model follows some sort of inheritance, then implicit defaults that are not obvious to a user could be problematic. In our code we default qualifiers to aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified, but what if you wanted to define a qualifier at the collection level which should apply to all fields?
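
To make questions 2 and 3 concrete, here is one possible resolution scheme, sketched as closest-parent inheritance with the implicit default applied only as a last resort. The names are hypothetical, not a proposed implementation:

# Sketch of closest-parent inheritance for data qualifiers -- hypothetical,
# just to illustrate questions 2 and 3 above.
from typing import Optional

DEFAULT_QUALIFIER = (
    "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
)

def effective_qualifier(
    field_qualifier: Optional[str],
    collection_qualifier: Optional[str],
    dataset_qualifier: Optional[str],
) -> str:
    # Walk up the hierarchy: field -> collection -> dataset.
    for qualifier in (field_qualifier, collection_qualifier, dataset_qualifier):
        if qualifier is not None:
            return qualifier
    # Question 3: if this default were applied per-field *before* the walk,
    # a collection-level qualifier could never reach its fields.
    return DEFAULT_QUALIFIER

# A field with no qualifier of its own picks up the collection's:
# effective_qualifier(None, "aggregated.anonymized", None) -> "aggregated.anonymized"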

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

1 reaction
NevilleS commented, Oct 14, 2021

FWIW, in the nuance of your comment you’re also suggesting we make data_qualifier always singular, which I’m generally pretty OK with too. This would make it the same for System, Dataset, DatasetCollection, and DatasetField.

I suppose there’s value in allowing this declaration:

collection:
  - name: "foo"
    data_categories: ["circle", "square"]
    data_qualifiers: ["red", "blue"]

But it’s less clear here, right? Is that saying the collection has red circles and blue squares? Or is it saying that the collection has circles and squares that may be red or blue? Both are reasonable interpretations and we have to pick a winner to decide if the policy says you aren’t allowed to have blue squares!
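
To spell out the two readings (hypothetical code, purely to illustrate the ambiguity):

from itertools import product

data_categories = ["circle", "square"]
data_qualifiers = ["red", "blue"]

# Reading 1: pairwise -- red circles and blue squares.
pairwise = list(zip(data_qualifiers, data_categories))
# [('red', 'circle'), ('blue', 'square')]

# Reading 2: cross product -- every category may carry every qualifier.
combined = list(product(data_qualifiers, data_categories))
# [('red', 'circle'), ('red', 'square'), ('blue', 'circle'), ('blue', 'square')]

# A policy forbidding blue circles matches reading 2 but not reading 1,
# so the evaluator has to pick one interpretation and stick with it.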

You avoid this issue by forcing the qualifier to be singular like this:

collection:
  - name: "foo"
    data_categories: ["circle", "square"]
    data_qualifier: "red"

This disallows that annotation (which, in fairness, seems fair to allow) but makes it much clearer what we’re doing. That edge case feels like something you could support in a different way, and then the singular qualifier works for 90% of cases and avoids the potential footgun ambiguity.

0 reactions
ThomasLaPiana commented, Oct 15, 2021

I like it!
