Scala Common Enrich: add PII Enrichment


Disclaimer: Snowplow Analytics Ltd make no claims that the use of this enrichment will ensure or help to ensure compliance with the EU’s GDPR or ePrivacy regulations, and we will not be liable for any failure to comply with GDPR. Your use of this enrichment is governed by Apache License Version 2.0, January 2004.

The PII Enrichment lets you pseudonymize all fields in self-describing events and contexts that might contain PII.

The configuration JSON for this enrichment contains two sub-objects:

  • pii specifies the datapoint(s) from the Snowplow event which may represent PII
  • strategy defines how the enrichment should handle making the PII safe

Here is an example configuration:

{
  "enabled": true,
  "parameters": {
    "pii": [
      {
        "pojo": {
          "field": "user_id"
        }
      },
      {
        "json": {
          "field": "contexts",
          "schemaCriterion": "iglu:com.acme/email_sent/jsonschema/1-*-*",
          "jsonPath": "$.emailAddress"
        }
      }
    ],
    "strategy": {
      "pseudonymize": {
        "hashFunction": "SHA-256"
      }
    }
  }
}

To go through each of these sections in more detail:

pii

Specify an array of pii entries, namely properties in the enriched event which could represent PII. Each property is identified by its source: either pojo if the datapoint comes from the Snowplow enriched event POJO, or json if the datapoint comes from a self-describing JSON inside one of the three JSON fields (unstruct_event, contexts or derived_contexts).

For pojo, the field name must be specified. The field name will be ignored if it is not one of the following whitelisted PII fields:

  • user_id
  • user_ipaddress
  • user_fingerprint
  • domain_userid
  • network_userid
  • ip_organization
  • ip_domain
  • tr_orderid
  • ti_orderid
  • mkt_term
  • mkt_content
  • se_category
  • se_action
  • se_label
  • se_property
  • mkt_clickid
  • refr_domain_userid
  • domain_sessionid

For json, you must specify the field name as either unstruct_event, contexts or derived_contexts. You must then provide two additional fields:

  • schemaCriterion lets you specify which self-describing JSON to look in within the given JSON field. You can specify only the SchemaVer MODEL (e.g. 1-*-*), MODEL plus REVISION (e.g. 1-1-*) or a full MODEL-REVISION-ADDITION version (e.g. 1-1-1)
  • jsonPath lets you provide the JSON Path statement to navigate to the field inside the JSON that you want to pseudonymize
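
The wildcard matching described for schemaCriterion can be sketched as follows. This is only an illustration under stated assumptions, not the enrichment's own code: the helper name matches_criterion and the regular expression are hypothetical, and a `*` is assumed to match any value in that SchemaVer position.

```python
import re

# Hypothetical pattern for an Iglu schema URI or criterion:
# iglu:vendor/name/format/MODEL-REVISION-ADDITION, where each version part
# is either a number or the wildcard '*'.
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/"
    r"(?P<model>\d+|\*)-(?P<revision>\d+|\*)-(?P<addition>\d+|\*)$"
)

def matches_criterion(criterion: str, schema_uri: str) -> bool:
    """Return True if schema_uri satisfies criterion, treating '*' as a wildcard."""
    crit = IGLU_URI.match(criterion)
    uri = IGLU_URI.match(schema_uri)
    if crit is None or uri is None:
        return False
    # Vendor, name and format must match exactly.
    for part in ("vendor", "name", "format"):
        if crit.group(part) != uri.group(part):
            return False
    # Each version part must match exactly unless the criterion has a wildcard.
    for part in ("model", "revision", "addition"):
        if crit.group(part) != "*" and crit.group(part) != uri.group(part):
            return False
    return True

criterion = "iglu:com.acme/email_sent/jsonschema/1-*-*"
print(matches_criterion(criterion, "iglu:com.acme/email_sent/jsonschema/1-1-1"))  # True
print(matches_criterion(criterion, "iglu:com.acme/email_sent/jsonschema/2-0-0"))  # False
```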

strategy

The strategy section lets you configure precisely how the PII is handled by the enrichment.

Currently the only supported strategy is pseudonymize, which has one configuration option:

hashFunction specifies the hash to apply to the properties identified by the pii array. Supported values for the hashFunction are:

  • MD2, the 128-bit algorithm MD2 (not recommended due to performance; see RFC 6149)
  • MD5, the 128-bit algorithm MD5
  • SHA-1, the 160-bit algorithm SHA-1
  • SHA-256, 256-bit variant of the SHA-2 algorithm
  • SHA-384, 384-bit variant of the SHA-2 algorithm
  • SHA-512, 512-bit variant of the SHA-2 algorithm

With pseudonymization, note that the specified property in the enriched event POJO or self-describing event or context is hashed using the hashFunction, and the newly hashed value then replaces (i.e. overwrites) the prior unhashed value in the POJO or JSON.
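
As a rough illustration of hash-and-overwrite, here is a minimal Python sketch using the standard hashlib module. The function name pseudonymize, the UTF-8 input encoding and the upper-case hex output are assumptions made for the example. The enrichment itself may encode values differently.

```python
import hashlib

# Map the hashFunction names listed above to hashlib algorithm names.
# MD2 is left out because hashlib does not guarantee it is available.
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}

def pseudonymize(value: str, hash_function: str) -> str:
    """Hash a string and return the digest as a hex string.

    Assumptions: UTF-8 input encoding and upper-case hex output.
    """
    digest = hashlib.new(HASHLIB_NAMES[hash_function], value.encode("utf-8"))
    return digest.hexdigest().upper()

event = {"user_id": "John Smith"}
# The hashed value overwrites the original; the plain text is not kept anywhere.
event["user_id"] = pseudonymize(event["user_id"], "SHA-256")
print(len(event["user_id"]))  # 64 hex characters for a 256-bit digest
```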

The limitations of this approach are discussed below.

Example

Imagine an event where:

user_id is set to John Smith

The contexts array includes:

{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "john@acme.com"
  }
}

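Following processing by the PII Enrichment with the pii settings provided above (note that the hashed values shown here are 128 hexadecimal characters long, which corresponds to a 512-bit digest such as SHA-512 rather than the SHA-256 in the sample configuration):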

user_id would be mutated to:

ED014A19BB67A85F9C8B1D81E04A0E7101725BE8627D79D02CA4F3BD803F33CF3B8FED53E80D2A12C0D0E426824D99D110F0919298A5055EFFF040A3FC091518

The relevant context would become:

{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "D63227AB419893C2483E7B8F5584AC49305191CAC19531E2F8F87C3F303B5F325B470AC51E307680E4B767E9DC685CBE025B1ADC4EA8A986EFD20BFD7E4B55E9"
  }
}
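
The walk-through above can be approximated in a few lines of Python. This is a simplified sketch and rests on several assumptions: the event is modelled as a plain dict with contexts as a list of self-describing JSONs, only simple `$.a.b` style JSON Paths are handled, schemaCriterion is matched with a loose glob, and the names hash_value and apply_pii are hypothetical.

```python
import hashlib
from fnmatch import fnmatchcase

config = {
    "pii": [
        {"pojo": {"field": "user_id"}},
        {"json": {
            "field": "contexts",
            "schemaCriterion": "iglu:com.acme/email_sent/jsonschema/1-*-*",
            "jsonPath": "$.emailAddress",
        }},
    ],
    "strategy": {"pseudonymize": {"hashFunction": "SHA-256"}},
}

def hash_value(value: str, hash_function: str) -> str:
    name = hash_function.replace("-", "").lower()  # e.g. "SHA-256" -> "sha256"
    return hashlib.new(name, value.encode("utf-8")).hexdigest().upper()

def apply_pii(event: dict, config: dict) -> dict:
    """Hash the configured fields of event in place and return it."""
    fn = config["strategy"]["pseudonymize"]["hashFunction"]
    for rule in config["pii"]:
        if "pojo" in rule:
            field = rule["pojo"]["field"]
            if isinstance(event.get(field), str):
                event[field] = hash_value(event[field], fn)
        elif "json" in rule:
            spec = rule["json"]
            # Assumes the path starts with "$." and uses only dotted keys.
            keys = spec["jsonPath"][2:].split(".")
            for entity in event.get(spec["field"], []):
                # Loose glob match; '*' matches any run of characters.
                if not fnmatchcase(entity["schema"], spec["schemaCriterion"]):
                    continue
                node = entity["data"]
                for key in keys[:-1]:
                    node = node.get(key) if isinstance(node, dict) else None
                if isinstance(node, dict) and isinstance(node.get(keys[-1]), str):
                    node[keys[-1]] = hash_value(node[keys[-1]], fn)
    return event

event = {
    "user_id": "John Smith",
    "contexts": [{
        "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
        "data": {"subject": "Sensitive information", "emailAddress": "john@acme.com"},
    }],
}
apply_pii(event, config)
print(event["contexts"][0]["data"]["subject"])  # Sensitive information (untouched)
```

Only user_id and emailAddress are overwritten. The subject property is left alone because no pii entry points at it.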

Limitations

In support of compliance with GDPR and ePrivacy, we strongly recommend that you familiarize yourself with the following limitations of the enrichment. This is a non-exhaustive list of limitations.

Only supports strings

Because the enrichment mutates each property’s value in place, replacing it with a hash string, it only works if the property’s value is already typed as a string. If the value is not a string, it will be ignored by the enrichment.
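
A small sketch of the string-only behaviour, assuming a hypothetical helper hash_if_string and made-up properties customerNumber and optedIn:

```python
import hashlib

def hash_if_string(value, algorithm: str = "sha256"):
    """Return the hex digest for str inputs; return anything else unchanged."""
    if not isinstance(value, str):
        return value  # numbers, booleans, nulls, objects and arrays are ignored
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()

data = {"emailAddress": "john@acme.com", "customerNumber": 12345, "optedIn": True}
result = {key: hash_if_string(value) for key, value in data.items()}
print(result["customerNumber"], result["optedIn"])  # 12345 True
```

In this sketch a numeric identifier passes through in the clear, so a value of that kind would presumably need to be tracked as a string for the enrichment to pick it up.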

Can cause downstream JSON Schema validation to fail

Remember that this enrichment:

  1. Only supports hashing, not format-preserving encryption, and
  2. Mutates each property’s value in place

Therefore, it is possible for the updated value to cause downstream validation, such as that performed by the RDB Loader, to fail. This will typically be because the length or format of the hashed value conflicts with that of the original value.
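
To make the length and format conflict concrete, the sketch below compares hex digest lengths against a hypothetical maxLength of 64 on emailAddress. The constraint is invented for illustration, and the helper fits_max_length is not part of the enrichment.

```python
import hashlib

# Length of the hex-encoded digest for each supported hashFunction (bits / 4).
DIGEST_HEX_LENGTH = {
    "MD2": 32, "MD5": 32, "SHA-1": 40,
    "SHA-256": 64, "SHA-384": 96, "SHA-512": 128,
}

def fits_max_length(hash_function: str, max_length: int) -> bool:
    """True if a hex digest from hash_function fits within max_length characters."""
    return DIGEST_HEX_LENGTH[hash_function] <= max_length

# Hypothetical schema constraint: "emailAddress" declared with "maxLength": 64.
print(fits_max_length("SHA-256", 64))  # True
print(fits_max_length("SHA-512", 64))  # False

hashed = hashlib.sha512(b"john@acme.com").hexdigest()
print(len(hashed))    # 128
print("@" in hashed)  # False: the value no longer looks like an email address
```

A field declared with an email format, or with a maximum length shorter than the digest, would reject the hashed value even though the original value was valid.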

Is lossy

The properties processed by this enrichment are hashed, not encrypted, and are mutated in place. The original value is therefore not recoverable without re-processing the raw collector logs.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

yalisassoon commented, Jan 1, 2018 (1 reaction)

done!

BenFradet commented, Nov 7, 2017 (1 reaction)

Would supporting different strategies at a more granular level (pii level) make sense?

e.g. names in a custom context which don’t appear often so we can afford to use sha-512 but there are email addresses in every event so we limit ourselves to sha-1

Plus we could encrypt certain fields and hash others.

Additionally that would help with constraint validation, i.e. I can use the hash function that is the most appropriate wrt the field’s max length

Finally that would give more flexibility regarding “collision prevention”.
