Scala Common Enrich: add PII Enrichment
Disclaimer: Snowplow Analytics Ltd make no claims that the use of this enrichment will ensure or help to ensure compliance with the EU’s GDPR or ePrivacy regulations, and we will not be liable for any failure to comply with GDPR. Your use of this enrichment is governed by Apache License Version 2.0, January 2004.
The PII Enrichment lets you pseudonymize all fields in self-describing events and contexts that might contain PII.
The configuration JSON for this enrichment contains two sub-objects:
- pii specifies the datapoint(s) from the Snowplow event which may represent PII
- strategy defines how the enrichment should handle making the PII safe
Here is an example configuration:
{
"enabled": true,
"parameters": {
"pii": [
{
"pojo": {
"field": "user_id"
}
},
{
"json": {
"field": "contexts",
"schemaCriterion": "iglu:com.acme/email_sent/jsonschema/1-*-*",
"jsonPath": "$.emailAddress"
}
}
],
"strategy": {
"pseudonymize": {
"hashFunction": "SHA-256"
}
}
}
}
To go through each of these sections in more detail:
pii
Specify an array of pii, namely properties in the enriched event which could represent PII. Each property is identified by its source: either pojo if the datapoint comes from the Snowplow enriched event POJO, or json if the datapoint comes from a self-describing JSON inside one of the three JSON fields.
For pojo, the field name must be specified. The field name will be ignored if it is not one of the following whitelisted PII fields:
user_id
user_ipaddress
user_fingerprint
domain_userid
network_userid
ip_organization
ip_domain
tr_orderid
ti_orderid
mkt_term
mkt_content
se_category
se_action
se_label
se_property
mkt_clickid
refr_domain_userid
domain_sessionid
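As a rough illustration of the whitelist rule above, the following Python sketch filters a pii array down to the pojo fields that would be acted on. It is not the enrichment's Scala code; the function name usable_pojo_fields and the sample field page_url are assumptions made for this example.

```python
# Whitelisted POJO fields, copied from the list in the issue text.
PII_POJO_FIELDS = {
    "user_id", "user_ipaddress", "user_fingerprint", "domain_userid",
    "network_userid", "ip_organization", "ip_domain", "tr_orderid",
    "ti_orderid", "mkt_term", "mkt_content", "se_category", "se_action",
    "se_label", "se_property", "mkt_clickid", "refr_domain_userid",
    "domain_sessionid",
}


def usable_pojo_fields(pii_config):
    """Return the pojo field names that pass the whitelist (hypothetical helper)."""
    fields = []
    for entry in pii_config:
        pojo = entry.get("pojo")
        if pojo is None:
            continue  # json entries are handled separately
        name = pojo.get("field")
        if name in PII_POJO_FIELDS:
            fields.append(name)
        # Any other field name is silently ignored, as the text describes.
    return fields


config = [
    {"pojo": {"field": "user_id"}},
    {"pojo": {"field": "page_url"}},  # not whitelisted, so ignored
    {"json": {"field": "contexts"}},
]
print(usable_pojo_fields(config))
```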
For json, you must specify the field name as either unstruct_event, contexts or derived_contexts. You must then provide two additional fields:
- schemaCriterion lets you specify the self-describing JSON you are looking in for the given JSON field. You can specify only the SchemaVer MODEL (e.g. 1-*-*), MODEL plus REVISION (e.g. 1-1-*) or a full MODEL-REVISION-ADDITION version (e.g. 1-1-1)
- jsonPath lets you provide the JSON Path statement to navigate to the field inside the JSON that you want to pseudonymize
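The schemaCriterion matching described above can be sketched in Python as follows. This is a simplified illustration that assumes an unspecified REVISION or ADDITION is written as a * wildcard; the helper name matches_criterion and the regular expression are assumptions, not part of Iglu or the enrichment.

```python
import re

# Splits an Iglu URI into vendor, name, format and version.
IGLU_URI = re.compile(r"^iglu:([^/]+)/([^/]+)/([^/]+)/([^/]+)$")


def matches_criterion(schema_uri, criterion):
    """Return True if schema_uri satisfies the criterion (hypothetical helper)."""
    schema = IGLU_URI.match(schema_uri)
    crit = IGLU_URI.match(criterion)
    if not schema or not crit:
        return False
    # Vendor, name and format must match exactly.
    if schema.groups()[:3] != crit.groups()[:3]:
        return False
    schema_ver = schema.group(4).split("-")
    crit_ver = crit.group(4).split("-")
    if len(schema_ver) != 3 or len(crit_ver) != 3:
        return False
    # Each SchemaVer part must be equal unless the criterion has a wildcard.
    return all(want == "*" or want == have
               for want, have in zip(crit_ver, schema_ver))


print(matches_criterion(
    "iglu:com.acme/email_sent/jsonschema/1-1-1",
    "iglu:com.acme/email_sent/jsonschema/1-*-*",
))
```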
strategy
The strategy section lets you configure precisely how the PII is handled by the enrichment.
Currently the only supported strategy is pseudonymize, which has one configuration option:
hashFunction specifies the hash to apply to the properties identified by the pii array. Supported values for hashFunction are:
- MD2, the 128-bit algorithm MD2 (not recommended due to performance; see RFC 6149)
- MD5, the 128-bit algorithm MD5
- SHA-1, the 160-bit algorithm SHA-1
- SHA-256, the 256-bit variant of the SHA-2 algorithm
- SHA-384, the 384-bit variant of the SHA-2 algorithm
- SHA-512, the 512-bit variant of the SHA-2 algorithm
With pseudonymization, note that the specified property in the enriched event POJO or self-describing event or context will be hashed using the hashFunction, and the newly hashed value will then replace (i.e. overwrite) the prior unhashed value in the POJO or JSON.
The limitations of this approach are discussed below.
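A minimal Python sketch of the hash-and-overwrite behaviour described above, using the standard hashlib module. The UTF-8 encoding and lowercase hexadecimal output are assumptions for this example and may differ from what the enrichment emits; MD2 is left out because hashlib does not guarantee it is available.

```python
import hashlib

# Maps the hashFunction names from the config to hashlib algorithm names.
# MD2 is omitted because hashlib does not guarantee support for it.
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def pseudonymize(value, hash_function):
    """Hash a string and return the hex digest (encoding is an assumption)."""
    algorithm = HASHLIB_NAMES[hash_function]
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


event = {"user_id": "John Smith"}
# The hashed value overwrites the original in place, so the original is lost.
event["user_id"] = pseudonymize(event["user_id"], "SHA-256")
print(len(event["user_id"]))  # 64 hex characters for SHA-256
```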
Example
Imagine an event where:
- user_id is set to John Smith
- The contexts array includes:
{
"schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
"data": {
"subject": "Sensitive information",
"emailAddress": "john@acme.com"
}
}
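Following processing by the PII Enrichment with the configuration provided above, the hashed fields are overwritten as shown. Note that the example values below are 128 hexadecimal characters long, the length of a SHA-512 digest; with hashFunction set to SHA-256 as configured, each value would be 64 characters.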
user_id would be mutated to:
ED014A19BB67A85F9C8B1D81E04A0E7101725BE8627D79D02CA4F3BD803F33CF3B8FED53E80D2A12C0D0E426824D99D110F0919298A5055EFFF040A3FC091518
The relevant context would become:
{
"schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
"data": {
"subject": "Sensitive information",
"emailAddress": "D63227AB419893C2483E7B8F5584AC49305191CAC19531E2F8F87C3F303B5F325B470AC51E307680E4B767E9DC685CBE025B1ADC4EA8A986EFD20BFD7E4B55E9"
}
}
Limitations
In support of compliance with GDPR and ePrivacy, we strongly recommend that you familiarize yourself with the following limitations of the enrichment. This is a non-exhaustive list of limitations.
Only supports strings
Because the enrichment mutates each property’s value in place, replacing it with a hash string, it only works if the property’s value is already typed as a string. If the value is not a string, it will be ignored by the enrichment.
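The strings-only rule can be illustrated with a small Python sketch. The helper name hash_if_string, the default algorithm and the sample age property are assumptions made for this example.

```python
import hashlib


def hash_if_string(data, key, algorithm="sha256"):
    """Hash data[key] in place only when it is a string (hypothetical helper).

    Returns True if the value was replaced, False if it was left untouched.
    """
    value = data.get(key)
    if isinstance(value, str):
        data[key] = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
        return True
    return False  # numbers, booleans, objects, arrays and nulls are ignored


data = {"emailAddress": "john@acme.com", "age": 42}
print(hash_if_string(data, "emailAddress"))  # True: string value is hashed
print(hash_if_string(data, "age"))           # False: integer is left as-is
```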
Can cause downstream JSON Schema validation to fail
Remember that this enrichment:
- Only supports hashing, not format-preserving encryption, and
- Mutates each property’s value in place
Therefore, it is possible for the updated value to cause downstream validation, such as that performed by the RDB Loader, to fail. This will typically be because the length or format of the hashed value conflicts with that of the original value.
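To make the length conflict concrete, this Python sketch computes the hexadecimal length of each digest and checks it against a maxLength constraint. The maxLength of 64 is a hypothetical value for this example, and it assumes the hash is stored as a hexadecimal string.

```python
import hashlib

# hashFunction names mapped to hashlib names (MD2 omitted, see hashlib docs).
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def hex_length(hash_function):
    """Number of hex characters produced: two per byte of digest."""
    return hashlib.new(HASHLIB_NAMES[hash_function]).digest_size * 2


def fits(hash_function, max_length):
    """True if the hashed value would satisfy a maxLength constraint."""
    return hex_length(hash_function) <= max_length


MAX_LENGTH = 64  # hypothetical maxLength from a downstream JSON Schema
for name in HASHLIB_NAMES:
    print(name, hex_length(name), fits(name, MAX_LENGTH))
```

A format constraint such as an email format would still fail regardless of length, since a hex digest is not a valid email address.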
Is lossy
The properties processed by this enrichment are hashed, not encrypted, and are mutated in place. The original value is therefore not recoverable without re-processing the raw collector logs.
Top GitHub Comments
done!
Would supporting different strategies at a more granular level (pii level) make sense?
e.g. names in a custom context which don’t appear often so we can afford to use sha-512 but there are email addresses in every event so we limit ourselves to sha-1
Plus we could encrypt certain fields and hash others.
Additionally that would help with constraint validation, i.e. I can use the hash function that is the most appropriate wrt the field’s max length
Finally that would give more flexibility regarding “collision prevention”.