Scala Common Enrich: add API Request Enrichment

The API Request Enrichment lets you perform dimension widening on a Snowplow event via your own proprietary http(s) API.

The configuration JSON for this enrichment contains four sub-objects:

  1. inputs specifies the datapoint(s) from the Snowplow event to use as keys when performing your API lookup
  2. api defines how the enrichment can access your API
  3. outputs lets you tune how you convert the returned JSON into one or more self-describing JSONs ready to be attached to your Snowplow event
  4. cache improves the enrichment’s performance by storing values retrieved from the API

Here is an example configuration:

{
  "enabled": true,
  "parameters": {
    "inputs": [
      {
        "key": "user",
        "pojo": {
          "field": "user_id"
        }
      },
      {
        "key": "user",
        "json": {
          "field": "contexts",
          "schemaCriterion": "iglu:com.snowplowanalytics.snowplow/client_session/jsonschema/1-*-*",
          "jsonPath": "$.userId"
        }
      },
      {
        "key": "client",
        "pojo": {
          "field": "app_id"
        }
      }
    ],
    "api": {
      "http": {
        "method": "GET",
        "uri": "http://api.acme.com/users/{{client}}/{{user}}?format=json",
        "timeout": 5000,
        "authentication": {
          "httpBasic": {
            "username": "xxx",
            "password": "yyy"
          }
        }
      }
    },
    "outputs": [ {
      "json": {
        "jsonPath": "$.record",
        "schema": "iglu:com.acme/user/jsonschema/1-0-0" 
      }
    } ],
    "cache": {
      "size": 3000,
      "ttl": 60
    }
  }
}

To go through each of these sections in more detail:

inputs

Specify an array of inputs to use as keys when performing your API lookup. Each input consists of a key and a source: either pojo if the datapoint comes from the Snowplow enriched event POJO, or json if the datapoint comes from a self-describing JSON inside one of the three JSON fields. The key can be referred to later in the api.http.uri property.

For pojo, the field name must be specified. A field name which is not recognized as part of the POJO will be ignored by the enrichment.

For json, you must specify the field name as either unstruct_event, contexts or derived_contexts. You must then provide two additional fields:

  • schemaCriterion lets you specify the self-describing JSON you are looking for in the given JSON field. You can specify only the SchemaVer MODEL (e.g. 1-*-*), MODEL plus REVISION (e.g. 1-1-*) or a full MODEL-REVISION-ADDITION version (e.g. 1-1-1)
  • jsonPath lets you provide the JSON Path statement to navigate to the field inside the JSON that you want to use as the input

The lookup algorithm is short-circuiting: the first match for a given key will be used.
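
As a rough illustration of the rules above, here is a minimal Python sketch of input resolution. It is not the enrichment's actual implementation: the function names are hypothetical, the event is assumed to be a plain dict whose contexts field is already a list of self-describing JSONs, only dotted JSON Paths such as $.userId are handled, and "first match" is read as "the first input that yields a value".

```python
def matches_criterion(schema_uri, criterion):
    """Return True if an Iglu schema URI satisfies a criterion such as '.../1-*-*'."""
    uri_base, _, uri_version = schema_uri.rpartition("/")
    crit_base, _, crit_version = criterion.rpartition("/")
    if uri_base != crit_base:
        return False
    uri_parts = uri_version.split("-")
    crit_parts = crit_version.split("-")
    if len(uri_parts) != 3 or len(crit_parts) != 3:
        return False
    # '*' in the criterion is a wildcard for that SchemaVer component
    return all(c == "*" or c == u for c, u in zip(crit_parts, uri_parts))


def simple_json_path(data, path):
    """Follow a dotted path such as '$.userId'; return None if it is absent."""
    current = data
    for part in path.split(".")[1:]:  # skip the leading '$'
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def resolve_inputs(inputs, event):
    """Map each key to the first input that yields a value (short-circuiting)."""
    resolved = {}
    for spec in inputs:
        key = spec["key"]
        if key in resolved:
            continue  # an earlier input already supplied this key
        value = None
        if "pojo" in spec:
            # unrecognized or empty fields simply yield no value
            value = event.get(spec["pojo"]["field"])
        elif "json" in spec:
            json_spec = spec["json"]
            field = event.get(json_spec["field"]) or []
            # assumption: unstruct_event is a single JSON, the contexts fields are lists
            candidates = [field] if isinstance(field, dict) else field
            for sdj in candidates:
                if matches_criterion(sdj["schema"], json_spec["schemaCriterion"]):
                    value = simple_json_path(sdj["data"], json_spec["jsonPath"])
                    if value is not None:
                        break
        if value is not None:
            resolved[key] = value
    return resolved
```

With the example configuration, an event whose user_id is empty but which carries a client_session context would take user from the context's userId and client from app_id.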

api

The api section lets you configure how the enrichment should access your API. At the moment only http is supported, with this option covering both HTTP and HTTPS - the protocol on the uri field will determine which to use. Currently only GET is supported as the HTTP method for the lookup.

For the uri field, specify the full URI including the protocol. You can attach a querystring to the end of the URI. You can also embed the keys from your inputs section in the URI, by wrapping the key in {{}} brackets thus:

"uri": "http://api.acme.com/users/{{client}}/{{user}}?format=json"

If a key required in the uri was not found in any of the inputs, then the lookup will not proceed, but this will not be flagged as a failure.
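
The placeholder substitution could be sketched in Python roughly as follows. The function name build_uri is hypothetical, and percent-encoding the substituted values is an assumption of this sketch rather than something the specification above states.

```python
import re
from urllib.parse import quote

# matches {{key}} placeholders in the uri template
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def build_uri(template, keys):
    """Fill {{key}} placeholders; return None if any required key is missing."""
    needed = PLACEHOLDER.findall(template)
    if any(name not in keys for name in needed):
        return None  # the lookup is skipped, which is not treated as a failure
    # assumption: values are percent-encoded before being placed in the URI
    return PLACEHOLDER.sub(
        lambda m: quote(str(keys[m.group(1)]), safe=""), template
    )
```

Returning None (rather than raising) mirrors the rule that a missing key stops the lookup without flagging a failure.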

Currently the only supported authentication option is HTTP basic access authentication, configured under httpBasic: provide a username and/or a password for the enrichment to use when connecting to your API. Some APIs use only the username or the password field to carry an API key; in this case, set the other property to the empty string "".

If your API is unsecured (because for example it is only accessible from inside your private subnet, or using IP address whitelisting), then configure the authentication section like so:

"authentication": { }
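
For illustration, the two authentication cases might translate into request headers as in this Python sketch. The function name basic_auth_headers is hypothetical; the header format itself is standard HTTP basic access authentication.

```python
import base64


def basic_auth_headers(authentication):
    """Build request headers from the enrichment's authentication section."""
    http_basic = authentication.get("httpBasic")
    if http_basic is None:
        return {}  # unsecured API: "authentication": { }
    # a missing username or password is treated like the empty string
    username = http_basic.get("username", "")
    password = http_basic.get("password", "")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}
```

An API that expects only an API key in the username field would thus receive the credentials "my-api-key:" in encoded form.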

outputs

This enrichment assumes that your API returns JSON containing one or more entities that you want to add to your event as derived contexts. Within the outputs array, each entry is a json sub-object with a jsonPath field that specifies which part of the returned JSON to add to your enriched event. Use $ to attach the returned JSON as is.

If the specified JSON Path cannot be found within the API’s returned JSON, then the lookup (and thus the overall event) will be flagged as a failure.

The enrichment adds the returned JSON into the derived_contexts field within a Snowplow enriched event. Because all JSONs in the derived_contexts field must be self-describing JSONs, use the schema field to specify the Iglu schema URI that you want to attach to the event.

Example:

GET http://api.acme.com/users/northwind-traders/123?format=json
{
  "metadata": {
    "whenCreated": 1448371243,
    "whenUpdated": 1448373431
  },
  "record": {
    "name": "Bob Thorpe",
    "id": "123"
  }
}

With this configuration:

"outputs": [ {
  "json": {
    "jsonPath": "$.record",
    "schema": "iglu:com.acme/user/jsonschema/1-0-0"
  }
} ]

This would be added to the derived_contexts array:

{
  "schema": "iglu:com.acme/user/jsonschema/1-0-0",
  "data": {
    "name": "Bob Thorpe",
    "id": "123"
  }
}

The outputs array must have at least one entry in it.
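
The output step can be sketched in Python as below. This is a simplified illustration, not the enrichment's code: build_derived_contexts and lookup are hypothetical names, only dotted JSON Paths such as $.record are supported, and the schema is read from the schema field as in the full example configuration.

```python
_MISSING = object()  # sentinel so that a JSON null is not mistaken for "not found"


def lookup(data, path):
    """Follow a dotted JSON Path such as '$.record'; '$' returns the whole document."""
    current = data
    for part in path.split(".")[1:]:  # skip the leading '$'
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def build_derived_contexts(outputs, response):
    """Wrap each extracted fragment in a self-describing JSON."""
    if not outputs:
        raise ValueError("the outputs array must have at least one entry")
    contexts = []
    for output in outputs:
        spec = output["json"]
        value = lookup(response, spec["jsonPath"])
        if value is _MISSING:
            # a path that cannot be found fails the lookup, and thus the event
            raise LookupError("JSON Path not found: " + spec["jsonPath"])
        contexts.append({"schema": spec["schema"], "data": value})
    return contexts
```

Applied to the example response above with jsonPath $.record, this yields a single context whose data is the record object.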

cache

A Snowplow enrichment can run many millions of times per hour, effectively launching a DoS attack on a data source if we are not careful. The cache configuration attempts to minimize the number of lookups performed.

The cache is an LRU (least recently used) cache: when it is full, the least recently accessed values are evicted to make space for new ones. The uri with all keys populated is used as the key in the cache. Configure the cache as follows:

  • size is the maximum number of entries to hold in the cache at any one time
  • ttl is the number of seconds that an entry can stay in the cache before it is forcibly evicted. This is useful to prevent stale values from being retrieved in the case that your API can return different values for the same key over time
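
To make the interplay of size and ttl concrete, here is a minimal Python sketch of an LRU cache with a time-to-live. The class name TtlLruCache and the injectable clock parameter are illustrative choices, not part of the enrichment.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """LRU cache keyed by the fully populated uri, with per-entry expiry."""

    def __init__(self, size, ttl, clock=time.monotonic):
        self.size = size    # maximum number of entries
        self.ttl = ttl      # seconds an entry may stay in the cache
        self.clock = clock  # injectable so that expiry can be tested
        self._entries = OrderedDict()  # uri -> (stored_at, value)

    def get(self, uri):
        entry = self._entries.get(uri)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[uri]  # stale entry: evict it and report a miss
            return None
        self._entries.move_to_end(uri)  # mark as most recently used
        return value

    def put(self, uri, value):
        self._entries[uri] = (self.clock(), value)
        self._entries.move_to_end(uri)
        while len(self._entries) > self.size:
            self._entries.popitem(last=False)  # drop the least recently used entry
```

A caller would check get(uri) first and only perform the HTTP lookup on a miss, storing the response with put(uri, response). Note that this sketch reports a miss as None, so it cannot cache a None value.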

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 28 (28 by maintainers)

Top GitHub Comments

alexanderdean commented, Mar 4, 2016 (1 reaction)
alexanderdean commented, Apr 28, 2016 (0 reactions)

Yes please @chuwy! Let’s implement it. I need to do a new Iglu Central release for the Clearbit tutorial anyway…
