
[CT-1599] [Feature] dbt should know more semantic information

See original GitHub issue

Learnings from the past year

Before we get into what this issue is proposing to add to dbt-core, we want to make sure that the community understands what this functionality is building towards. Over the coming years, we envision dbt expanding beyond its current scope to provide users with the best experience in creating their data knowledge graph, comprising three layers:

  • The Physical Layer: the underlying object, location where data is stored
  • The Logical Layer: the method of applying dbt transformations to create datasets
  • The Semantic Layer: the mapping of business noun/verbs onto the logical/physical layer

dbt will tie all of these into a cohesive experience, but recognizing them as distinct components provides an experience that is easy to adopt and flexible enough for many use cases.

Layers

Today, dbt has tightly coupled logical and physical layers and a new semantic layer that has initially focused on metrics. But in the past year, we’ve learned that in order to build the broader vision, we have to lay the groundwork of a fully featured semantic layer. Our goals are:

  • Create a loosely coupled abstraction on top of the logical layer where users can define and consume semantic logic (definitions, relationships, etc.) in their dbt projects.
  • Maximize efficiency and minimize errors in dbt asset definition for knowledge builders.
  • Maximize discoverability of dbt assets for data consumers.
  • Expand the range of tools that can interact with the Semantic Layer and the use cases it can serve.

What problems are we solving?

Here’s what we aim to accomplish in the near- to medium-term with this proposed scope:

  • Provide users with the foundation for a broader and more flexible semantic layer.
  • Decrease the effort and difficulty of defining metrics, namely dimension definition.
  • Ensure that finding the right semantic constructs is possible at scale.

Introducing the entity

To both solve these problems and lay the foundation for a semantic future, we are proposing a new node type called an entity. Entities are top-level nodes within dbt-core and represent the declared interface to a specific model, containing additional metadata (semantic information) that can’t live within models. Each entity will be associated with a distinct business noun/verb and allow dbt users to create a single universal semantic model across their entire project.

To quote our lovely @jtcohen6, entities are for everyone. We envision a world where there might be teams of different humans managing the logical layer and the semantic layer given their interest and expertise.

Entities

What are our building blocks?

In order to solve these problems, we need to figure out what our building blocks are and whether we need to add anything new:

  • model: A data transformation that provides the business-conformed representation of the dataset, or more specifically a discrete unit of transformation. This is the building block component of the logical layer.
  • metric: An aggregation of data (defined on top of an entity) that represents a measurable indicator for the business.
    • With the introduction of entities, metrics need to change so that they can be built on top of entities as opposed to models. But this is good news! Not only does it allow metrics to inherit a lot of the defined information (making metrics more DRY), but it is also a forcing function to make metrics more flexible.
  • entity: A new abstraction loosely coupled with a model that allows users to map business concepts onto the underlying logical model.
    • This is the new “building block” that we’re proposing! We believe this will unlock new functionality and give us the best framework on which to continue building the Semantic Layer.
    • In other words, the entity construct allows you to define all of your business concepts as first-class representations in your dbt project so all this information can be consumed downstream in your analytics tools of choice.
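To make the inheritance idea concrete, here is a sketch of an entity and a metric built on top of it, following the spec proposed later in this issue. The entity, model, and column names are hypothetical, and the exact YAML keys are assumptions based on the spec table below:

```yaml
# Sketch only: property names follow the proposed entity/metric specs in this
# issue; the names (customers, dim_customers, signup_date) are hypothetical.
entity:
  - name: customers
    model: ref('dim_customers')
    dimensions:
      - name: signup_date
        data_type: date
        default_timestamp: true        # metrics on this entity default to this timestamp
        time_grains: [day, week, month]

metrics:
  - name: new_customers
    label: New Customers
    entity: customers                  # inherits signup_date and its time grains
    calculation_method: count_distinct
    expression: user_id
```

Because the metric points at the entity rather than a model, it no longer needs to re-declare the timestamp, time grains, or dimension list.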

Fitting in with our story

The ever-present theme of dbt Labs’ story is taking the best practices of software engineering and converting them to the data world. In the case of entities, we’re taking the best practice of API design and contracts between consumers and producers. Software engineering teams don’t expose the underlying table to their consumers – they bundle it in a format that they know matches the consuming behavior. So too should dbt users employ those principles to build their semantic layers.

The entity spec

| Property | Description | Example | Required |
| --- | --- | --- | --- |
| name | The name of the entity | orders | Yes |
| model | The name of the model that the entity is dependent on | fact_orders | Yes |
| description | The description of the entity | Lorem Ipsum | No |
| dimensions | The list of dimensions and their properties associated with the entity | See below | No |
| dimensions.include | Either `*` to denote all columns or a list of columns that will be inherited | `*` or [column_a, column_b] | No |
| dimensions.exclude | If `*` is set at include, this is a list of columns to be excluded from that list | [column_a] | No |
| dimensions.name | The name of the dimension | order_location | No |
| dimensions.column_name | The name of the column in the model if not 1:1; serves as the mapping location | | No |
| dimensions.data_type | The data type of the dimension | string | No |
| dimensions.description | Description of the dimension | The location of the order | No |
| dimensions.default_timestamp | Whether this datetime dimension is the default for metrics | false | No |
| dimensions.time_grains | Acceptable time grains for the datetime dimension | [day, week, month] | No |
| dimensions.primary_key | Whether this dimension is part of the primary key | false | No |

What would an example look like?

# models/semantic_layer/product/schema.yml

entity:
  - name: orders
    model: ref('fact_orders')
    dimensions:
      include: "*"
      exclude: [column_c]

  ## An example where we don't want it to inherit everything from the model
  - name: organization
    model: ref('dim_organization')
    dimensions:
      - name: organization_id
        primary_key: true

      - name: some_dimension_name
        column_name: some_column_name_that_doesnt_match
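For completeness, here is a sketch that exercises the remaining dimension properties from the spec table (data_type, default_timestamp, time_grains, description). The entity and column names are hypothetical, and the keys are assumptions based on the table above:

```yaml
# Sketch only: dimension property names follow the spec table above;
# the entity, model, and column names are hypothetical.
entity:
  - name: orders
    model: ref('fact_orders')
    description: One row per order placed by a customer
    dimensions:
      - name: order_id
        data_type: string
        primary_key: true
      - name: ordered_at
        data_type: timestamp
        default_timestamp: true         # default timestamp for metrics on this entity
        time_grains: [day, week, month]
      - name: order_location
        column_name: location           # maps the dimension onto a differently named column
        data_type: string
        description: The location of the order
```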

Functional requirements

  • Entities will participate in the dbt DAG as a distinct node type
  • Entity nodes should be accessible in the dbt Core compilation context via the graph.entities variable
  • Entity nodes should be emitted into the manifest.json artifact
  • Entities should work with partial parsing
  • Entity nodes should be supported in node selection and should be selectable with the entity: selector
    • When listing nodes, existing graph operators (+, &, etc.) should be supported
  • Entities should be surfaced in the dbt Docs website
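The node-selection requirement could also be expressed in a YAML selector. The sketch below assumes the proposed entity: selection method exists and follows dbt's existing selectors.yml format; the selector and entity names are hypothetical:

```yaml
# selectors.yml — a sketch assuming the proposed entity: selection method.
# The selector name and entity name are hypothetical.
selectors:
  - name: orders_semantic_graph
    description: The orders entity plus everything upstream of it
    definition:
      method: entity
      value: orders
      parents: true    # equivalent to the + graph operator on the upstream side
```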

Similar to metrics, dbt Core itself will not evaluate or materialize entities. These are virtualized abstractions exposed to downstream tools/packages for discovery, understanding, and dynamic dataset generation. Properties like data_type are also useful for Semantic Layer integrations.

Just as dbt_metrics exists to interact with metrics, we’ll provide a method of interacting with entities that will evolve in this way until it is stable and bundled with dbt-core. The exact format will come in a future issue.

The updated metric spec

With the addition of entities:

| Property | Description | Example | Required | Changed |
| --- | --- | --- | --- | --- |
| name | The name of the metric | new_customers | Yes | No |
| label | The human-readable name of the metric | New Customers | Yes | No |
| entity | The name of the entity that the metric is dependent on | customers | Yes | Yes |
| calculation_method | The method of calculation (aggregation or derived) that is applied to the expression | count_distinct | Yes | No |
| expression | The expression to aggregate/calculate over | user_id | Yes | No |
| description | The description of the metric | Lorem Ipsum | No | No |
| dimensions.include | Either `*` to denote all columns or a list of columns that will be inherited | `*` or [column_a, column_b] | No | Yes |
| dimensions.exclude | If `*` is set at include, this is a list of columns to be excluded from that list | [column_a] | No | Yes |
| timestamp | The time-based component of the metric | signup_date | No | No |
| time_grains | One or more “grains” at which the metric can be evaluated | [day, week, month, quarter, year] | No | No |
| window | A dictionary for aggregating over a window of time | {count: 14, period: day} | No | No |
| filters | A list of filters to apply before calculating the metric | See below | No | No |
| config | Optional configurations for calculating this metric | {treat_null_values_as_zero: true} | No | No |
| meta | Arbitrary key/value store | {team: Finance} | No | No |

What would an example look like?

metrics:
  - name: new_customers
    label: New Customers
    entity: customers
    calculation_method: count_distinct
    expression: user_id

    dimensions:
      include: "*"
      exclude:
        - column_a
        - column_b
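A sketch exercising the optional properties as well, assuming the filter shape (field/operator/value) carries over unchanged from the dbt 1.3 metric spec; the metric, entity, and field names are hypothetical:

```yaml
# Sketch only: filters follow the dbt 1.3-style field/operator/value shape;
# all names here are hypothetical.
metrics:
  - name: large_new_customers
    label: Large New Customers
    entity: customers
    calculation_method: count_distinct
    expression: user_id
    timestamp: signup_date              # optional under the new spec
    time_grains: [day, week, month]     # optional; defaults apply if omitted
    filters:
      - field: plan_tier
        operator: '='
        value: "'enterprise'"           # string literals are quoted for SQL
```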

What’s changed:

  • Metrics are now built on top of an entity instead of a model
  • The timestamp property is now optional.
  • The time_grains property is now optional. If a timestamp is provided that does not have time_grains associated with it, we will provide defaults of day, week, month, and year.
  • The dimensions property has been split into two properties:
    • include: This property is either set to * which inherits all of the dimensions from the entity or a list of columns that limits the input
    • exclude: If include is configured as * then this property can be used to exclude the listed dimensions from the dimension list
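The changes above can be sketched as a before/after comparison. The "before" follows the dbt 1.3-style metric spec and the "after" follows this proposal; the metric, model, and column names are hypothetical:

```yaml
# Before (1.3-style): the metric points directly at a model and re-declares
# its own timestamp, grains, and dimension list.
metrics:
  - name: new_customers
    model: ref('dim_customers')
    calculation_method: count_distinct
    expression: user_id
    timestamp: signup_date
    time_grains: [day, week, month]
    dimensions: [plan_tier, region]
---
# After (this proposal): the metric points at an entity, inherits these
# properties, and narrows the dimension list with include/exclude.
metrics:
  - name: new_customers
    entity: customers
    calculation_method: count_distinct
    expression: user_id
    dimensions:
      include: "*"
      exclude: [internal_flag]
```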

How does this impact what you’ve currently built with metrics?

Similar to how we handled changes to the metric spec in the release of dbt-core 1.3.0, we will support the old behavior for a full minor version with backwards compatibility. After that, we will fully deprecate the old properties.

This means that your metric definitions will remain viable for a full minor version upgrade - so if this is launched as part of 1.5.0 then you don’t need to migrate until 1.6.0. That being said, we hope you migrate over earlier for the advantages of using entities 🙂.

Let’s talk about joins

All right, it’s time to address the complex elephant in the room: joins. Supporting joins has been one of the top feature requests that we’ve heard since adding metrics to dbt and we understand why. Joins will allow you to fit metrics into your overall data model (be it Kimball, Inmon, etc.) and expand the ease with which your teams can adopt metrics.

But they’re not part of this issue. And that’s for a good reason: joins are hard.

We’re committed to adding joins in the future but are very aware that supporting this functionality effectively means building two-thirds of a query planner ourselves. And it’s a query planner that needs the underlying information that we will add via loosely coupled entities.

This is only further complicated by our goal of providing a universal semantic layer across all entities defined in your project, as opposed to an explore-based semantic layer where relationships may need to be defined multiple times. In this world, our query construction process has to be able to traverse the semantic graph to determine whether a query is not only viable but also if it makes semantic sense. To quote the original metrics issue:

It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average or adding two distinct counts together. You get a number back… but it’s not a useful or meaningful result.
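To make the quoted pitfall concrete, a small worked example with hypothetical numbers showing why averaging an average is semantically meaningless:

```latex
% Group A: 1 order averaging \$100; Group B: 99 orders averaging \$10.
% Average of the two group averages:
\frac{100 + 10}{2} = 55
% True average over all 100 orders:
\frac{1 \cdot 100 + 99 \cdot 10}{100} = 10.9
```

Both queries are syntactically valid SQL, but only the second answers the question "what is the average order value?" — which is exactly the class of error a semantic-graph-aware query planner must prevent.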

With all this said, we are committed to adding joins in the future but are taking our time to ensure what we launch is right for analytics engineers, the data consumers, and our integration partners who will build on top of it.

Describe alternatives you’ve considered

Including this semantic information in the model config

We explored a number of different designs during the ideation process and one of the main alternatives was storing this type of semantic information inside of the model configuration. Ultimately we determined it wasn’t the path forward for a number of reasons:

  • Storing semantic information in the model config creates a tightly coupled experience between the logical and semantic layers, which would make it difficult to enable our vision of different groups of humans simultaneously contributing to different layers
  • The goal of the model config is to serve as the implementation detail, whereas the goal of the entity is to be a declared interface. Coupling them together reduces the flexibility of the declared interface
  • Coupled entities cannot move/migrate to a new underlying model without significant effort, whereas a loosely coupled interface can be pointed at a new model easily.
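For contrast, the rejected alternative would have looked something like the sketch below. The `semantic` config key and its sub-properties are hypothetical (this design was never specced):

```yaml
# The rejected alternative, sketched for contrast: semantic properties
# embedded in the model's own config block. The `semantic` key is hypothetical.
models:
  - name: fact_orders
    config:
      semantic:
        entity_name: orders
        dimensions:
          - name: order_location
            data_type: string
# Tightly coupling the interface to the model like this is what the entity
# node avoids: a standalone entity can be re-pointed at a new model freely.
```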

Are you interested in contributing to this feature?

Absolutely.


Footnotes

  1. A kind soul asked what this new node type would mean for exposures. We imagine exposures serving a very similar, if somewhat more important, role to the one they play today: they would represent the consuming experiences sitting on top of entities and metrics!

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Reactions: 36
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

callum-mcdata commented, Dec 7, 2022 (4 reactions)

Love the engagement we’re seeing here! Let me see if I can do my best to address concerns in the following areas:

Fuzzy Added Value

These are fair concerns! Let’s try to address the added value first.

what would be the added value of defining entities outside their models @olivierdupuis

adding a not immediately distinct layer on top of final models @jaypeedevlin (Slack)

Concretely, the real value you’ll get today from defining an entity on top of a model is the increased flexibility to define metrics. Metrics built on top of entities can inherit defined properties, such as the time_grains associated with a dimension, the default timestamp associated with an entity, or the list of dimensions for that entity. Adding this kind of inheritance behavior into models (i.e., representing semantic concepts within the model config) begins to blur the line between declared interface and implementation detail, which overloads the model config, as @cafzal notes in the comment above.

This workstream/proposal is really about creating the building blocks that will enable the functionality of Tomorrow™️, such as joins.

Models/Entities Are 1:1?

I already think of terminal (model) nodes as “entities” that I’m surfacing for end users. With this in mind, I feel like (based on the current proposal) I would have to double handle all my terminal models to define entities @jaypeedevlin

If there’s a 1:1 relationship between a model and an entity, then aren’t they always going to be tightly coupled? That’s one of the reasons I was curious about entities containing joins (or at least information to facilitate joins), since that would demand a looser coupling. @jaypeedevlin

This is a great callout! When we say loosely coupled, we’re referring to the fact that the relationships can be swapped/detached without impacting either of the nodes in question. That is, this loose coupling allows users to detach any semantic model and move it over to any new or edited implementation.

I already think of terminal (model) nodes as “entities” that I’m surfacing for end users. With this in mind, I feel like (based on the current proposal) I would have to double handle all my terminal models to define entities @jaypeedevlin

This is a totally reasonable hesitancy and part of the feedback we’re hoping to get from commenters such as yourself. We feel reasonably confident that an entity as its own first-class representation inside of dbt is powerful because of the workflows it could enable outside of the AE workflow (more integration partners, easier interfaces for business users to add to the project, etc).

Obviously this comes with a degree of additional complexity that we believe is worth the tradeoff. What we’d love to get feedback on is ways to improve this developer experience - properties, behaviors, etc etc. What are some potential ways that this concept can fit more easily into your workflow?

Moving Along To Concrete Properties - Specifics Around Datatype!

If the configuration of datatype on an entity deviated from the corresponding data_type configured on an underlying model column (if provided), would this result in data type coercions within entity query compilation?

Our vision here was that data_type would be a purely metadata property that could be provided to BI tools, but we’re very open to admitting we’re wrong on this one! The problem we’re attempting to solve is that column/dimension data types are not introspected from the database as part of the manifest, so to surface them to our integration partners we have to rely on users running a command that generates the catalog, such as dbt docs generate. The hope is that by providing multiple places to configure data types (model, entity, etc.), integration partners aren’t as reliant on a user generating the catalog.

Or perhaps this metadata really should just live in the logical layer (on the model), and propagate its way up to a metric through an entity. Either way, we’d want to be precise in documentation about how this attribute is set and used by consumers.

I am willing to be convinced that this is really a property of the implementation detail (i.e., the model config) and not the declared interface, even if it somewhat diverges from some of the API design principles we’re trying to learn from. Especially with the work being done around constraints inside core, this feels like a reasonable thing to push down to the logical layer.

cafzal commented, Dec 6, 2022 (4 reactions)

@olivierdupuis Thanks for your comment. You’re on 🎯 about entities as semantic building blocks.

The main advantage of structuring this design around entities is the ability to detach them from the logical layer (models) and re-attach them to any new implementation. This unlocks interesting possibilities by allowing data teams to map sources to entities and gradually standardize/automate metric and even entity definition (in line with Abhi’s vision). You can see this trend toward (semi-)automation in the dbt metric packages that Fivetran and Houseware released.

We recognize @PedramNavid 's point about entities adding more complexity for analytics engineers to keep track of. There’s a definite tradeoff to adding a new layer to the dbt project. But we’ve weighed the options: it doesn’t make sense to overload models, and while treating metrics as first-class objects helps with lineage, modularity, etc., defining them is too inefficient right now (speaking of having to keep track of columns…). We’ll keep iterating on the developer experience to reduce the overhead of managing different assets.

Introducing entities will help unlock metric definition at scale, lower the barrier of entry for data consumers to get involved (both in entity definition and analysis), and start paving this path to standardized packages.
