[CT-1599] [Feature] dbt should know more semantic information
Learnings from the past year
Before we get into what this issue is proposing to add to dbt-core, we want to make sure that the community understands what this functionality is building towards. Over the coming years, we envision dbt expanding beyond its current scope to provide users with the best experience in creating their data knowledge graph, comprising three layers:
- The Physical Layer: the underlying object, location where data is stored
- The Logical Layer: the method of applying dbt transformations to create datasets
- The Semantic Layer: the mapping of business noun/verbs onto the logical/physical layer
dbt will tie all of these together into a cohesive experience, but recognizing them as distinct components yields a system that is easy to adopt yet flexible enough for many use cases.

Today, dbt has tightly coupled logical and physical layers and a new semantic layer that has initially focused on metrics. But in the past year, we’ve learned that in order to build the broader vision, we have to lay the groundwork of a fully featured semantic layer. Our goals are:
- Create a loosely coupled abstraction on top of the logical layer where users can define and consume semantic logic (definitions, relationships, etc.) in their dbt projects.
- Make defining dbt assets more efficient and less error-prone for knowledge builders.
- Maximize discoverability of dbt assets for data consumers.
- Expand the range of tools that can interact with the Semantic Layer and the use cases it can serve.
What problems are we solving?
Here’s what we aim to accomplish in the near- to medium-term with this proposed scope:
- Provide users with the foundation for a broader and more flexible semantic layer.
- Decrease the effort and difficulty of defining metrics, namely dimension definition.
- Ensure that finding the right semantic constructs is possible at scale.
Introducing the entity
To both solve these problems and lay the foundation for a semantic future, we are proposing a new node type called an entity. entities are top-level nodes within dbt-core and represent the declared interface to a specific model, containing additional metadata (semantic information) that can’t live within models. Each entity will be associated with a distinct business noun/verb and allow for dbt users to create a single universal semantic model across their entire project.
To quote our lovely @jtcohen6 , entities are for everyone. We envision a world where there might be teams of different humans managing the logical layer and the semantic layer given their interest and expertise.
What are our building blocks?
In order to solve these problems, we need to figure out what our building blocks are and whether we need to add anything new:
- model: A data transformation that provides the business-conformed representation of the dataset (more specifically, a discrete unit of transformation). This is the building-block component of the logical layer.
- metric: An aggregation of data (defined on top of an entity) that represents a measurable indicator for the business.
  - With the introduction of entities, metrics need to change so that they can be built on top of entities as opposed to models. But this is good news! Not only does it allow metrics to inherit a lot of the defined information (making metrics more DRY), but it is also a forcing function to make metrics more flexible.
- entity: A new abstraction loosely coupled with a model that allows users to map business concepts onto the underlying logical model.
  - This is the new "building block" that we're proposing! We believe this will unlock new functionality and give us the best framework on which to continue building the Semantic Layer.
  - In other words, the entity construct allows you to define all of your business concepts as first-class representations in your dbt project so all this information can be consumed downstream in your analytics tools of choice.
Fitting in with our story
The ever-present theme of dbt Labs’ story is taking the best practices of software engineering and converting them to the data world. In the case of entities, we’re taking the best practice of API design and contracts between consumers and producers. Software engineering teams don’t expose the underlying table to their consumers – they bundle it in a format that they know matches the consuming behavior. So too should dbt users employ those principles to build their semantic layers.
The entity spec
| Property | Description | Example | Required |
|---|---|---|---|
| name | The name of the entity | orders | Yes |
| model | The name of the model that the entity is dependent on | fact_orders | Yes |
| description | The description of the entity | Lorem Ipsum | No |
| dimensions | The list of dimensions and their properties associated with the entity. | See below | No |
| dimensions.include | Either * to denote all columns or a list of columns that will be inherited | * or [column_a, column_b] | No |
| dimensions.exclude | If include is set to *, this is a list of columns to be excluded from that list | [column_a] | No |
| dimensions.name | The name of the dimension | order_location | No |
| dimensions.column_name | The name of the column in the model if not 1:1. Serves as mapping | location | No |
| dimensions.data_type | The data type of the dimension | string | No |
| dimensions.description | Description of the dimension | The location of the order | No |
| dimensions.default_timestamp | Setting datetime dimension as default for metrics | false | No |
| dimensions.time_grains | Acceptable time grains for the datetime dimension | [day, week, month] | No |
| dimensions.primary_key | Whether this dimension is part of the primary key | false | No |
What would an example look like?
```yaml
# models/semantic_layer/product/schema.yml
entities:
  - name: orders
    model: ref('fact_orders')
    dimensions:
      include: "*"
      exclude: [column_c]

  # An example where we don't want it to inherit everything from the model
  - name: organization
    model: ref('dim_organization')
    dimensions:
      - name: organization_id
        type: primary_key
      - name: some_dimension_name
        column_name: some_column_name_that_doesnt_match
```
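The include/exclude behavior described in the spec above can be sketched as a small resolution step. This is a hypothetical helper, not dbt-core's actual implementation; `model_columns` stands in for the columns of the underlying model:

```python
def resolve_dimensions(model_columns, include="*", exclude=None):
    """Resolve an entity's dimension list from its model's columns.

    include: "*" (inherit all of the model's columns) or an explicit list.
    exclude: columns to drop when include is "*".
    """
    exclude = exclude or []
    if include == "*":
        return [c for c in model_columns if c not in exclude]
    return list(include)

# Columns of fact_orders, per the orders example above
columns = ["order_id", "column_a", "column_b", "column_c"]
resolve_dimensions(columns, include="*", exclude=["column_c"])
# -> ["order_id", "column_a", "column_b"]
```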
Functional requirements
- Entities will participate in the dbt DAG as a distinct node type
- Entity nodes should be accessible in the dbt Core compilation context via the `graph.entities` variable
- Entity nodes should be emitted into the `manifest.json` artifact
- Entities should work with partial parsing
- Entity nodes should be supported in node selection and should be selectable with the `entity:` selector
  - When listing nodes, existing graph operators (`+`, `&`, etc.) should be supported
- Entities should be surfaced in the dbt Docs website
Similar to metrics, dbt Core itself will not evaluate or materialize entities. These are virtualized abstractions exposed to downstream tools/packages for discovery/understanding and dynamic dataset generation. Properties like data type are also useful for Semantic Layer integrations.
Just as dbt_metrics exists to interact with metrics, we’ll provide a method of interacting with entities that will evolve in this way until it is stable and bundled with dbt-core. The exact format will come in a future issue.
The updated metric spec
With the addition of entities:
| Property | Description | Example | Required | Changed |
|---|---|---|---|---|
| name | The name of the metric | new_customers | Yes | No |
| label | The human readable name of the metric | New Customers | Yes | No |
| entity | The name of the entity that the metric is dependent on | customers | Yes | Yes |
| calculation_method | The method of calculation (aggregation or derived) that is applied to the expression | count_distinct | Yes | No |
| expression | The expression to aggregate/calculate over | user_id | Yes | No |
| description | The description of the metric | Lorem Ipsum | No | No |
| dimensions.include | Either * to denote all columns or a list of columns that will be inherited | * or [column_a, column_b] | No | Yes |
| dimensions.exclude | If include is set to *, this is a list of columns to be excluded from that list | [column_a] | No | Yes |
| timestamp | The time-based component of the metric | signup_date | No | No |
| time_grains | One or more “grains” at which the metric can be evaluated. | [day, week, month, quarter, year] | No | No |
| window | A dictionary for aggregating over a window of time. | {count: 14, period: day} | No | No |
| filters | A list of filters to apply before calculating the metric | See below | No | No |
| config | Optional configurations for calculating this metric | {treat_null_values_as_zero: true} | No | No |
| meta | Arbitrary key/value store | {team: Finance} | No | No |
What would an example look like?
```yaml
metrics:
  - name: new_customers
    label: New Customers
    entity: customers
    calculation_method: count_distinct
    expression: user_id
    dimensions:
      include: "*"
      exclude:
        - column_a
        - column_b
```
What’s changed:
- Metrics are now built on top of an `entity` instead of a `model`
- The `timestamp` property is now optional.
- The `time_grains` property is now optional. If a `timestamp` is provided that does not have `time_grains` associated with it, we will now provide defaults of `day, week, month, year`
- The `dimensions` property has been split into two properties:
  - `include`: This property is either set to `*`, which inherits all of the dimensions from the entity, or a list of columns that limits the input
  - `exclude`: If `include` is configured as `*`, then this property can be used to exclude the listed dimensions from the dimension list
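The inheritance behavior described above could be sketched like this. This is a hypothetical helper, not dbt-core's parser; both nodes are represented as plain dicts standing in for parsed YAML:

```python
# Proposed defaults when a timestamp is given without time_grains
DEFAULT_TIME_GRAINS = ["day", "week", "month", "year"]

def resolve_metric(metric, entity):
    """Fill in a metric's optional properties from its entity."""
    resolved = dict(metric)
    # Fall back to the entity's default timestamp dimension.
    resolved.setdefault("timestamp", entity.get("default_timestamp"))
    # If a timestamp exists but no grains were given, apply the proposed defaults.
    if resolved.get("timestamp") and not resolved.get("time_grains"):
        resolved["time_grains"] = DEFAULT_TIME_GRAINS
    # Inherit the entity's dimension list, honoring include/exclude.
    dims = metric.get("dimensions", {})
    include = dims.get("include", "*")
    exclude = dims.get("exclude", [])
    entity_dims = entity.get("dimensions", [])
    if include == "*":
        resolved["dimensions"] = [d for d in entity_dims if d not in exclude]
    else:
        resolved["dimensions"] = list(include)
    return resolved
```

Under this sketch, the `new_customers` example above would inherit its timestamp, grains, and dimension list from the `customers` entity without restating them.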
How does this impact what you’ve currently built with metrics?
Similar to how we handled changes to the metric spec in the release of dbt-core 1.3.0, we will support the old behavior for a full minor version with backwards compatibility. After that, we will fully deprecate the old properties.
This means that your metric definitions will remain viable for a full minor version upgrade - so if this is launched as part of 1.5.0 then you don’t need to migrate until 1.6.0. That being said, we hope you migrate over earlier for the advantages of using entities 🙂.
Let’s talk about joins
All right, it’s time to address the complex elephant in the room: joins. Supporting joins has been one of the top feature requests that we’ve heard since adding metrics to dbt and we understand why. Joins will allow you to fit metrics into your overall data model (be it Kimball, Inmon, etc.) and expand the ease with which your teams can adopt metrics.
But they’re not part of this issue. And that’s for a good reason: joins are hard.
We’re committed to adding joins in the future but are very aware that supporting this functionality effectively means building two-thirds of a query planner ourselves. And it’s a query planner that needs the underlying information that loosely coupled entities will provide.
This is only further complicated by our goal of providing a universal semantic layer across all entities defined in your project, as opposed to an explore-based semantic layer where relationships may need to be defined multiple times. In this world, our query construction process has to be able to traverse the semantic graph to determine whether a query is not only viable but also if it makes semantic sense. To quote the original metrics issue:
It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average or adding two distinct counts together. You get a number back… but it’s not a useful or meaningful result.
With all this said, we are committed to adding joins in the future but are taking our time to ensure what we launch is right for analytics engineers, the data consumers, and our integration partners who will build on top of it.
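As a toy illustration of the kind of check such a query planner would have to make (this is not anything dbt ships, just a sketch of the "semantically meaningless" guard quoted above):

```python
# Aggregations that cannot safely be aggregated again: re-averaging an
# average, or adding two distinct counts, yields a number but not a
# meaningful one.
NON_ADDITIVE = {"average", "count_distinct", "median"}

def check_reaggregation(inner_method, outer_method):
    """Return an error message if applying `outer_method` over results of
    `inner_method` is semantically meaningless, else None."""
    if inner_method in NON_ADDITIVE:
        return (
            f"cannot apply {outer_method} over {inner_method} results: "
            "the inner aggregation is not additive"
        )
    return None
```

A real planner would also need the semantic graph itself (grain, keys, relationships) to decide join viability, which is exactly the information entities are meant to carry.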
Describe alternatives you’ve considered
Including this semantic information in the model config
We explored a number of different designs during the ideation process and one of the main alternatives was storing this type of semantic information inside of the model configuration. Ultimately we determined it wasn’t the path forward for a number of reasons:
- Storing semantic information in the model config creates a tightly coupled experience between the logical and semantic layers, which would make it difficult to enable our vision of different groups of humans simultaneously contributing to different layers
- The goal of the model config is to serve as the implementation detail, whereas the goal of the entity is to be a declared interface. Coupling them together reduces the flexibility of the declared interface
- Coupled entities cannot move/migrate to a new underlying model without significant effort, whereas a loosely coupled interface can be pointed at a new model easily.
Are you interested in contributing to this feature?
Absolutely.
Footnotes
- A kind soul asked what this new node type would mean for `exposures`. We imagine `exposures` serving a very similar, if somewhat more important, role to the one they play today, in that they would represent the consuming experiences sitting on top of entities and metrics!
Issue Analytics
- Created 10 months ago
- Reactions: 36
- Comments: 10 (6 by maintainers)

Love the engagement we’re seeing here! Let me see if I can do my best to address concerns in the following areas:
Fuzzy Added Value
These are fair concerns! Let's try to address the added value first.
Concretely, the real value you’ll get today from defining an entity on top of a model is the increased flexibility to define metrics. Metrics built on top of entities can inherit defined properties, such as the `time_grains` associated with a dimension, the default timestamp associated with an entity, or the list of dimensions for that entity. Adding this kind of inheritance behavior into models (ie representing semantic concepts within the model config) begins to really blur that line between defined interface and implementation detail, which overloads the model config per @cafzal 's comment above.

This workstream/proposal is really about creating the building blocks that will enable the functionality of Tomorrow™️, such as joins.
Models/Entities Are 1:1?
This is a great callout! When we say loosely coupled, we’re referring to the fact that the relationships can be swapped/detached without impacting either of the nodes in question. IE, this loose coupling allows users to detach any semantic model and move it over to any new/edited implementation.
This is a totally reasonable hesitancy and part of the feedback we’re hoping to get from commenters such as yourself. We feel reasonably confident that an entity as its own first-class representation inside of dbt is powerful because of the workflows it could enable outside of the AE workflow (more integration partners, easier interfaces for business users to add to the project, etc).
Obviously this comes with a degree of additional complexity that we believe is worth the tradeoff. What we’d love to get feedback on is ways to improve this developer experience - properties, behaviors, etc etc. What are some potential ways that this concept can fit more easily into your workflow?
Moving Along To Concrete Properties - Specifics Around Datatype!
Our vision here was that data type would be an entirely metadata property that could be provided to BI tools, but we’re very open to admitting we’re wrong on this one! The problem we’re attempting to resolve is that column/dimension data types are not introspected from the db as part of the `manifest`, so in order to surface them up to our integration partners we have to rely on the users running some command that generates the `catalog`, such as `dbt docs generate`. The hope here is that we provide multiple places for the user to configure data types (model, entity, etc.) so that the integration partners in question aren’t as reliant on a user generating the catalog.

I am willing to be convinced that this is really a property of the implementation detail (ie model config) and not the declared interface, even if it somewhat diverges from some of the API design principles that we’re trying to learn from. Especially with the work being done around constraints inside core, this feels like a reasonable thing to push down to the logical layer.
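In other words, an integration could prefer a data type declared on the entity and only fall back to the catalog when one exists. A hypothetical lookup with simplified artifact shapes (neither dict layout is dbt's actual schema):

```python
def resolve_data_type(dimension, entity_dims, catalog_columns=None):
    """Prefer the data_type declared on the entity; fall back to the
    catalog, which only exists after a command like `dbt docs generate`.

    entity_dims: {dimension_name: {"data_type": ...}} from the entity spec.
    catalog_columns: {column_name: {"type": ...}} from a generated catalog,
    or None when no catalog has been produced.
    """
    declared = entity_dims.get(dimension, {}).get("data_type")
    if declared:
        return declared
    if catalog_columns:
        return catalog_columns.get(dimension, {}).get("type")
    return None
```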
@olivierdupuis Thanks for your comment. You’re on 🎯 about entities as semantic building blocks.
The main advantage of structuring this design around entities is the ability to detach an entity from the logical layer (models) and re-attach it to any new implementation. This unlocks interesting possibilities by allowing data teams to map sources to entities and gradually standardize/automate metric and even entity definition (in line with Abhi’s vision). You can see this trend toward (semi-)automation with the dbt metric packages that Fivetran and Houseware released.
We recognize @PedramNavid 's point about entities adding more complexity for analytics engineers to keep track of. There’s a definite tradeoff to adding a new layer to the dbt project. But we’ve weighed the options: it doesn’t make sense to overload models, and while treating metrics as first-class objects helps with lineage, modularity, etc., defining them is too inefficient right now (speaking of having to keep track of columns…). We’ll keep iterating on the developer experience to reduce the overhead of managing different assets.
Introducing entities will help unlock metric definition at scale, lower the barrier of entry for data consumers to get involved (both in entity definition and analysis), and start paving this path to standardized packages.