Untangling a messy DAG
This study note covers basic concepts related to the DAG and the steps to consider when optimising a messy DAG.
Part of the “Mastering dbt” series.
Notes from this dbt blog post and this video.
We covered modularity principles in the previous Checkpoint and had a flavour of layering concepts, which transformations belong to each layer, and what a performant DAG looks like.
This study note will cover some basic concepts of DAGs and then go into how to audit a DAG. What are the key factors we should consider when trying to improve our DAGs? Is there such a thing as a perfect DAG?
Ideally, we should create modular DAGs from the start, but in the real world, you will often work on already existing projects, so it’s important to learn how to untangle a messy DAG.
But first… let’s cover some DAG basics.
Basic DAG concepts
Firstly, the acronym DAG stands for Directed Acyclic Graph: a graph in which every relationship between two nodes has a direction and in which following those directions never leads back to the starting node, so there are no loops. This concept originated in mathematics but is now widely used in data engineering.
If we zoom into a particular node (or in our case, model) of our DAG, we can identify the upstream and downstream models.
Upstream (left): models that must be built before the current model.
Downstream (right): models that depend on the current model and are built after it.
Within the data lineage, we also talk about dependencies, which refer to the relationships between different models, packages, or resources that determine the order in which things are built or executed. These dependencies are crucial for ensuring that your data transformations happen in the correct sequence.
For instance, in the example lineage below, we can identify some dependencies:

stg_users and stg_user_groups are the parent models of int_users
stg_users and stg_user_groups are being joined to form int_users
stg_orgs and int_users are parent models to dim_users
dim_users is downstream from all the other models.
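The relationships listed above can be sketched in a few lines of Python. This is only an illustration of the upstream and downstream concepts, not how dbt itself stores lineage. The dictionary layout and the helper names upstream and downstream are assumptions made for this example, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Direct parents of each model, copied from the example lineage.
# The dictionary layout is an assumption made for this sketch.
PARENTS = {
    "stg_users": set(),
    "stg_user_groups": set(),
    "stg_orgs": set(),
    "int_users": {"stg_users", "stg_user_groups"},
    "dim_users": {"stg_orgs", "int_users"},
}


def upstream(model, parents=PARENTS):
    """Return every model that must be built before `model`."""
    found = set()
    stack = list(parents[model])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found


def downstream(model, parents=PARENTS):
    """Return every model that depends on `model`, directly or indirectly."""
    return {m for m in parents if model in upstream(m, parents)}


# static_order() yields parents before children and raises
# graphlib.CycleError if the graph contains a loop (the "acyclic" part).
build_order = list(TopologicalSorter(PARENTS).static_order())

print(sorted(upstream("dim_users")))
print(sorted(downstream("stg_users")))
print(build_order)
```

Because dim_users has every other model upstream of it, it always comes last in the build order, whichever valid order the sorter picks for the staging models.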
Untangling a messy DAG
When we refer to a “messy DAG” we mean a DAG that doesn’t follow modularity principles. A messy DAG will often feature:
The same transformation being repeated in multiple models that pull data directly from the source.
A lack of clarity as to where a specific transformation or amendment should go in the DAG.
Poorly performing joins based on multiple columns.
Confusing relationships between models.
These features often result in long run times, scalability challenges, and a project that is hard to manage.
It’s important to highlight that there is no such thing as a perfect DAG. The business case and its requirements determine what the DAG will look like.
However, there are basic principles that should be followed, like clear layering and avoiding repeating code and poorly performing joins.
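One symptom mentioned above, several models pulling data directly from the same source, is easy to spot once the project is written down as data. The sketch below is a minimal illustration: the model names, the source names, and the SOURCE_READS mapping are all hypothetical, and in a real project this information would have to be extracted from the project files first.

```python
from collections import defaultdict

# Hypothetical project: model name -> raw sources it selects from directly.
SOURCE_READS = {
    "stg_users": ["raw.users"],
    "fct_orders": ["raw.orders", "raw.users"],
    "rpt_signups": ["raw.users"],
}


def sources_read_by_many(source_reads):
    """Return sources that more than one model reads directly."""
    readers = defaultdict(list)
    for model, sources in source_reads.items():
        for source in sources:
            readers[source].append(model)
    return {s: sorted(ms) for s, ms in readers.items() if len(ms) > 1}


repeated = sources_read_by_many(SOURCE_READS)
print(repeated)
```

Any source that shows up in the result is a candidate for a single staging model that the other models then build on.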
Factors to look out for when fixing a messy DAG
These are the foundations of a DAG audit:
Define model types and what they are responsible for (layering concepts defined here and here)
Ensure each model is either:
highly reusable, or
a major transformation in the data’s journey to the end model.
So, now, we are going to analyse the factors we should consider to transform the messy DAG below…

Into a modular, better-performing DAG:

Pick a flow to focus on. In real-life scenarios, a DAG may include dozens of models, which can look overwhelming. Pick a specific flow and start from there.
Look for sources that are joined and cleaned in the same model. Joining of sources should occur in a base layer and be separate from other transformations.
Identify which models are staging or intermediate.
Look for repeated joins of the same tables in multiple models: make that join an intermediate model that can be referred to by all these models.
Identify models whose names could be changed to add clarity.
Ensure every source has a staging model (1:1 relationship).
An interesting note on this: developers sometimes add a stg model that is just a select all, for instance when no cleaning is needed. This reinforces layering and makes it clear which layer should be built upon.
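Two of the steps above, looking for repeated joins and checking the 1:1 relationship between sources and staging models, can be roughed out as simple checks. The sketch below is only an illustration under stated assumptions: the project in MODEL_PARENTS is hypothetical, the "source:" prefix is an invented labelling convention, and two parents appearing together in a model is used as a stand-in for "these tables are joined", which will not always hold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical messy project: model -> direct parents.
# Names starting with "source:" stand for raw sources (assumed convention).
MODEL_PARENTS = {
    "stg_users": ["source:users"],
    "stg_orders": ["source:orders"],
    "fct_orders": ["stg_orders", "stg_users"],
    "rpt_revenue": ["stg_orders", "stg_users", "source:payments"],
}
SOURCES = ["source:users", "source:orders", "source:payments"]


def repeated_joins(model_parents):
    """Return pairs of parents that appear together in more than one model."""
    pairs = Counter()
    for parents in model_parents.values():
        for pair in combinations(sorted(set(parents)), 2):
            pairs[pair] += 1
    return {pair for pair, count in pairs.items() if count > 1}


def sources_without_single_staging(model_parents, sources):
    """Return sources that are not read by exactly one stg_ model."""
    flagged = []
    for source in sources:
        staging = [
            model
            for model, parents in model_parents.items()
            if model.startswith("stg_") and source in parents
        ]
        if len(staging) != 1:
            flagged.append(source)
    return flagged


print(repeated_joins(MODEL_PARENTS))
print(sources_without_single_staging(MODEL_PARENTS, SOURCES))
```

In this made-up project the pair stg_orders and stg_users is combined in two models, so it is a candidate for a shared intermediate model, and source:payments has no staging model of its own.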
