Best practice for workflows
Highlighting more best practices to keep in mind when building your dbt projects and familiarising ourselves with more advanced functionalities.
Part of the “Mastering dbt” series. Access to the full Study Guide. Let’s connect on LinkedIn!
Notes from this dbt documentation.
As we continue to dive deeper into the world of the dbt, we are going to explore another set of best practices to consider when building your projects.
Some of the functionality covered here is yet to be covered by this Study Guide, but it is never too soon to start familiarising ourselves with these concepts.
I have taken the liberty of re-categorising these items from the original post for more clarity.
Sources
Use the ref function
Allows dbt to infer dependencies that affect the order in which models are built.
Ensures that the tables and views referenced are in the same environment.
Limit references to raw data
Keep the direct link to the source data in only one place so, in case anything changes in the source, there are fewer amendments to make.
Use the source function instead of direct linking.
Layering
Separate source-centric and business-centric transformations
Staging should be your source-centric layer where you would transform the sources into consistent structures.
From intermediate onwards, you should apply business logic and represent entities and processes relevant to your business.
Rename and recast fields once
The first layer of transformations should be:
select from only one source.
rename fields and tables to fit naming conventions.
recast fields to the correct data type.
All the downstream models should be built upon this clean model.
Break complex models into smaller pieces
It is a good idea to break up complex models when:
A CTE is duplicated across two models. Place this CTE in a separate model to be referenced by other models and avoid duplicate code.
A CTE changes the grain of the data. The grain of the data is what one record (each row) represents.
A model with too many lines. The recommendation is to keep the length at around 100 lines max.
Group your models in directories
Use directories to make it easier to configure groups of models, run subsections of the DAG, clarify modeling steps, and create bespoke conventions.
Consider the information architecture of your data warehouse
Use custom schemas to hide or show models as needed. For instance, create a separate schema where intermediate models are not shown.
Generally, the custom schemas align with the model directories.
Use prefixes to clarify the relations (stg_, fct_, dim_).
Coding & Conventions
Use a style guide
SQL styles, naming conventions and other rules that should be documented, especially if multiple users are writing code.
Manage whitespace generated by Jinja
Sometimes, when using jinja or macros, you may end up with unwanted whitespaces in your compiled code. You should edit your jinja/macro to manage that.
Materialisations
Choose your materialisations wisely
Use views by default - they are faster to build but slower to query.
Use tables for:
models that are queried by BI tools.
models that have multiple descendants.
Use ephemeral materialisations for:
models that only do lightweight transformations that won’t be displayed to end-users.
Use incremental models for:
situations when tables are taking too long to be built - incremental models improve build time but add complexity to the tables.
Version control
Version control your dbt project
Usage of git branches to develop new features and bug fixes.
Peer review of code changes in a Pull Request.
Use grants to manage privileges on objects
The grants config allows you to apply permissions to dbt objects in a way that you can version control and duplicate them.
Environments
Use separate developments and production environments
Usage of targets defined in a profiles.yml file to separate production from development.
Limit the data processed when in development by:
Using the target name:
Using the environment variable DBT_CLOUD_INVOCATION_CONTEXT that produces the possible values of prod, dev, staging, and ci to determine the environment.
Testing
Add tests to your models
Use tests to ensure that the data is as expected.
At a minimum, every model should be tested for uniqueness and non-null of the primary key.