Practice Project: Checkpoint 3

Refactoring an old script that generates invoices to our Finance department.

Oct 07, 2025

Part of the “Mastering dbt” series. Access to the full Study Guide. Let’s connect on LinkedIn!

In the previous Checkpoint, we made our project more interesting by adding additional models and configurations. Now, we’re going to practice making legacy SQL code modular, following the Refactoring principles.

In the documentation, there is a very basic example of a model that barely disturbs the overall structure of the DAG. Our example is more realistic, though.

I created a confusing and repetitive legacy script with bits of incorrect logic and unnecessary functions. This script was used to create invoices that apply discounts and VAT on top of the gross value.

1. Migrating the legacy code

Create a file in the project for the legacy code and work off a copy.

The legacy code and the table it produces.

2. Implementing dbt sources

Replace direct references with the sources function.

We already have our sources defined in a sources.yml file, so we can skip that part and go straight into replacing the direct references.

Using the sources function denotes the sources in the DAG and makes the code easier to adjust in case of name changes.

3. Implementing cosmetic cleanup and CTE groupings

Cosmetic cleanup

First, let’s start with the cosmetic cleanup. This entails adding whitespace, breaking up long lines of code, lowercasing SQL and any other actions necessary to make the code easier to understand at the top level.

This is a tedious step, but it helps a lot to decode a messy script. I used the Format function in the IDE that automatically applies some formatting and made my own changes according to what helps me understand the code.

CTE groupings

Next, we are going to apply the CTE groupings: Import CTEs, Logic CTEs, and Final CTEs. At this stage, we are not fixing logic because we want the final output to look just like the legacy one, so we can audit our changes.

Import CTEs

For the Import CTEs, I had to select from staging and intermediate models, which could potentially break DAG principles once I have layered everything. But at this stage, I just want the output to work and match the legacy table. Finally, you will need to replace column names throughout.

Logic CTEs

Now, we isolate the nested select statements in their individual CTEs, starting from the innermost one. In this process, we can also simplify the code and remove unnecessary parts.

This was the most labourious part, especially because I noticed a logic mistake in one of my existing models.

Logic CTEs containing the nested selects and other necessary steps.

Final CTE

The Final CTE puts it all together, emulating the legacy table.

4. Applying standardised layering

At this stage, you should have one long script and possibly some transformations that were already available elsewhere in the project. This would be ready for auditing, although not very DRY or respecting dbt’s best practices for DAGs.

So, we need to spread this code across our DAG in accordance with modularity principles. I did this in 2 steps:

Addressing reusability:
Move parts of the code to either existing models or update the reference when that transformation already existed elsewhere. Create new models for new aggregations following layering principles. The objective is to leave the marts model free of complex transformations or source references.
Blending the new model(s) into the DAG following the best practices.
For example, ensuring that a marts model isn’t pulling from intermediate and staging models and avoiding bends between intermediate models.

Addressing reusability

Firstly, make sure to work on a copy of the refactored code.

I was able to replace some references with transformations that already existed upstream. However, it was necessary to create two new models for new aggregations: the “int_order_totals” model, which aggregates order totals, and the “int_invoice_totals” which contains discount logic.

The DAG for the newly created “fct_invoice” model after moving bits of code.

This works, however, we’re breaking a few DAG best practices.

a marts model pulling from staging and intermediate models
“int_invoice_totals” is pulling from two intermediate models.

Blending new models following DAG best practices

I ended up changing the entire structure of the project to make it adhere to best practices.

Essentially, we ended up with two intermediate layers: the first is joining company, employee, and orders to each other in each of their grains, while the second is creating aggregations for orders.

Due to these changes, the “inv_invoice_totals” became obsolete, and the DAG looks much more streamlined.

The DAG just before auditing. There could still be improvements, like pulling the dim tables from the newly created intermediate models. But that’s a job for later!

Finally, the final selection became a “marts” model in the “finance” folder.

5. Auditing

You can use the audit helper package to view differences between the legacy and the new model. There are a ton of helpful macros that speed up this process, but I only used “compare_all_columns” and then created my own query to compare the columns that showed differences.

I had two main issues with my code:

The legacy code had all invoice metrics as numeric, while the new code had them as floats. This was causing rounding issues. It took me a long time to figure this one out.
There was an incorrect calculation of VAT in the legacy code, as it assumed each order only contained one product. In that case, I didn’t change the new code, of course, but ensured that that was what was causing the discrepancies.

6. Extra step: add configs to the new models

Thanks to the addition of two new models (fct_invoices and int_order_totals), we now have to make sure they are configured in the properties.yml and dbt_project.yml file.

Andrea Leonel - Data Analysis & Analytics Engineering

Discussion about this post

Ready for more?