Optimize the dbt Doc Function with a CI | by Massimo Capobianco

I have already written about the doc function in dbt and how it helps create consistent and accurate documentation across your entire dbt project (see this). In short, you can store the descriptions of the most common or important columns used in your project's data models by adding them to the docs.md files, which live in the docs folder of your dbt project.

A very simple example of an orders.md file that contains the descriptions for the most common order-related column names:

# Fields description
## order_id
{% docs orders__order_id %}
Unique alphanumeric identifier of the order, used to join all order dimension tables
{% enddocs %}
## order_country
{% docs orders__order_country %}
Country where the order was placed. Format is country ISO 3166 code.
{% enddocs %}
## order_value
{% docs orders__order_value %}
Total value of the order in local currency.
{% enddocs %}
## order_date
{% docs orders__order_date %}
Date of the order in local timezone
{% enddocs %}

And its usage in the .yml file of a model:

columns:
  - name: order_id
    description: '{{ doc("orders__order_id") }}'

When the dbt docs are generated, the description of order_id will always be the same, as long as the doc function is used in the .yml file of the model. The benefit of this centralized documentation is clear: each description is written once and reused everywhere.
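To make the idea concrete, here is a minimal sketch of why every model that calls the doc function ends up with the same text. It is a simplified stand-in written for illustration, not dbt's actual implementation; `load_docs` and `render_description` are hypothetical helper names.

```python
import re

# Hypothetical, simplified stand-in for dbt's doc resolution (not dbt's real code).
DOCS_MD = """
{% docs orders__order_id %}
Unique alphanumeric identifier of the order, used to join all order dimension tables
{% enddocs %}
"""

# Captures the block name and everything up to the matching enddocs tag.
DOC_BLOCK = re.compile(r"{%\s*docs\s+(\w+)\s*%}(.*?){%\s*enddocs\s*%}", re.DOTALL)
# Matches {{ doc("name") }} or {{ doc('name') }}.
DOC_CALL = re.compile(r"""{{\s*doc\(["'](\w+)["']\)\s*}}""")


def load_docs(markdown: str) -> dict:
    """Map each doc block name to its stripped body."""
    return {name: body.strip() for name, body in DOC_BLOCK.findall(markdown)}


def render_description(description: str, docs: dict) -> str:
    """Replace every doc call with the body of the referenced doc block."""
    return DOC_CALL.sub(lambda m: docs[m.group(1)], description)


docs = load_docs(DOCS_MD)
# Two different models referencing the same doc block.
model_a = render_description('{{ doc("orders__order_id") }}', docs)
model_b = render_description("{{ doc('orders__order_id') }}", docs)
print(model_a == model_b)
```

Both descriptions resolve to the single block stored in the docs file, so a change to that block propagates to every model that references it.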

The problem

However, especially in large projects with frequent changes (new models, or changes to existing ones), it's likely that the repository's contributors will either forget to use the doc function or not be aware that a specific column has been added to the docs folder. This has two consequences:

– someone has to catch this during PR review and request a change, assuming there's at least one reviewer who either knows all the documented columns by heart or always checks the docs folder manually
– if a miss can easily go unnoticed and catching it relies on individuals, the setup defeats the purpose of having centralized documentation.

The answer

The simple answer to this problem is a CI (continuous integration) check that combines a GitHub workflow with a Python script. This check fails if:

– the changes in the PR affect a .yml file that contains a column name present in the docs, but the doc function isn't used for that column
– the changes in the PR affect a .yml file that contains a column name present in the docs, but that column has no description at all
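The two failure conditions boil down to one rule: a documented column must use the doc function. A minimal sketch of that rule follows; `check_column` and the `documented_columns` set are hypothetical names used only for illustration, not part of the script discussed in this article.

```python
import re

# Matches {{ doc("name") }} or {{ doc('name') }} inside a description.
DOC_CALL = re.compile(r"""{{\s*doc\(["']([^"']+)["']\)\s*}}""")


def check_column(column_name, description, documented_columns):
    """Return 'fail' when a documented column does not use the doc function."""
    description = description or ""  # a missing description counts as empty
    uses_doc = bool(DOC_CALL.search(description))
    if column_name in documented_columns and not uses_doc:
        return "fail"
    return "pass"


# Hypothetical set of column names that have a block in the docs folder.
documented_columns = {"order_id", "order_country"}

results = {
    "free text": check_column("order_country", "The country of the order.", documented_columns),
    "no description": check_column("order_country", None, documented_columns),
    "doc function": check_column("order_id", '{{ doc("orders__order_id") }}', documented_columns),
    "undocumented": check_column("order_address", "Some address.", documented_columns),
}
print(results)
```

Columns that are not in the docs at all pass, so the check only enforces reuse of descriptions that already exist.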

Let's take a closer look at the code and files required to run this check, and at a couple of examples. As previously mentioned, there are two things to consider: (1) a .yml file for the workflow and (2) a Python file for the actual validation check.

(1) This is what the validation_docs file looks like. It's located in the .github/workflows folder.

name: Validate Documentation
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  validate_docs:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Install PyYAML
        run: pip install pyyaml
      - name: Run validation script
        run: python validate_docs.py

The workflow runs every time a pull request is opened or reopened, and every time a new commit is pushed to the remote branch. There are basically three steps: retrieving the repository's files for the current pull request, installing the dependencies, and running the validation script.
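The validation script only needs to look at the YAML files touched by the pull request. Below is a minimal sketch of that filtering step, assuming the changed paths arrive as the text output of `git diff --name-only` (one path per line). The function name and the sample paths are hypothetical. The `fetch-depth: 0` setting matters here, presumably because a comparison against the main branch needs more history than the single commit a default checkout fetches.

```python
def yaml_files_from_diff(diff_output: str) -> list:
    """Keep only the .yml/.yaml paths from `git diff --name-only` output."""
    paths = [line.strip() for line in diff_output.splitlines()]
    return [p for p in paths if p.endswith((".yml", ".yaml"))]


# Hypothetical diff output for a pull request touching three files.
sample_output = (
    "models/ref_country_orders.yml\n"
    "docs/orders.md\n"
    "models/ref_country_orders.sql\n"
)

changed_yaml = yaml_files_from_diff(sample_output)
print(changed_yaml)
print(yaml_files_from_diff(""))  # an empty diff yields an empty list
```

Using `splitlines()` keeps an empty diff from producing a stray empty path.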

(2) Then comes the validate_docs.py script, located in the root folder of your dbt project repository, which looks like this:

import os
import sys
import yaml
import re
from glob import glob
from pathlib import Path
import subprocess


def get_changed_files():
    # Files changed on this branch since it diverged from origin/main
    diff_command = ['git', 'diff', '--name-only', 'origin/main...']
    result = subprocess.run(diff_command, capture_output=True, text=True)
    changed_files = result.stdout.strip().split('\n')
    return changed_files


def extract_doc_names():
    # Collect the name of every {% docs <name> %} block under docs/
    doc_names = set()
    md_files = glob('docs/**/*.md', recursive=True)
    doc_pattern = re.compile(r'{%\s*docs\s+([^\s%]+)\s*%}')
    for md_file in md_files:
        with open(md_file, 'r') as f:
            content = f.read()
            matches = doc_pattern.findall(content)
            doc_names.update(matches)
    print(f"Extracted doc names: {doc_names}")
    return doc_names


def parse_yaml_file(yaml_path):
    with open(yaml_path, 'r') as f:
        try:
            return list(yaml.safe_load_all(f))
        except yaml.YAMLError as exc:
            print(f"Error parsing YAML file {yaml_path}: {exc}")
            return []


def validate_columns(columns, doc_names, errors, model_name):
    for column in columns:
        column_name = column.get('name')
        # "or ''" also covers a description key that is present but empty
        description = column.get('description') or ''
        print(f"Validating column '{column_name}' in model '{model_name}'")
        print(f"Description: '{description}'")
        doc_usage = re.findall(r'{{\s*doc\(["\']([^"\']+)["\']\)\s*}}', description)
        print(f"Doc usage found: {doc_usage}")
        if doc_usage:
            # Every referenced doc block must exist in the docs folder
            for doc_name in doc_usage:
                if doc_name not in doc_names:
                    errors.append(
                        f"Column '{column_name}' in model '{model_name}' references undefined doc '{doc_name}'."
                    )
        else:
            # No doc() call: look for a doc block named <something>__<column_name>
            matching_docs = [dn for dn in doc_names if dn.endswith(f"__{column_name}")]
            if matching_docs:
                suggested_doc = matching_docs[0]
                errors.append(
                    f"Column '{column_name}' in model '{model_name}' should use '{{{{ doc(\"{suggested_doc}\") }}}}' in its description."
                )
            else:
                print(f"No matching doc found for column '{column_name}'")


def main():
    changed_files = get_changed_files()
    yaml_files = [f for f in changed_files if f.endswith('.yml') or f.endswith('.yaml')]
    doc_names = extract_doc_names()
    errors = []
    for yaml_file in yaml_files:
        # Skip files that were deleted in the pull request
        if not os.path.exists(yaml_file):
            continue
        yaml_content = parse_yaml_file(yaml_file)
        for item in yaml_content:
            if not isinstance(item, dict):
                continue
            models = item.get('models') or item.get('sources')
            if not models:
                continue
            for model in models:
                model_name = model.get('name')
                columns = model.get('columns', [])
                validate_columns(columns, doc_names, errors, model_name)
    if errors:
        print("Documentation validation failed with the following errors:")
        for error in errors:
            print(f"- {error}")
        sys.exit(1)
    else:
        print("All documentation validations passed.")


if __name__ == "__main__":
    main()

Let's summarize the steps in the script:

– it lists all files that have been changed in the pull request compared to origin/main.
– it looks through all Markdown (.md) files within the docs folder (including subdirectories) and searches for documentation blocks ({% docs ... %}) using a regex. Every time it finds one, it extracts the doc name and adds it to a set of doc names.
– for each changed .yml file, the script opens and parses it using yaml.safe_load_all. This converts the .yml content into Python dictionaries (or lists) for easy analysis.
– validate_columns: for each column defined in the .yml files, it checks the description field to see if it includes a {{ doc() }} reference. If references are found, it verifies that each referenced doc name actually exists in the set of doc names extracted earlier. If not, it reports an error. If no doc references are found, it checks whether there is a doc block that matches this column's name. Note that here we're using a naming convention like doc_block__column_name. If such a block exists, it suggests that the column description should reference this doc.
– any problems (missing doc references, non-existent referenced docs) are recorded as errors. If there is at least one error, the script exits with code 1, which makes the CI check fail.
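The naming convention is what links a column to its doc block, so it is worth seeing how the suffix match behaves. The sketch below is a small variation, not the script's exact logic: it sorts the matches so the suggestion is deterministic when several blocks share a suffix. The `customers__order_id` block is a hypothetical addition for the sake of the example.

```python
def matching_docs(column_name, doc_names):
    """Return the doc names following the <prefix>__<column_name> convention, sorted."""
    suffix = f"__{column_name}"
    return sorted(name for name in doc_names if name.endswith(suffix))


# 'customers__order_id' is a hypothetical block, added to show an ambiguous match.
doc_names = {"orders__order_id", "customers__order_id", "orders__order_country"}

ambiguous = matching_docs("order_id", doc_names)
unique = matching_docs("order_country", doc_names)
none = matching_docs("order_address", doc_names)
print(ambiguous, unique, none)
```

Because doc names are collected in a set, taking the first match without sorting could suggest a different block from one run to the next whenever two prefixes document the same column name.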

Examples

Now, let's take a look at the CI in action. Given the orders.md file shared at the beginning of the article, we push to remote a commit that contains the ref_country_orders.yml file:

version: 2
models:
  - name: ref_country_orders
    description: >
      This model filters orders from the staging orders table to include only those with an order date on or after January 1, 2020.
      It includes information such as the order ID, order country, order value, and order date.
    columns:
      - name: order_id
        description: '{{ doc("orders__order_id") }}'
      - name: order_country
        description: The country where the order was placed.
      - name: order_value
        description: The value of the order.
      - name: order_address
        description: The address where the order was placed.
      - name: order_date

The CI has failed. Clicking on Details takes us to the CI log, where we see this:

Validating column 'order_id' in model 'ref_country_orders'
Description: '{{ doc("orders__order_id") }}'
Doc usage found: ['orders__order_id']
Validating column 'order_country' in model 'ref_country_orders'
Description: 'The country where the order was placed.'
Doc usage found: []
Validating column 'order_value' in model 'ref_country_orders'
Description: 'The value of the order.'
Doc usage found: []
Validating column 'order_address' in model 'ref_country_orders'
Description: 'The address where the order was placed.'
Doc usage found: []
No matching doc found for column 'order_address'
Validating column 'order_date' in model 'ref_country_orders'
Description: ''
Doc usage found: []

Let's analyze the log:
– for the order_id column, it found the doc usage in its description.
– the order_address column isn't found in the docs file, so it returns "No matching doc found for column 'order_address'".
– for order_value and order_country, it knows that they're listed in the docs, but the doc usage is empty. The same goes for order_date, and note that for this one we didn't even add a description line.

All good so far. But let's keep looking at the log:

Documentation validation failed with the following errors:
- Column 'order_country' in model 'ref_country_orders' should use '{{ doc("orders__order_country") }}' in its description.
- Column 'order_value' in model 'ref_country_orders' should use '{{ doc("orders__order_value") }}' in its description.
- Column 'order_date' in model 'ref_country_orders' should use '{{ doc("orders__order_date") }}' in its description.
Error: Process completed with exit code 1.

Since order_country, order_value, and order_date are in the docs file but the doc function isn't used for them, the CI raises an error. It also suggests the exact value to add in the description, which makes it extremely easy for the PR author to copy-paste the correct description value from the CI log into the .yml file.
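The copy-pasteable suggestion depends on printing literal double braces from an f-string, which is easy to get wrong. Here is a minimal sketch of how a message of this shape can be built; `suggestion` is a hypothetical helper name used only for illustration.

```python
def suggestion(column_name, model_name, doc_name):
    """Build an error message containing a ready-to-paste doc() call."""
    # In an f-string, '{{' renders as one literal '{', so '{{{{' renders as '{{'.
    return (
        f"Column '{column_name}' in model '{model_name}' should use "
        f"'{{{{ doc(\"{doc_name}\") }}}}' in its description."
    )


message = suggestion("order_country", "ref_country_orders", "orders__order_country")
print(message)
```

The text between the single quotes in the message is valid as a description value, so it can be pasted into the .yml file unchanged.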

After pushing the new changes, the CI check was successful and the log now looks like this:

Validating column 'order_id' in model 'ref_country_orders'
Description: '{{ doc("orders__order_id") }}'
Doc usage found: ['orders__order_id']
Validating column 'order_country' in model 'ref_country_orders'
Description: '{{ doc("orders__order_country") }}'
Doc usage found: ['orders__order_country']
Validating column 'order_value' in model 'ref_country_orders'
Description: '{{ doc("orders__order_value") }}'
Doc usage found: ['orders__order_value']
Validating column 'order_address' in model 'ref_country_orders'
Description: 'The address where the order was placed.'
Doc usage found: []
No matching doc found for column 'order_address'
Validating column 'order_date' in model 'ref_country_orders'
Description: '{{ doc("orders__order_date") }}'
Doc usage found: ['orders__order_date']
All documentation validations passed.

For the order_address column, the log shows that no matching doc was found. However, that's fine and doesn't cause the CI to fail, since adding that column to the docs file isn't our intention for this demonstration. Meanwhile, the rest of the columns are listed in the docs and are correctly using the {{ doc() }} function.

Optimize the dbt Doc Function with a CI | by Massimo Capobianco | Jan, 2025
