Skip to main content

Dbt

Introduction

Dbt is an open-source data transformation framework written in python. It builds and manages a directed acyclic graph ("DAG") of individual SQL SELECT statement files (called "models"). The output of each model file can be used as an input to any number of further models, creating the edges of the DAG. The dbt framework checks the syntax and referential consistency of the individual model files and the collective DAG. Each model file performs a transformation of upstream data which dbt will preserve in the target DBMS as a table, transient table, view, or materialized view. The output of the models thereby become available for queries by external data analysis or production software.

Getting Started

In order to work with dbt, you will need to have a Python environment with the appropriate dbt package installed. The recommended way to do this is to create a Python 3 virtual environment (or reuse an existing Python 3 virtual environment), and install the appropriate dbt packages with pip. Because the data warehouse used by Minds is Snowflake, install the dbt-snowflake python package as follows:

$ python3 -m venv ~/dbtenv
$ . ~/dbtenv/activate
(dbtenv) $ pip install dbt-snowflake

To run the dbt project, you will need a profiles.yml (usually located in your ~/.dbt directory, depending on platform) configured with the database access details, as follows:

minds_analytics:
target: dev
outputs:
dev:
type: snowflake
account: zja19259.us-east-1
user: <DEV_USERNAME>
password: <DEV_PASSWD>
role: ETL_ROLE
warehouse: SUPERCOMPUTE_WH
database: MINDS_ANALYTICS
schema: <DEV_SCHEMA_NAME>
threads: 6
prod:
type: snowflake
account: zja19259.us-east-1
user: minds_etl
password: <PROD_PASSWD>
role: ETL_ROLE
warehouse: SUPERCOMPUTE_WH
database: MINDS_ANALYTICS
schema: dbt_prod
threads: 6

You can confirm that your dbt installation is working correctly by changing the current directory into the ./dbt subdirectory of the minds-etl repository (the same directory as the dbt_project.yml file), and running the dbt debug command, which should produce output indicating success, as follows:

(dbtenv) $ dbt debug
10:38:18 Running with dbt=1.6.1
10:38:18 dbt version: 1.6.1
10:38:18 python version: 3.11.4
10:38:18 python path: /home/XXXXXX/dbtenv/bin/python3
10:38:18 os info: Linux-6.2.0-31-generic-x86_64-with-glibc2.37
10:38:18 Using profiles dir at /home/XXXXXX/.dbt
10:38:18 Using profiles.yml file at /home/XXXXXX/.dbt/profiles.yml
10:38:18 Using dbt_project.yml file at /home/XXXXXX/Minds/git/minds-etl/dbt/dbt_project.yml
10:38:18 adapter type: snowflake
10:38:18 adapter version: 1.6.1
10:38:18 Configuration:
10:38:18 profiles.yml file [OK found and valid]
10:38:18 dbt_project.yml file [OK found and valid]
10:38:18 Required dependencies:
10:38:18 - git [OK found]

10:38:18 Connection:
10:38:18 account: zja19259.us-east-1
10:38:18 user: XXXXXX
10:38:18 database: MINDS_ANALYTICS
10:38:18 warehouse: SUPERCOMPUTE_WH
10:38:18 role: ETL_ROLE
10:38:18 schema: dbt_XXXXXX
10:38:18 authenticator: None
10:38:18 private_key: None
10:38:18 private_key_path: None
10:38:18 private_key_passphrase: None
10:38:18 token: None
10:38:18 oauth_client_id: None
10:38:18 query_tag: None
10:38:18 client_session_keep_alive: False
10:38:18 host: None
10:38:18 port: None
10:38:18 proxy_host: None
10:38:18 proxy_port: None
10:38:18 protocol: None
10:38:18 connect_retries: 1
10:38:18 connect_timeout: None
10:38:18 retry_on_database_errors: False
10:38:18 retry_all: False
10:38:18 insecure_mode: False
10:38:18 reuse_connections: None
10:38:18 Registered adapter: snowflake=1.6.1
10:38:19 Connection test: [OK connection ok]

10:38:19 All checks passed!

If dbt debug does not indicate success, use the output to diagnose and correct potential configuration, user credentials, and network connectivity issues.

Dbt Docs

Dbt has a built-in self-documenting system. It will generate and serve a web interface which allows you to navigate the complete project. This is particularly useful for the graphical DAG view (shown below).

To generate documentation for the current state of the project, in the dbt directory, run the dbt docs generate command, as follows:

(dbtenv) $ $ dbt docs generate
10:51:01 Running with dbt=1.6.1
10:51:01 Registered adapter: snowflake=1.6.1
10:51:01 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.minds_analytics.utils
10:51:01 Found 80 models, 1 seed, 42 sources, 0 exposures, 0 metrics, 560 macros, 0 groups, 0 semantic models
10:51:01
10:51:02 Concurrency: 6 threads (target='dev')
10:51:02
10:51:03 Building catalog
10:51:09 Catalog written to /home/XXX/Minds/git/minds-etl/dbt/target/catalog.json

You can then navigate the generated documentation in a web browser by running the documentation server as follows:

(dbtenv) $ dbt docs serve --no-browser --port=8765
10:52:37 Running with dbt=1.6.1
Serving docs at 8765
To access from your browser, navigate to: http://localhost:8765



Press Ctrl+C to exit.

Load the indicated URL in a browser, and if everthing is successful, you should see a project overview page:

Dbt project documentation

All the data sources and models can be navigated and viewed using the sidebar on the left.

In the lower right, corner, you will see a small circular blue button. This is the access to the graphical DAG view of the project. Click it, and you should see an interface as follows:

Dbt DAG view, full

This shows the dependency connections between all the data sources and data models, flowing from left to right. This view can be panned and zoomed using the mouse, and the upstream and downstream dependencies can be highlighted by clicking on a model or source, as follows:

Dbt DAG view, zoomed

Running Dbt

Compiled files

Logs