Dbt
Introduction
Dbt is an open-source data transformation framework written in python. It builds and manages a directed acyclic graph ("DAG") of individual SQL SELECT
statement files (called "models"). The output of each model file can be used as an input to any number of further models, creating the edges of the DAG. The dbt framework checks the syntax and referential consistency of the individual model files and the collective DAG. Each model file performs a transformation of upstream data which dbt will preserve in the target DBMS as a table, transient table, view, or materialized view. The output of the models thereby become available for queries by external data analysis or production software.
Getting Started
In order to work with dbt, you will need to have a Python environment with the appropriate dbt package installed. The recommended way to do this is to create a Python 3 virtual environment (or reuse an existing Python 3 virtual environment), and install the appropriate dbt packages with pip
. Because the data warehouse used by Minds is Snowflake, install the dbt-snowflake
python package as follows:
$ python3 -m venv ~/dbtenv
$ . ~/dbtenv/activate
(dbtenv) $ pip install dbt-snowflake
To run the dbt project, you will need a profiles.yml
(usually located in your ~/.dbt
directory, depending on platform) configured with the database access details, as follows:
minds_analytics:
target: dev
outputs:
dev:
type: snowflake
account: zja19259.us-east-1
user: <DEV_USERNAME>
password: <DEV_PASSWD>
role: ETL_ROLE
warehouse: SUPERCOMPUTE_WH
database: MINDS_ANALYTICS
schema: <DEV_SCHEMA_NAME>
threads: 6
prod:
type: snowflake
account: zja19259.us-east-1
user: minds_etl
password: <PROD_PASSWD>
role: ETL_ROLE
warehouse: SUPERCOMPUTE_WH
database: MINDS_ANALYTICS
schema: dbt_prod
threads: 6
You can confirm that your dbt installation is working correctly by changing the current directory into the ./dbt
subdirectory of the minds-etl
repository (the same directory as the dbt_project.yml
file), and running the dbt debug
command, which should produce output indicating success, as follows:
(dbtenv) $ dbt debug
10:38:18 Running with dbt=1.6.1
10:38:18 dbt version: 1.6.1
10:38:18 python version: 3.11.4
10:38:18 python path: /home/XXXXXX/dbtenv/bin/python3
10:38:18 os info: Linux-6.2.0-31-generic-x86_64-with-glibc2.37
10:38:18 Using profiles dir at /home/XXXXXX/.dbt
10:38:18 Using profiles.yml file at /home/XXXXXX/.dbt/profiles.yml
10:38:18 Using dbt_project.yml file at /home/XXXXXX/Minds/git/minds-etl/dbt/dbt_project.yml
10:38:18 adapter type: snowflake
10:38:18 adapter version: 1.6.1
10:38:18 Configuration:
10:38:18 profiles.yml file [OK found and valid]
10:38:18 dbt_project.yml file [OK found and valid]
10:38:18 Required dependencies:
10:38:18 - git [OK found]
10:38:18 Connection:
10:38:18 account: zja19259.us-east-1
10:38:18 user: XXXXXX
10:38:18 database: MINDS_ANALYTICS
10:38:18 warehouse: SUPERCOMPUTE_WH
10:38:18 role: ETL_ROLE
10:38:18 schema: dbt_XXXXXX
10:38:18 authenticator: None
10:38:18 private_key: None
10:38:18 private_key_path: None
10:38:18 private_key_passphrase: None
10:38:18 token: None
10:38:18 oauth_client_id: None
10:38:18 query_tag: None
10:38:18 client_session_keep_alive: False
10:38:18 host: None
10:38:18 port: None
10:38:18 proxy_host: None
10:38:18 proxy_port: None
10:38:18 protocol: None
10:38:18 connect_retries: 1
10:38:18 connect_timeout: None
10:38:18 retry_on_database_errors: False
10:38:18 retry_all: False
10:38:18 insecure_mode: False
10:38:18 reuse_connections: None
10:38:18 Registered adapter: snowflake=1.6.1
10:38:19 Connection test: [OK connection ok]
10:38:19 All checks passed!
If dbt debug
does not indicate success, use the output to diagnose and correct potential configuration, user credentials, and network connectivity issues.
Dbt Docs
Dbt has a built-in self-documenting system. It will generate and serve a web interface which allows you to navigate the complete project. This is particularly useful for the graphical DAG view (shown below).
To generate documentation for the current state of the project, in the dbt
directory, run the dbt docs generate
command, as follows:
(dbtenv) $ $ dbt docs generate
10:51:01 Running with dbt=1.6.1
10:51:01 Registered adapter: snowflake=1.6.1
10:51:01 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.minds_analytics.utils
10:51:01 Found 80 models, 1 seed, 42 sources, 0 exposures, 0 metrics, 560 macros, 0 groups, 0 semantic models
10:51:01
10:51:02 Concurrency: 6 threads (target='dev')
10:51:02
10:51:03 Building catalog
10:51:09 Catalog written to /home/XXX/Minds/git/minds-etl/dbt/target/catalog.json
You can then navigate the generated documentation in a web browser by running the documentation server as follows:
(dbtenv) $ dbt docs serve --no-browser --port=8765
10:52:37 Running with dbt=1.6.1
Serving docs at 8765
To access from your browser, navigate to: http://localhost:8765
Press Ctrl+C to exit.
Load the indicated URL in a browser, and if everthing is successful, you should see a project overview page:
All the data sources and models can be navigated and viewed using the sidebar on the left.
In the lower right, corner, you will see a small circular blue button. This is the access to the graphical DAG view of the project. Click it, and you should see an interface as follows:
This shows the dependency connections between all the data sources and data models, flowing from left to right. This view can be panned and zoomed using the mouse, and the upstream and downstream dependencies can be highlighted by clicking on a model or source, as follows: