Minds dbt Project Documentation
Introduction
The Minds dbt project is structured as a series of transformation layers which start with raw data and progress toward data suitable for end use in business analytics and product features. The rough flow is as follows:
- Data sources 🡒 Data staging 🡒 Intermediate processing 🡒 Domain models 🡒 KPI models and presentation models
For models that require non-trivial incremental processing, there are also "checkpoint" models that freeze the state of an incremental model as of the last dbt run (and circumvent dbt's reference cycle detection).
Operations
To work with the Minds dbt project, first make sure you have your environment set up and running as explained in the dbt introduction here.
Bootstrap vs Incremental vs Full Rebuild
dbt has basic built-in support for incremental processing of models (i.e. the entire database table does not need to be rebuilt from scratch each time dbt runs). This is particularly important when the tables are very large and the processing to create them is complex and compute-intensive (which is the case for a number of models in the Minds project). However, the built-in support has certain limitations, particularly when the output of a model from a previous run is used as an upstream input to the same model in a subsequent incremental run (a circular dependency). This situation arises for a number of Minds models.
To process these models, the Minds dbt project manages the resulting circular dependencies through "checkpoint" models (found in the dbt/models/checkpoint folder) and conditional inclusion blocks in the models that need them for incremental processing.
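As a rough illustration of this pattern (not the project's actual code; the columns and the relationship between chk_last_sessions_full and stg_sessions shown here are assumptions based on the model names), a checkpoint model and the conditional block that consumes it might look like this:

-- Checkpoint model (e.g. dbt/models/checkpoint/chk_last_sessions_full.sql), sketch only.
-- Freezes the state of the incremental model as of the previous dbt run. Referencing the
-- table by schema-qualified name rather than with ref() keeps dbt's dependency graph
-- acyclic even though the incremental model reads from this checkpoint.
{{ config(materialized='table') }}
select user_id, session_number, last_tstamp
from {{ target.schema }}.stg_sessions

-- Incremental model (e.g. dbt/models/staging/stg_sessions.sql), sketch only.
{{ config(materialized='incremental') }}
select e.user_id, e.collector_tstamp
from {{ ref('stg_session_events') }} as e
{% if is_incremental() %}
-- On incremental runs only: pick up from the previous run's state by joining to the
-- checkpoint instead of to stg_sessions itself, which would be a circular reference.
left join {{ ref('chk_last_sessions_full') }} as chk
    on chk.user_id = e.user_id
where e.collector_tstamp > coalesce(chk.last_tstamp, '1970-01-01'::timestamp)
{% endif %}

The real models presumably also handle run ordering and the bootstrap case described below; this sketch only shows how the reference cycle is broken.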
One consequence is that a bootstrap process is required the first time a new (empty) schema is created from the dbt project. This creates the initial tables and checkpoints required to start incremental processing.
To "bootstrap", that is, to start a new schema (e.g. for a new developer), first run the following dbt command (this will take approximately 3 minutes to complete):
(dbtenv) $ dbt run --full-refresh --select \
    chk_last_contexts_full \
    chk_last_entity_contexts_full \
    chk_last_sessions_full \
    chk_network_user_mapping_full \
    chk_tstamp_interval_full \
    stg_session_contexts \
    stg_entity_contexts \
    stg_sessions \
    int_network_user_id_map \
    stg_session_events
This will create the database tables required for incremental processing, and will process the first 28 days of event data found in the snowplow_atomic.events source table. After this completes, subsequent incremental runs (the default behaviour of dbt run) will process up to 28 days' worth of additional event data from the snowplow_atomic.events table until the incremental tables in the new schema have caught up to the latest records in the snowplow_atomic.events table.
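Conceptually, the 28-day cap bounds each run's window of new events relative to the last checkpointed timestamp. A hypothetical filter illustrating the idea (the last_tstamp column in chk_tstamp_interval_full is an assumption, not taken from the project's code):

-- Hypothetical excerpt: process at most 28 days of new events per incremental run
{% if is_incremental() %}
where collector_tstamp > (select max(last_tstamp) from {{ ref('chk_tstamp_interval_full') }})
  and collector_tstamp <= (select dateadd('day', 28, max(last_tstamp))
                           from {{ ref('chk_tstamp_interval_full') }})
{% endif %}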
If at some point it becomes necessary to rebuild an existing developer schema from scratch, add the --full-refresh directive to the dbt command line as follows:
(dbtenv) $ dbt run --full-refresh
This will rebuild all tables and set the entire project back to the initial 28-day processing of Snowplow event data.
Developer vs Production
The default behavior of dbt is to run the project with the dev target, and process the tables in the corresponding schema configured in the profiles.yml file. If it becomes necessary to manually run the project (or an updated model) in the production schema (not recommended), this can be done by adding the -t prod target specification to the command line as follows:
(dbtenv) $ dbt run -t prod
This will update the tables in the production schema, dbt_prod, directly.
Transformation Layers
Each layer has its own directory under minds-etl/dbt/models.
Sources
The Minds dbt project has six data sources it uses as starting points for the data transformation model. Data from these sources is loaded into Snowflake via Prefect ETL flows prior to running the dbt project. The following files in the dbt/models/sources directory specify the names and locations of each source table used in the project:
- cassandra.yml: tables imported from Cassandra, including separate tables split out from the Cassandra entities table
- elasticsearch.yml: "engagement events" extracted from the ElasticSearch server
- mautic.yml: tables of email campaign data extracted via the Mautic API
- snowplow.yml: raw Snowplow event data (events) and processed contextual data for aggregation (event_contexts)
- stripe.yml: tables of financial transaction data via the Stripe API
- vitess.yml: tables extracted from the Vitess application database (e.g. boosts) that are used for analysis
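Within model SQL, these tables are referenced through dbt's source() function rather than by hard-coded names. For example, assuming snowplow.yml declares a source named snowplow_atomic:

-- Example of selecting from a declared source table
select event_id, collector_tstamp
from {{ source('snowplow_atomic', 'events') }}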
Staging
The staging models are, in most cases, lightly cleaned-up versions of the source data (deduplicated and/or tracked as slowly changing dimensions). A few models perform substantive transformations, however; these are documented in a separate section below.
Intermediate
The models in the intermediate layer perform a variety of utility and helper transformations for models in the staging and domain layers, and also for a number of Prefect flows (such as for making content recommendations).
Domain
The models in the domain layer represent a consolidated entity-relationship data model of the Minds application domain. The models in this layer are intended either to be queried directly, or to be combined, filtered, and aggregated into task-specific KPI and presentation models.
KPI
The models in the kpi layer are aggregations designed to support dashboards and other reporting of specific areas of business KPIs (e.g. "acquisition", "retention", etc.).
Presentation
The models in the presentation layer are composed from models in lower layers and restructured to facilitate use in interactive reporting and visualisation tools such as Superset.
Events, Sessions, and Aggregation
These staging models form the basis for aggregated processing of user interaction data and application events. The purpose of these models is to abstract the essential data from the fine-grained user-event data generated by Snowplow, so that the original fine-grained data can be discarded according to a data-retention policy without compromising the legitimate product feature and business analytics requirements for the aggregate data.
The models start with the raw snowplow_atomic.events records and the processed snowplow_atomic.event_contexts records, and build incremental models which provide for the foreseeable downstream data use cases. The models are as follows:
stg_session_events
This model is the starting point for user event aggregation. It uses as data sources the raw events from snowplow_atomic.events, the processed events from snowplow_atomic.event_contexts, and the int_network_user_id_map table, which maps a Snowplow network_userid to a native Minds application user_id. This model extracts the values that are relevant for composing records for the stg_sessions, stg_session_contexts, and stg_entity_contexts models.
The output fields of this model are as follows:
- event_id: the id of the original Snowplow event, as found in snowplow_atomic.events and snowplow_atomic.event_contexts
- collector_tstamp: the collector timestamp on the original Snowplow event
- user_id: the Minds native user_id if the value was set on the original event or if it was mapped via int_network_user_id_map; otherwise it will be the network_userid of the original Snowplow event
- event_type: either the value of the action attribute if the event is an action event, or the value of event_name otherwise
- user_ipaddress: the value of user_ipaddress from the original Snowplow event
- useragent: the value of useragent from the original Snowplow event
- br_name: the value of br_name from the original Snowplow event
- dvce_type: the value of dvce_type from the original Snowplow event
- entity_guid: the value of entity_guid as extracted in the corresponding snowplow_atomic.event_contexts record
- event_name: the value of event_name as extracted in the corresponding event_contexts record
- session_context: a string representing a JSON map of attribute/value pairs extracted from the Snowplow event that are applicable to a user session
- event_context: a string representing a JSON map of attribute/value pairs extracted from the Snowplow event that can be aggregated on a per-session and per-entity basis for downstream analysis
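For illustration, the user_id and event_type derivations described above might look roughly like the following sketch (the join keys and the treatment of action events are assumptions, not the model's actual SQL):

-- Hypothetical excerpt of the stg_session_events field derivation
select
    ev.event_id,
    ev.collector_tstamp,
    -- Prefer the native Minds user_id, then the id mapped via int_network_user_id_map,
    -- and fall back to the raw Snowplow network_userid
    coalesce(ev.user_id, map.user_id, ev.network_userid) as user_id,
    -- Action events use their action attribute as the event type; other events
    -- fall back to the event name
    case when ec.event_name = 'action' then ec.action else ec.event_name end as event_type,
    ec.entity_guid,
    ec.event_name
from {{ source('snowplow_atomic', 'events') }} as ev
left join {{ source('snowplow_atomic', 'event_contexts') }} as ec
    on ec.event_id = ev.event_id
left join {{ ref('int_network_user_id_map') }} as map
    on map.network_userid = ev.network_userid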
stg_sessions
This model aggregates event information over a single user session, where a user session is defined as all events associated with a single user_id (value from the stg_session_events record) within an interval of 24 hours or less where there is no time delay of more than 30 minutes between two subsequent events. The model aggregates per-session event information which can be further filtered and aggregated for per-user, per-day, per-cohort, etc. reporting and analysis without exposing the specific details of user behaviour on site.
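This kind of sessionization is typically implemented with window functions: flag every event that follows a gap of more than 30 minutes as the start of a new session, then take a running sum of those flags per user. The following is a simplified, hypothetical sketch of that idea (it omits the 24-hour session_day split described below and is not the model's actual SQL):

-- Hypothetical sessionization sketch over stg_session_events
with ordered_events as (
    select
        user_id,
        collector_tstamp,
        lag(collector_tstamp) over (partition by user_id order by collector_tstamp) as prev_tstamp
    from {{ ref('stg_session_events') }}
),
flagged as (
    select
        user_id,
        collector_tstamp,
        -- A gap of more than 30 minutes (or no previous event) starts a new session
        case when prev_tstamp is null
               or datediff('minute', prev_tstamp, collector_tstamp) > 30
             then 1 else 0 end as new_session_flag
    from ordered_events
)
select
    user_id,
    -- The running count of session starts per user is the ordinal session_number
    sum(new_session_flag) over (
        partition by user_id order by collector_tstamp
        rows between unbounded preceding and current row
    ) as session_number,
    collector_tstamp
from flagged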
The output fields of this model are as follows:
- user_id: the user_id value used to aggregate records from stg_session_events
- userid_type: the type of user id; possible values are user for a native Minds user guid, pseudo for a masked pseudonymous Snowplow user id, or network for an anonymous (not logged-in) Snowplow-assigned identity
- session_number: the ordinal number of this session for the specified user_id, as found in the stg_session_events table
- session_day: in the event that a session delimited by a 30-minute inactivity window would last longer than 24 hours, the session is broken down into sub-sessions of no longer than 24 hours each and assigned a session_day number, starting with 0, in chronological order, with the session_number value staying the same for each sub-session record
- network_userid: a Snowflake array value of all network_userid values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- user_ipaddress: a Snowflake array value of all user_ipaddress values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- useragent: a Snowflake array value of all useragent values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- br_name: a Snowflake array value of all br_name values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- dvce_type: a Snowflake array value of all dvce_type values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- user_first_tstamp: the collector_tstamp value of the first event in the stg_session_events table with this user_id value (or, alternately, the first_tstamp value of the first session associated with this user_id)
- first_tstamp: the collector_tstamp value of the first event record associated with this session
- last_tstamp: the collector_tstamp value of the last event record associated with this session
- duration: the time in seconds between the first_tstamp and last_tstamp values
- event_count: the total count of event records associated with this session
The following fields contain the aggregated count of events of each event_type value associated with this session:
- view_count
- screen_view_count
- page_ping_count
- growthbook_count
- active_count
- vote_up_count
- vote_up_cancel_count
- vote_down_count
- vote_down_cancel_count
- create_count
- delete_count
- remind_count
- comment_count
- login_count
- subscribe_count
- boost_count
- accept_count
- email_confirm_count
- email_click_count
- email_unsubscribe_count
- block_count
- unblock_count
- join_count
- joined_rewards_count
stg_session_contexts
This model aggregates the session_context values from snowplow_atomic.event_contexts that are associated with each user session in stg_sessions.
The output fields of this model are as follows:
- user_id: as in stg_sessions
- session_number: as in stg_sessions
- session_day: as in stg_sessions
- context: the value of the session_context field in the snowplow_atomic.event_contexts table
- first_tstamp: the collector_tstamp value of the first event in the associated session that has this context value
- last_tstamp: the collector_tstamp value of the last event in the associated session that has this context value
- event_count: the count of events associated with this session that have this context value
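Conceptually this is a group-by over the session keys and the context value. A minimal sketch, assuming the session keys have already been attached to each event (that attachment is itself part of the circular-dependency handling described earlier, and the relation name below is hypothetical):

-- Hypothetical aggregation sketch for stg_session_contexts
select
    user_id,
    session_number,
    session_day,
    session_context as context,
    min(collector_tstamp) as first_tstamp,
    max(collector_tstamp) as last_tstamp,
    count(*) as event_count
from session_events_with_session_keys  -- hypothetical intermediate relation
group by user_id, session_number, session_day, session_context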
stg_entity_contexts
This model aggregates the event_context values from stg_session_events for a particular entity_guid value within each one-hour wall-clock interval.
The output fields of this model are as follows:
- entity_guid: the entity_guid value from stg_session_events
- event_hour: the one-hour wall-clock interval used for aggregation (timestamp of the first second of the hour)
- event_name: the event_name value from stg_session_events
- event_context: the event_context value from stg_session_events
- first_tstamp: the collector_tstamp value of the first aggregated event from stg_session_events that has this event_context value
- last_tstamp: the collector_tstamp value of the last aggregated event from stg_session_events that has this event_context value
- event_count: the count of aggregated events that have this event_context value
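The hourly bucketing can be expressed with date_trunc; a minimal hypothetical sketch (not the model's actual SQL):

-- Hypothetical hourly aggregation sketch for stg_entity_contexts
select
    entity_guid,
    date_trunc('hour', collector_tstamp) as event_hour,
    event_name,
    event_context,
    min(collector_tstamp) as first_tstamp,
    max(collector_tstamp) as last_tstamp,
    count(*) as event_count
from {{ ref('stg_session_events') }}
where entity_guid is not null
group by entity_guid, date_trunc('hour', collector_tstamp), event_name, event_context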