Minds dbt Project Documentation
Introduction
The Minds dbt project is structured as a series of transformation layers which start with raw data and progress toward data suitable for end use in business analytics and product features. The rough flow is as follows:
- Data sources 🡒 Data staging 🡒 Intermediate processing 🡒 Domain models 🡒 KPI models and presentation models
For any models which require non-trivial incremental processing, there are also "checkpoint" models which freeze the state of an incremental model as of the last dbt run (and circumvent dbt reference cycle detection).
Operations
To work with the Minds dbt project, first make sure you have your environment set up and running as explained in the dbt introduction here.
Bootstrap vs Incremental vs Full Rebuild
dbt has basic built-in support for incremental processing of models (i.e. the entire database table does not need to be rebuilt from scratch each time dbt runs). This is particularly important when the tables are very large and the processing to create them is complex and compute-intensive (which is the case for a number of models in the Minds project). However, the built-in support has certain limitations, particularly when the output of a model from a previous run is used as an upstream input to the same model in a subsequent incremental run (a circular dependency). This situation arises for a number of Minds models.
To process these models, the Minds dbt project resolves the circular dependencies with "checkpoint" models (found in the dbt/models/checkpoint folder) and conditional inclusion blocks in the models that need them for incremental processing.
One consequence is that a bootstrap process is required the first time a new (empty) schema is created from the dbt project. This creates the initial tables and checkpoints required to start incremental processing.
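As a rough illustration of the pattern (a sketch only, with hypothetical filter logic; the actual models in dbt/models/checkpoint differ in detail), a checkpoint model freezes the previous run's output by selecting from the table directly rather than through ref(), and the incremental model reads the checkpoint only inside an is_incremental() block:
-- Sketch of a checkpoint model (e.g. chk_last_sessions_full); illustrative only.
-- Selecting from the table by name rather than via ref('stg_sessions') is what
-- circumvents dbt's reference cycle detection (first-run handling omitted here).
{{ config(materialized='table') }}

select * from {{ target.schema }}.stg_sessions

-- Sketch of the corresponding incremental model (e.g. stg_sessions); illustrative only.
-- The checkpoint is referenced only on incremental runs, so a bootstrap or
-- --full-refresh run can build the model without consulting it.
{{ config(materialized='incremental') }}

select events.*
from {{ ref('stg_session_events') }} as events
{% if is_incremental() %}
-- Hypothetical filter: only process events newer than the state frozen at the last run.
where events.collector_tstamp > (select max(last_tstamp) from {{ ref('chk_last_sessions_full') }})
{% endif %}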
To "bootstrap", that is, to start a new schema (e.g. for a new developer), first run the following dbt command (this will take approximately 3 minutes to complete):
(dbtenv) $ dbt run --full-refresh --select \
chk_last_contexts_full \
chk_last_entity_contexts_full \
chk_last_sessions_full \
chk_network_user_mapping_full \
chk_tstamp_interval_full \
stg_session_contexts \
stg_entity_contexts \
stg_sessions \
int_network_user_id_map \
stg_session_events
This will create the database tables required for incremental processing, and will process the first 28 days of event data found in the snowplow_atomic.events source table. After this completes, subsequent incremental runs (the default behaviour of dbt run) will process up to 28 days' worth of additional event data from the snowplow_atomic.events table until the incremental tables in the new schema have caught up to the latest records in the source.
If at some point it becomes necessary to rebuild an existing developer schema from scratch, add the --full-refresh directive to the dbt command line as follows:
(dbtenv) $ dbt run --full-refresh
This will rebuild all tables and set the entire project back to the initial 28-day processing of snowplow event data.
Developer vs Production
The default behaviour of dbt is to run the project with the dev target and process the tables in the corresponding schema configured in the profiles.yml file. If it becomes necessary to manually run the project (or an updated model) in the production schema (not recommended), this can be done by adding the -t prod target specification to the command line as follows:
(dbtenv) $ dbt run -t prod
This will update the tables in the production schema, dbt_prod, directly.
Transformation Layers
Each layer has its own directory under minds-etl/dbt/models.
Sources
The Minds dbt project has six data sources it uses as starting points for the data transformation model. Data from these sources is loaded into Snowflake via Prefect ETL flows prior to running the dbt project. The following files in the dbt/models/sources directory specify the names and locations of each source table used in the project:
- cassandra.yml: tables imported from cassandra, including separate tables split out from the cassandra entities table
- elasticsearch.yml: "engagement events" extracted from the ElasticSearch server
- mautic.yml: tables of email campaign data extracted via the Mautic API
- snowplow.yml: raw Snowplow event data (events) and processed contextual data for aggregation (event_contexts)
- stripe.yml: tables of financial transaction data extracted via the Stripe API
- vitess.yml: tables extracted from the vitess application database (e.g. boosts) that are used for analysis
Staging
The staging models are, in most cases, slightly cleaned-up versions of the source data (deduplicated, with slowly changing dimensions tracked). A few models perform substantive transformations, however; these are documented in a separate section below.
Intermediate
The models in the intermediate layer perform a variety of utility and helper transformations for models in the staging and domain layers, and also for a number of Prefect flows (such as for making content recommendations).
Domain
The models in the domain layer represent a consolidated entity-relation data model of the Minds application domain. The models in this layer are intended to either be queried directly, or combined, filtered, and aggregated into task-specific KPI and presentation models.
KPI
The models in the kpi layer are aggregations designed to support dashboards and other reporting of specific areas of business KPIs (e.g. "acquisition", "retention", etc.).
Presentation
The models in the presentation layer are composed from models in lower layers and restructured to facilitate use in interactive reporting and visualisation tools such as Superset.
Events, Sessions, and Aggregation
These staging models form the basis for aggregated processing of user interaction data and application events. The purpose of these models is to abstract the essential data from the fine-grained user-event data generated by Snowplow, so that the original fine-grained data can be discarded according to a data-retention policy without compromising the legitimate product feature and business analytics requirements for the aggregate data.
The models start with the raw snowplow_atomic.events records, and the processed snowplow_atomic.event_contexts records, and build incremental models which provide for the foreseeable downstream data use cases. The models are as follows:
stg_session_events
This model is the starting point for user event aggregation. It uses as data sources the raw events from snowplow_atomic.events, the processed events from snowplow_atomic.event_contexts, and the int_network_user_id_map table, which maps a Snowplow network_userid to a native Minds application user_id. This model extracts the values that are relevant for composing records for the stg_sessions, stg_session_contexts, and stg_entity_contexts models.
The output fields of this model are as follows:
- event_id: the id of the original Snowplow event, as found in snowplow_atomic.events and snowplow_atomic.event_contexts
- collector_tstamp: the collector timestamp on the original Snowplow event
- user_id: the Minds native user_id if the value was set on the original event, or if it was mapped via int_network_user_id_map; otherwise it will be the network_userid of the original Snowplow event (see the sketch after this list)
- event_type: either the value of the action attribute if the event is an action event, or the value of event_name otherwise
- user_ipaddress: the value of user_ipaddress from the original Snowplow event
- useragent: the value of useragent from the original Snowplow event
- br_name: the value of br_name from the original Snowplow event
- dvce_type: the value of dvce_type from the original Snowplow event
- entity_guid: the value of entity_guid as extracted in the corresponding snowplow_atomic.event_contexts record
- event_name: the value of event_name as extracted in the corresponding event_contexts record
- session_context: a string representing a json map of attribute/value pairs extracted from the Snowplow event that are applicable to a user session
- event_context: a string representing a json map of attribute/value pairs extracted from the Snowplow event that can be aggregated on a per-session and per-entity basis for downstream analysis
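As a rough sketch of the user_id resolution described above (illustrative only; the column names on int_network_user_id_map are assumptions):
-- Illustrative only: prefer the native user_id from the event, then the id
-- mapped via int_network_user_id_map, then fall back to network_userid.
select
  e.event_id,
  e.collector_tstamp,
  coalesce(e.user_id, m.user_id, e.network_userid) as user_id
from snowplow_atomic.events as e
left join {{ ref('int_network_user_id_map') }} as m
  on m.network_userid = e.network_userid  -- assumed column names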
stg_sessions
This model aggregates event information over a single user session, where a user session is defined as all events associated with a single user_id (the value from the stg_session_events record) within an interval of 24 hours or less, with no gap of more than 30 minutes between consecutive events. The model aggregates per-session event information which can be further filtered and aggregated for per-user, per-day, per-cohort, etc. reporting and analysis without exposing the specific details of user behaviour on site.
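A minimal sketch of the 30-minute inactivity rule, in Snowflake SQL and illustrative only (the 24-hour session_day split and the per-session aggregations are omitted):
-- Illustrative only: mark an event as starting a new session when more than
-- 30 minutes have passed since the previous event for the same user_id, then
-- number the sessions with a running sum of those markers.
with gaps as (
  select
    user_id,
    collector_tstamp,
    lag(collector_tstamp) over (partition by user_id order by collector_tstamp) as prev_tstamp
  from {{ ref('stg_session_events') }}
)
select
  user_id,
  collector_tstamp,
  sum(
    case when prev_tstamp is null
           or datediff('minute', prev_tstamp, collector_tstamp) > 30
         then 1 else 0 end
  ) over (
    partition by user_id
    order by collector_tstamp
    rows between unbounded preceding and current row
  ) as session_number
from gaps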
The output fields of this model are as follows:
- user_id: the user_id value used to aggregate records from stg_session_events
- userid_type: the type of user id; possible values are user for a native Minds user guid, pseudo for a masked pseudonymous Snowplow user id, or network for an anonymous (not logged-in) Snowplow-assigned identity
- session_number: the ordinal number of this session for the specified user_id, as found in the stg_session_events table
- session_day: in the event that a session delimited by a 30-minute inactivity window would last longer than 24 hours, the session is broken down into sub-sessions of no longer than 24 hours each, and assigned a session_day number, starting with 0, in chronological order, with the session_number value staying the same for each sub-session record
- network_userid: a Snowflake array value of all network_userid values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- user_ipaddress: a Snowflake array value of all user_ipaddress values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- useragent: a Snowflake array value of all useragent values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- br_name: a Snowflake array value of all br_name values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- dvce_type: a Snowflake array value of all dvce_type values found on event records associated with this user session; in the normative case, this array will contain one and only one value
- user_first_tstamp: the collector_tstamp value of the first event in the stg_session_events table with this user_id value (or, alternately, the first_tstamp value of the first session associated with this user_id)
- first_tstamp: the collector_tstamp value of the first event record associated with this session
- last_tstamp: the collector_tstamp value of the last event record associated with this session
- duration: the time in seconds between the first_tstamp and last_tstamp values
- event_count: the total count of event records associated with this session
The following fields contain the aggregated count of events of each event_type value associated with this session:
- view_count
- screen_view_count
- page_ping_count
- growthbook_count
- active_count
- vote_up_count
- vote_up_cancel_count
- vote_down_count
- vote_down_cancel_count
- create_count
- delete_count
- remind_count
- comment_count
- login_count
- subscribe_count
- boost_count
- accept_count
- email_confirm_count
- email_click_count
- email_unsubscribe_count
- block_count
- unblock_count
- join_count
- joined_rewards_count
stg_session_contexts
This model aggregates the session_context values from snowplow_atomic.event_contexts that are associated with each user session in stg_sessions.
The output fields of this model are as follows:
- user_id: as in stg_sessions
- session_number: as in stg_sessions
- session_day: as in stg_sessions
- context: the value of the session_context field in the snowplow_atomic.event_contexts table
- first_tstamp: the collector_tstamp value of the first event in the associated session that has this context value
- last_tstamp: the collector_tstamp value of the last event in the associated session that has this context value
- event_count: the count of events associated with this session that have this context value
stg_entity_contexts
This model aggregates the event_context values from stg_session_events for a particular entity_guid value within each one-hour wall-clock interval.
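As a rough sketch (illustrative only, assuming Snowflake SQL and the field names listed below), the hourly bucketing amounts to something like the following:
-- Illustrative only: bucket events into one-hour wall-clock intervals and
-- aggregate per entity, event name, and event_context value.
select
  entity_guid,
  date_trunc('hour', collector_tstamp) as event_hour,
  event_name,
  event_context,
  min(collector_tstamp) as first_tstamp,
  max(collector_tstamp) as last_tstamp,
  count(*) as event_count
from {{ ref('stg_session_events') }}
group by 1, 2, 3, 4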
The output fields of this model are as follows:
- entity_guid: the entity_guid value from stg_session_events
- event_hour: the one-hour wall-clock interval used for aggregation (timestamp of the first second of the hour)
- event_name: the event_name value from stg_session_events
- event_context: the event_context value from stg_session_events
- first_tstamp: the collector_tstamp value of the first aggregated event from stg_session_events that has this event_context value
- last_tstamp: the collector_tstamp value of the last aggregated event from stg_session_events that has this event_context value
- event_count: the count of aggregated events that have this event_context value