Minds dbt Project Documentation

Introduction

The Minds dbt project is structured as a series of transformation layers which start with raw data and progress toward data suitable for end use in business analytics and product features. The rough flow is as follows:

  • Data sources 🡒 Data staging 🡒 Intermediate processing 🡒 Domain models 🡒 KPI models and presentation models

For any models which require non-trivial incremental processing, there are also "checkpoint" models which freeze the state of an incremental model as of the last dbt run (and circumvent dbt reference cycle detection).

Operations

To work with the Minds dbt project, first make sure you have your environment set up and running as explained in the dbt introduction here.

Bootstrap vs Incremental vs Full Rebuild

dbt has basic built-in support for incremental processing of models (i.e. the entire database table will not need to be rebuilt from scratch each time dbt runs). This is particularly important when the tables are very large and the processing to create them is complex and compute-intensive (which is the case for a number of models in the Minds project). However, the built-in support has certain limitations, particularly when the output of a model from a previous run is used as an upstream input to the same model in a subsequent incremental run (a circular dependency). This situation arises for a number of Minds models.

To process these sorts of models, the Minds dbt project resolves the circular dependencies through "checkpoint" models (found in the dbt/models/checkpoint folder) and conditional inclusion blocks in the models that need them for incremental processing.

One consequence is that a bootstrap process is required the first time a new (empty) schema is created from the dbt project. This creates the initial tables and checkpoints required to start incremental processing.
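
The exact SQL differs per model, but the general shape of the checkpoint pattern is sketched below. The model names (chk_last_sessions_full, stg_sessions, stg_session_events) are taken from this project; the SQL bodies, the column choices, and the decision to continue session numbering from the checkpoint are illustrative assumptions, not the actual implementation.

-- models/checkpoint/chk_last_sessions_full.sql (illustrative sketch)
-- Freeze the output of the incremental stg_sessions model as of the last run.
{{ config(materialized='table') }}

select * from {{ ref('stg_sessions') }}

-- models/staging/stg_sessions.sql (simplified, illustrative fragment)
{{ config(materialized='incremental') }}

select
    e.user_id,
    min(e.collector_tstamp) as first_tstamp,
    max(e.collector_tstamp) as last_tstamp
    {% if is_incremental() %}
    -- conditional inclusion block: carry session numbering forward from the
    -- state frozen at the last run
    , coalesce(max(prev.session_number), 0) + 1 as session_number
    {% else %}
    , 1 as session_number
    {% endif %}
from {{ ref('stg_session_events') }} as e
{% if is_incremental() %}
-- read the checkpoint by relation name rather than ref(), so dbt does not
-- see a reference cycle back to stg_sessions
left join {{ target.schema }}.chk_last_sessions_full as prev
    on prev.user_id = e.user_id
{% endif %}
group by e.user_id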

To "bootstrap", that is, to start a new schema (e.g. for a new developer), first run the following dbt command (this will take approximately 3 minutes to complete):

(dbtenv) $ dbt run --full-refresh --select \
chk_last_contexts_full \
chk_last_entity_contexts_full \
chk_last_sessions_full \
chk_network_user_mapping_full \
chk_tstamp_interval_full \
stg_session_contexts \
stg_entity_contexts \
stg_sessions \
int_network_user_id_map \
stg_session_events

This will create the database tables required for incremental processing, and will process the first 28 days of event data found in the snowplow_atomic.events source table. After this completes, subsequent incremental runs (the default behaviour of dbt run) will each process up to 28 days' worth of additional event data from the snowplow_atomic.events table until the incremental tables in the new schema have caught up to the latest records in snowplow_atomic.events.
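
The 28-day bound itself is applied inside the incremental models; the exact expression is not reproduced here, but the idea can be sketched as a window over the source events anchored on a checkpointed high-water mark. The use of chk_tstamp_interval_full and its last_tstamp column below are assumptions for illustration.

-- Illustrative sketch only: limit an incremental run to at most 28 days of
-- snowplow_atomic.events data past the last processed timestamp.
{% if is_incremental() %}
where collector_tstamp > (
        select max(last_tstamp) from {{ target.schema }}.chk_tstamp_interval_full)
  and collector_tstamp <= (
        select dateadd(day, 28, max(last_tstamp))
        from {{ target.schema }}.chk_tstamp_interval_full)
{% endif %}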

If at some point it becomes necessary to rebuild an existing developer schema from scratch, add the --full-refresh directive to the dbt command line as follows:

(dbtenv) $ dbt run --full-refresh

This will rebuild all tables and set the entire project back to the initial 28-day processing of snowplow event data.

Developer vs Production

The default behavior of dbt is to run the project with the dev target, and process the tables in the corresponding schema configured in the profiles.yml file. If it becomes necessary to manually run the project (or an updated model) in the production schema (not recommended), this can be done by adding the -t prod target specification to the command line as follows:

(dbtenv) $ dbt run -t prod

This will update the tables in the production schema, dbt_prod, directly.

Transformation Layers

Each layer has its own directory under minds-etl/dbt/models.

Sources

The Minds dbt project has six data sources it uses as starting points for the data transformation model. Data from these sources is loaded into Snowflake via Prefect ETL flows prior to running the dbt project. The following files in the dbt/models/sources directory specify the names and locations of each source table used in the project:

  • cassandra.yml: tables imported from Cassandra, including separate tables split out from the Cassandra entities table
  • elasticsearch.yml: "engagement events" extracted from the Elasticsearch server
  • mautic.yml: tables of email campaign data extracted via the Mautic API
  • snowplow.yml: raw Snowplow event data (events) and processed contextual data for aggregation (event_contexts)
  • stripe.yml: tables of financial transaction data via the Stripe API
  • vitess.yml: tables extracted from the Vitess application database (e.g. boosts) that are used for analysis
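
Within the models, these tables are referenced with dbt's source() function rather than by hard-coded names. For example (assuming the Snowplow tables are declared under a source named snowplow_atomic, as the rest of this document suggests):

-- Example: reading a declared source table from a staging model
select *
from {{ source('snowplow_atomic', 'events') }}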

Staging

The staging models are, in most cases, slightly cleaned-up (deduplicated, slowly-changing dimensioned) versions of the source data. A few models perform substantive transformations, however. These are documented in a separate section below.

Intermediate

The models in the intermediate layer perform a variety of utility and helper transformations for models in the staging and domain layers, and also for a number of Prefect flows (such as for making content recommendations).

Domain

The models in the domain layer represent a consolidated entity-relation data model of the Minds application domain. The models in this layer are intended to either be queried directly, or combined, filtered, and aggregated into task-specific KPI and presentation models.

KPI

The models in the kpi layer are aggregations designed to support dashboards and other reporting of specific areas of business KPIs (e.g. "acquisition", "retention", etc.).

Presentation

The models in the presentation layer are composed from models in lower layers and restructured to facilitate use in interactive reporting and visualisation tools such as Superset.

Events, Sessions, and Aggregation

These staging models form the basis for aggregated processing of user interaction data and application events. The purpose of these models is to abstract the essential data from the fine-grained user-event data generated by Snowplow, so that the original fine-grained data can be discarded according to a data-retention policy without compromising the legitimate product feature and business analytics requirements for the aggregate data.

The models start with the raw snowplow_atomic.events records and the processed snowplow_atomic.event_contexts records, and build incremental models that support the foreseeable downstream data use cases. The models are as follows:

stg_session_events

This model is the starting point for user event aggregation. Its data sources are the raw events from snowplow_atomic.events, the processed events from snowplow_atomic.event_contexts, and the int_network_user_id_map table, which maps a Snowplow network_userid to a native Minds application user_id. This model extracts the values that are relevant for composing records for the stg_sessions, stg_session_contexts, and stg_entity_contexts models; a sketch of the user_id and event_type derivations follows the field list below.

The output fields of this model are as follows:

  • event_id: the id of the original Snowplow event, as found in snowplow_atomic.events and snowplow_atomic.event_contexts
  • collector_tstamp: the collector timestamp on the original Snowplow event
  • user_id: the Minds native user_id if the value was set on the original event, or if it was mapped via int_network_user_id_map; otherwise it will be the network_userid of the original Snowplow event
  • event_type: either the value of the action attribute if the event is an action event, or the value of event_name otherwise
  • user_ipaddress: the value of user_ipaddress from the original Snowplow event
  • useragent: the value of useragent from the original Snowplow event
  • br_name: the value of br_name from the original Snowplow event
  • dvce_type: the value of dvce_type from the original Snowplow event
  • entity_guid: the value of entity_guid as extracted in the corresponding snowplow_atomic.event_contexts record
  • event_name: the value of event_name as extracted in the corresponding event_contexts record
  • session_context: a string representing a json map of attribute/value pairs extracted from the snowplow event that are applicable to a user session
  • event_context: a string representing a json map of attribute/value pairs extracted from the snowplow event that can be aggregated on a per-session and per-entity basis for downstream analysis
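
As an illustration of the user_id and event_type derivations described above, the logic can be sketched roughly as follows. The join conditions, the mapping columns on int_network_user_id_map, and the action-event test are assumptions for illustration, not the actual model SQL.

-- Illustrative sketch of the user_id and event_type derivations
select
    ev.event_id,
    ev.collector_tstamp,
    -- prefer the native user_id, then the mapped id, then the raw network_userid
    coalesce(ev.user_id, map.user_id, ev.network_userid) as user_id,
    -- action events report their action attribute; other events report event_name
    case
        when ctx.event_name = 'action' then ctx.action
        else ctx.event_name
    end as event_type
from {{ source('snowplow_atomic', 'events') }} as ev
left join {{ source('snowplow_atomic', 'event_contexts') }} as ctx
    on ctx.event_id = ev.event_id
left join {{ ref('int_network_user_id_map') }} as map
    on map.network_userid = ev.network_userid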

stg_sessions

This model aggregates event information over a single user session, where a user session is defined as all events associated with a single user_id (value from the stg_session_events record) within an interval of 24 hours or less in which there is no gap of more than 30 minutes between consecutive events. The model aggregates per-session event information which can be further filtered and aggregated for per-user, per-day, per-cohort, etc. reporting and analysis without exposing the specific details of user behaviour on site.
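
The 30-minute inactivity rule can be sketched with a window function over the per-user event stream. The sketch below shows only the gap-based session numbering; the 24-hour session_day split described in the field list is omitted, and the SQL is illustrative rather than the actual model.

-- Illustrative sketch: start a new session whenever more than 30 minutes
-- (1800 seconds) elapse between consecutive events for the same user_id.
with ordered_events as (
    select
        user_id,
        collector_tstamp,
        lag(collector_tstamp) over (
            partition by user_id order by collector_tstamp
        ) as prev_tstamp
    from {{ ref('stg_session_events') }}
)

select
    user_id,
    collector_tstamp,
    -- running count of session starts (first event, or a gap over 30 minutes)
    sum(
        case
            when prev_tstamp is null
              or datediff('second', prev_tstamp, collector_tstamp) > 1800
            then 1 else 0
        end
    ) over (
        partition by user_id
        order by collector_tstamp
        rows between unbounded preceding and current row
    ) as session_number
from ordered_events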

The output fields of this model are as follows:

  • user_id: the user_id value used to aggregate records from stg_session_events
  • userid_type: the type of user id; possible values are user for a native Minds user guid, pseudo for a masked pseudonymous Snowplow user id, or network for an anonymous (not logged-in) Snowplow-assigned identity
  • session_number: the ordinal number of this session for the specified user_id, as found in the stg_session_events table
  • session_day: in the event that a session delimited by a 30-minute inactivity window would last longer than 24 hours, the session is broken into sub-sessions of no longer than 24 hours each; each sub-session is assigned a session_day number in chronological order, starting with 0, while the session_number value stays the same across the sub-session records
  • network_userid: a Snowflake array value of all network_userid values found on event records associated with this user session; in the normative case, this array will contain one and only one value
  • user_ipaddress: a Snowflake array value of all user_ipaddress values found on event records associated with this user session; in the normative case, this array will contain one and only one value
  • useragent: a Snowflake array value of all useragent values found on event records associated with this user session; in the normative case, this array will contain one and only one value
  • br_name: a Snowflake array value of all br_name values found on event records associated with this user session; in the normative case, this array will contain one and only one value
  • dvce_type: a Snowflake array value of all dvce_type values found on event records associated with this user session; in the normative case, this array will contain one and only one value
  • user_first_tstamp: the collector_tstamp value of the first event in the stg_session_events table with this user_id value (or, equivalently, the first_tstamp value of the first session associated with this user_id)
  • first_tstamp: the collector_tstamp value of the first event record associated with this session
  • last_tstamp: the collector_tstamp value of the last event record associated with this session
  • duration: the time in seconds between the first_tstamp and last_tstamp values
  • event_count: the total count of event records associated with this session

The following fields contain the aggregated count of events of each event_type value associated with this session:

  • view_count
  • screen_view_count
  • page_ping_count
  • growthbook_count
  • active_count
  • vote_up_count
  • vote_up_cancel_count
  • vote_down_count
  • vote_down_cancel_count
  • create_count
  • delete_count
  • remind_count
  • comment_count
  • login_count
  • subscribe_count
  • boost_count
  • accept_count
  • email_confirm_count
  • email_click_count
  • email_unsubscribe_count
  • block_count
  • unblock_count
  • join_count
  • joined_rewards_count
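
Each of these is a conditional aggregate over the session's events; a few are sketched below using Snowflake's count_if. The event_type literals are inferred from the field names, and the sessionized_events relation is hypothetical, so treat this as a sketch rather than the model's SQL.

-- Illustrative sketch: per-event_type counts within a session
select
    user_id,
    session_number,
    count(*)                           as event_count,
    count_if(event_type = 'vote_up')   as vote_up_count,
    count_if(event_type = 'comment')   as comment_count,
    count_if(event_type = 'subscribe') as subscribe_count
from sessionized_events  -- hypothetical: events already assigned to sessions
group by user_id, session_number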

stg_session_contexts

This model aggregates the session_context values from snowplow_atomic.event_contexts that are associated with each user session in stg_sessions.

The output fields of this model are as follows:

  • user_id: as in stg_sessions
  • session_number: as in stg_sessions
  • session_day: as in stg_sessions
  • context: the value of the session_context field from the snowplow_atomic.event_contexts source
  • first_tstamp: the collector_tstamp value of the first event in the associated session that has this context value
  • last_tstamp: the collector_tstamp value of the last event in the associated session that has this context value
  • event_count: the count of events associated with this session that have this context value

stg_entity_contexts

This model aggregates the event_context values from stg_session_events for a particular entity_guid value within each one-hour wall-clock interval.
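
The hourly bucketing can be sketched with date_trunc over stg_session_events; the sketch below is illustrative only, and the filter on entity_guid is an assumption.

-- Illustrative sketch: aggregate event_context values per entity per wall-clock hour
select
    entity_guid,
    date_trunc('hour', collector_tstamp) as event_hour,
    event_name,
    event_context,
    min(collector_tstamp)                as first_tstamp,
    max(collector_tstamp)                as last_tstamp,
    count(*)                             as event_count
from {{ ref('stg_session_events') }}
where entity_guid is not null  -- assumption: only events tied to an entity
group by 1, 2, 3, 4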

The output fields of this model are as follows:

  • entity_guid: the entity_guid value from stg_session_events
  • event_hour: the one-hour wall-clock interval used for aggregation (timestamp of the first second of the hour)
  • event_name: the event_name value from stg_session_events
  • event_context: the event_context value from stg_session_events
  • first_tstamp: the collector_tstamp value of the first aggregated event from stg_session_events that has this event_context value
  • last_tstamp: the collector_tstamp value of the last aggregated event from stg_session_events that has this event_context value
  • event_count: the count of aggregated events that have this event_context value