
Data Pipeline Overview

The main components of the Minds ETL data pipeline are as follows:

  • Data sources: event and entity state data from various internal systems (e.g. Snowplow app events) and external systems (e.g. Stripe transactions)
  • Data sinks: destinations for processed data, including the Snowflake data warehouse, the Vitess application database, the Elasticsearch search engine, and various Pulsar pub/sub event streams
  • Extract/Transform/Load (ETL) scripts: scripts, usually written in Python, that typically run under the Prefect orchestration manager, but may also run as standalone containerised scripts in the production infrastructure (e.g. Img To Caption) or as Airbyte connectors (a sketch of the extract/load pattern follows this list)
  • Data transformation scripts: dbt scripts that take the unprocessed data loaded into the Snowflake data warehouse from the various sources and transform it into a comprehensive data model useful for business analytics and product functionality
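
As a rough illustration of the extract/load pattern (the transform leg is handled downstream by the dbt models), the sketch below shows how an ETL script might pull records from an external API and land them raw in Snowflake. The Stripe endpoint usage is standard, but the table name, environment-variable names, and credential handling are assumptions for illustration; the actual flows in minds-etl differ in detail.

    import json
    import os

    import requests
    import snowflake.connector


    def extract_stripe_charges(limit=100):
        # Pull a page of recent charge records from the Stripe REST API.
        resp = requests.get(
            "https://api.stripe.com/v1/charges",
            params={"limit": limit},
            auth=(os.environ["STRIPE_API_KEY"], ""),
        )
        resp.raise_for_status()
        return resp.json()["data"]


    def load_to_snowflake(rows):
        # Land the raw JSON payloads in a staging table; the dbt staging models
        # take over from here. Table and column names are hypothetical.
        conn = snowflake.connector.connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            database="RAW",
            schema="STRIPE",
        )
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO CHARGES_RAW (PAYLOAD) VALUES (%s)",
                    [(json.dumps(row),) for row in rows],
                )
        finally:
            conn.close()


    if __name__ == "__main__":
        load_to_snowflake(extract_stripe_charges())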

Repositories

minds-etl

This is the main repository for the ETL pipeline. It consists of three main sub-directories: one for building the run-time Docker containers for the Prefect and Airbyte environments, one for the dbt transformation models, and one for the "flows", the orchestrated Python scripts that perform extract, transform, and load functions.

The repository structure is as follows:

  • minds-etl
    • dbt (dbt transformation models)
      • dbt_project.yml (configuration of the overall dbt project)
      • macros (miscellaneous dbt support macros)
      • models (the dbt data transformation models)
        • sources (specification of external data sources used by models in the dbt project)
        • staging (models dealing with initial ingestion/preprocessing of source data)
        • checkpoint (models that set high-water marks for processed data between runs to enable incremental processing)
        • intermediate (models that perform miscellaneous involved transformations between staging and domain models)
        • domain (models that consolidate data from various sources into a domain-specific entity-relation model)
        • kpi (models that perform various statistical aggregations over the domain models, which can be used in dashboards or presentation models)
        • presentation (models intended to show a particular view of the domain model and/or KPI statistics, for use in data visualisation tools such as Superset)
    • docker (configuration and setup of the run-time docker containers for the Prefect and Airbyte environments)
      • docker-compose.yml (very important file which defines the configuration and orchestration of the different docker containers used by Prefect and Airbyte)
      • prefect (configuration, setup, and dependencies for the Prefect "agent" container)
        • setup.sh (initialization shell script for the Prefect agent container)
        • requirements.txt (Python dependencies for the Prefect "flow" scripts, including libraries needed to connect to various data sources/sinks, etc.)
        • config.toml (setup configs for Prefect operating mode: server vs cloud)
        • mautic-0.1.0-py38-none-any.whl (python library dependency to connect to Mautic)
      • docker-entrypoint-initdb.d
        • docker-initialize-multiple-databases.sh (helper script to initialize a single PostgreSQL instance with multiple application databases, e.g. for Prefect and Airbyte)
    • prefect (directory containing the various ETL scripts ("flows") managed by the Prefect environment; a sketch of a typical flow follows this list)
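
For orientation, here is a minimal sketch of how one of these flows is typically structured, assuming the Prefect 1.x API that the server-vs-cloud config.toml suggests; the flow name, schedule, and task bodies are placeholders rather than an actual flow from this directory.

    from datetime import timedelta

    import prefect
    from prefect import Flow, task
    from prefect.schedules import IntervalSchedule


    @task
    def extract():
        # Placeholder: pull raw records from a source system.
        return [{"id": 1}]


    @task
    def load(rows):
        # Placeholder: write the rows to a sink (e.g. Snowflake).
        prefect.context.get("logger").info("loaded %d rows", len(rows))


    # Run the flow hourly once it is registered with the Prefect backend.
    schedule = IntervalSchedule(interval=timedelta(hours=1))

    with Flow("example-extract-load", schedule=schedule) as flow:
        load(extract())

    if __name__ == "__main__":
        # Local test run; in production the flow is registered with the Prefect
        # server/cloud backend and picked up by the agent container.
        flow.run(run_on_schedule=False)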

etl

This is a public-facing cleaned-up version of minds-etl, with commit history curated to make the code easier to follow.

img-to-caption

This is a standalone docker-compose environment that generates captions from images posted to Minds, and generates topic category tags from the activity text and generated captions. The main script is caption-images.py, which performs the following operations (a simplified sketch follows the list):

  • Listen to the entities-ops Pulsar stream
  • When a new activity is identified on the stream, it is checked for image or video attachments
  • If there are image attachments, they are loaded from the CDN; if there is a video attachment, its thumbnail is loaded from the CDN
  • A BLIP-2 transformer model is used to generate a caption for each image
  • A record for the activity and a record for each caption are published to the captioned-activities Pulsar stream
  • The body text, tags, and captions for an activity are consolidated into a single text summary, which is fed to a trained LLM to generate a list of topical category tags
  • A record for the activity, including the text summary and list of topical category tags, is published to the inferred-tags Pulsar stream
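
A heavily simplified sketch of the image-caption leg of this loop is shown below, using the pulsar-client and Hugging Face transformers libraries. The broker URL, subscription name, message field names, and model checkpoint are assumptions for illustration; the real caption-images.py additionally handles video thumbnails, tag inference via the LLM, and publishing to inferred-tags as described above.

    import io
    import json

    import pulsar
    import requests
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    # Checkpoint is an assumption; the production script may pin a different BLIP-2 variant.
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to("cuda")

    client = pulsar.Client("pulsar://pulsar:6650")  # broker URL is an assumption
    consumer = client.subscribe("entities-ops", subscription_name="img-to-caption")
    producer = client.create_producer("captioned-activities")

    while True:
        msg = consumer.receive()
        activity = json.loads(msg.data())
        # Field names below ("image_urls", "guid") are hypothetical.
        for url in activity.get("image_urls", []):
            image = Image.open(io.BytesIO(requests.get(url).content)).convert("RGB")
            inputs = processor(images=image, return_tensors="pt").to("cuda")
            output_ids = model.generate(**inputs, max_new_tokens=30)
            caption = processor.decode(output_ids[0], skip_special_tokens=True).strip()
            producer.send(json.dumps(
                {"activity_guid": activity.get("guid"), "caption": caption}
            ).encode("utf-8"))
        consumer.acknowledge(msg)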

The /terraform directory contains the docker-compose and Terraform configs and setup scripts for provisioning a server to run the caption-images.py script in production. Because of the AI models used by this script, it must be provisioned on a GPU-accelerated server instance with the appropriate hardware specifications.
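
Since the script assumes CUDA hardware is present, a startup check along these lines is one way to fail fast if the instance was provisioned without a usable GPU; this exact check is illustrative and not necessarily what caption-images.py does.

    import torch

    # BLIP-2 inference on CPU is far too slow for a streaming workload, so refuse
    # to start unless a CUDA-capable GPU is visible to the container.
    if not torch.cuda.is_available():
        raise RuntimeError("caption-images requires a GPU-accelerated instance")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")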