WHT Project

Folder Structure

HelloWhtProject
├── wht-project.yaml
└── models
└── inputs.yaml
└── ml-features.yaml

wht_project.yaml

The project yaml file contains name, schema version, profile and entities name along with id types.

Here is an example of the yaml file:

# Project name
name: sample_attribution

# Project's yaml schema version
schema_version: 4

# WH Connection Profile to use.
profile: test

# Model folders to use.
model_folders:
  - models

# Entities in this project and their ids.
entities:
  - name: user
    id_types:
      - name: main_id
      - name: salesforce_id
      - name: rudder_anon_id
        filters:
          - type: exclude
            value: ""
          - type: exclude
            value: "unknown"
          - type: exclude
            value: "NaN"
      - name: email
        filters:
          - type: include
            regex: ".+@.+"
          - type: exclude
            value: "test@company.com"
      - name: domain

Let’s look at each field to know more about it.

name

It is the name of the project that you want to give.

type

string

schema_version

It is project yaml version. Each new version of schema comes with improvement and added functionality.

You can check the latest version of wht here.

type

int

profile

Selects one of the connection profiles defined in siteconfig.yaml to connect to the warehouse.

Specify name of the connection profile to use. More details in siteconfig.yaml.

type

string

model_folders

Folder where model files are stored.

type

string

entities

Lists all entities used in the project.

An entity in DBMS is a real-world thing or a real-world object which is distinguishable from other such objects.

Each entity has atleast one ID.

Each entry for an entity here is a JSON object specifying entity’s name and attributes.

type: list

properties

  • name

Name of the entity used in the project

type

string

  • id_types

An entity may be associated with multiple ID types.

Here we list all ID types associated with the current entity.

Each entry is an object with ID type’s name and attributes.

Each ID type is linked only with a single entity.

type: list

properties

  • name

Name of the ID type.

type

string

  • filters

We use filters to exclude or include some values from id types.

Whenever the current Id type appears, it may need to be filtered or transformed.

filters section contains type, value, and regex expression of the filter.

type: list

constraint: At a time, only one of regex or value field should be set.

properties

  • type

Valid filter types are ‘include’ and ‘exclude’.

Include filters accept the IDs on match.

Exclude reject the IDs on match.

type

string

  • value

A value to match.

Usually used to reject certain invalid ID values like ‘NaN’, ‘unknown’, ‘test@domain.com’ etc.

type

string

  • regex

Regex to match the values with.

May be used to both accept or reject matching IDs depending upon requirement.

type

regex string

Inputs

A input file outlines how to obtain values for models from various tables. We specify the table along with column name and SQL expression for retrieving values from various tables.:

inputs:
  - name: salesforceTasks
    table: salesforce.task
    timestamp_col: activitydate
    ids:
      # column name or sql expression
      - sql: "whoid"
        type: salesforce_id
  - name: salesforceContact
    table: salesforce.contact
    timestamp_col: createddate
    ids:
      - sql: "id"
        type: salesforce_id
      - sql: "case when lower(email) like any ('%gmail%', '%yahoo%') then lower(email)  else split_part(lower(email),'@',2) end"
        type: email
  - name: websitePageVisits
    table: autotrack.pages
    timestamp_col: timestamp
    ids:
      - sql: "anonymous_id"
        type: rudder_anon_id
      # below sql expression check the email type, if it is gmail and yahoo return email otherwise spilt email return domain of email.
      - sql: "case when lower(coalesce(context_traits_email, user_id)) like any ('%gmail%', '%yahoo%') then lower(coalesce(context_traits_email, user_id))  \
            else split_part(lower(coalesce(context_traits_email, user_id)),'@',2) end"
        type: email

In the inputs file we specify all the input tables on which the project depends. Input specification may also include metadata, including constraints on those columns.

Let’s look at schema of input YAML files to know more about them.

In each object under the inputs key, we specify the input tables on which the project depends.

Additional information like column metadata including constraints could be specified.

type

object

properties

  • name

A friendly name to be given to the input model.

type

string

  • table

Mention the table in data warehouse that one is referring from. The table may be prefixed with external schema or database in the same warehouse, if applicable

type

string

  • timestamp_col

The special timestamp column can be specified here. The timestamp column is used in case of event tables to time the event

type

string

  • ids

Some of the input columns may contain IDs of associated entities.

By their presence, such ID columns associate the row with the entity of the ID.

The Id Stitcher may use these ID declarations to automatically discover ID to ID Edges. Here, we specify list of all IDs from the source table that are present,

along with their column names(or column SQL expressions).

type: list

properties

  • sql

Use column name to use the value in the column as ID.

An SQL expression may also be specified, if some transformation is required.

type

string

  • type

The Id type to use.

All the ID types of a project are declared in wht_project.yaml and can specify additional filters on the column expression.

Each ID type is linked only with a single entity and determines the entity associated.

type

string

Models

A model describes the steps towards creating an output table and the input sources whose data is used during those steps. A model file in YAML describes one or more models. The following YAML describes two models. domain_profile_id_stitcher is an ID stitcher, while domain_profile is a feature table model.

models:
  - name: domain_profile_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - input: salesforceTasks
        - input: salesforceContact
        - input: websitePageVisits
        - input: webhookSource
        - input: websiteSource
        - input: blacklistDomains
        - input: domainSummary
        - input: webappIdentifies
        - input: salesForceLeadsTable
        - input: autotrackIdentifies
  - name: domain_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      id_stitching:
        model: domain_profile_id_stitcher
      features:
        - tablefeature:
            name: is_mql
            ref:
              input: salesForceLeadsTable
            value: max(case when salesForceLeadsTable.mql__c == 'True' then 1 else 0 end)
            description: Whether a domain is mql or not
        - tablevar:
            name: blacklistFlag
            ref:
              input: blacklistDomains
            filter: (context_sources_job_run_id = (select top 1 context_sources_job_run_id from blacklistDomains order by timestamp desc))
            value: max(case when exclude_reason is not null then 1 else 0 end)
        - tablevar:
            name: domainSummary_account_type
            ref:
              input: domainSummary
            value: domainSummary.account_type
        - feature:
            name: ignore_domain
            value: max(case when blacklistFlag = 1 or domainSummary_account_type like '%free%' then 1 else 0 end)
            description: Whether a domain should be ignored for the analysis
        - macro:
            name: call_conversion
            value: type = 'Call' or tasksubtype='Call' or type like '%demo%') and \
                  (rlike(lower(subject), '.*\\bcall\\b.*') or rlike(lower(subject), '.*\\bdemo\\b.*'
        - tablevar:
            name: salesCallEventTime
            ref:
              input: salesforceTasks
            value: min(activitydate)
            filter: macro(call_conversion)
        - feature:
            name: sales_conversion_timestamp
            value: salesCallEventTime
        - tablevar:
            name: salesEvents
            ref:
              input: salesforceTasks
            value: json_agg(activitydate, case when (type='Email' or tasksubtype = 'Email') then case when lower(subject) like '%[in]%' then 'sf_inbound_email' \
                  else 'sf_outbound_email' end when macro(call_conversion) then 'sf_call' else null end as event)
            description: Salesforce touches are converted to one of following events - sf_inbound_email, sf_outbound_email, sf_call, null
        - tablevar:
            name: page_events_var
            ref:
              input: websitePageVisits
            filter: timestamp < sales_conversion_timestamp and timestamp > var('start_date')
            value: json_agg(timestamp, case when split_part(trim(context_page_path, '/'), '/', 1) in var('valid_page_paths') then split_part(trim(context_page_path, '/'), '/', 1) \
                  else var('default_page_path') end as event)
        - feature:
            name: page_events
            value: page_events_var
        - tablevar:
            name: webhookFormSubmit
            ref:
              input: webhookSource
            filter: variable_1 is null and timestamp < sales_conversion_timestamp and timestamp > var('start_date')
            value:  min(timestamp)
        - feature:
            name: formSubmit
            value: min(webhookFormSubmit, websiteFormSubmit)

Let’s look at schema of model YAML files to know more about them.

In each object under the models key, we specify the complete recipe to create the model required by the project.

type

object

properties

  • name

A friendly name to be given to the input model.

type

string

  • model_type

Type determines the broad category of the model to be constructed.

Currently, only two model types are supported - ‘id_stitcher’ and ‘feature_table_model’.

type

string

  • model_spec

In the model spec, one can create a detailed configuration spec for the target model.

Different schema is applicable for model_spec of different models.

Check out the next section for detailed schema of supported models.

type

object

Supported model types

The different kinds of models that can be created are:

  • ID Stitcher - Data comes from different sources and different sources may assign different IDs. To be able to track a user’s journey uniquely across all these data sources, we need to stitch together all these IDs. Id stitching helps map different IDs of the same user to a single canonical ID. It does this by doing connected component analysis over the Id-Id edge graph specified in its configuration. Click here for syntax.

  • Feature Table - This allows you to create output tables with computations such as aggregation, average, etc. on selected fields, click here for syntax.

  • ML Notebooks - You can take ML Notebooks in Jupyter and add ML/AI capabilities to it.

  • External Feature Tables - Fetch data from your existing tables in the warehouse that are created by sources like DBT and Airbyte.

Access Control

Access control privileges determine which user can access to which schemas/materials/registries and run which models in a warehouse. As it is directly related to the data available in the warehouse, and all the warehouses have their own access control mechanisms, the model(DAC/RBAC) used for access control depends on the warehouse also.

Snowflake uses a combination of DAC and RBAC models for access control. Due to the use case of WHT where multiple users can launch the WHT CLI, we have gone with the RBAC based access control mechanism. Apart from this, we do not want the result of an Individual user run to be tied with that user. So we suggest creating a role (say RS360_RUNNER_ROLE) with the privileges explained in this document.

The role RS360_RUNNER_ROLE should have read access to all the inputs to the model, if multiple schemas/tables are involved then the same role can be shared. A run will create some materials so the role RS360_RUNNER_ROLE must have write access to those schemas and the common tables also e.g. Registry.

If these privileges are not assigned then the run will fail. If you want to access any material created during run or as a result of the run, the role RS360_RUNNER_ROLE should have read access to all of those schemas.

wht validate access does a check of whether the used role has access to read from all the input tables, write to schema and common in order to store the output of material and other information . If the privileges assigned to the role are sufficient then the project can be executed. for more details, please refer to validate.

Timestamp of WHT Model Inputs

Whenever RudderStack Events or Cloud Extract load data on a warehouse, then it mentions timestamp in a column specifying the date and time the data was loaded. Say you want your models to fetch data that was loaded in the last 24 hours, then specify validity_time: 24h in the model YAML file.

In case you wish to fetch all the data from source tables irrespective of timestamp, then you can add timeless parameter to the query.

wht run -t timeless