WHT Project¶
Folder Structure¶
wht_project.yaml¶
The project YAML file contains the project name, schema version, connection profile, and the entities used in the project along with their ID types.
Here is an example of the YAML file:
# Project name
name: sample_attribution
# Project's yaml schema version
schema_version: 4
# WH Connection Profile to use.
profile: test
# Model folders to use.
model_folders:
  - models
# Entities in this project and their ids.
entities:
  - name: user
    id_types:
      - name: main_id
      - name: salesforce_id
      - name: rudder_anon_id
        filters:
          - type: exclude
            value: ""
          - type: exclude
            value: "unknown"
          - type: exclude
            value: "NaN"
      - name: email
        filters:
          - type: include
            regex: ".+@.+"
          - type: exclude
            value: "test@company.com"
      - name: domain
Let’s look at each field in more detail.

name
    The name you want to give to the project.
    type: string

schema_version
    The version of the project YAML schema. Each new version of the schema comes with improvements and added functionality. You can check the latest version of wht here.
    type: int

profile
    Selects one of the connection profiles defined in siteconfig.yaml to connect to the warehouse. Specify the name of the connection profile to use. More details in siteconfig.yaml.
    type: string

model_folders
    Folders where the model files are stored.
    type: list of strings

entities
    Lists all entities used in the project. An entity, as in a DBMS, is a real-world object that is distinguishable from other such objects. Each entity has at least one ID. Each entry here is an object specifying the entity’s name and attributes.
    type: list
    properties:
        name
            Name of the entity used in the project.
            type: string
        id_types
            An entity may be associated with multiple ID types; here we list all ID types associated with the current entity. Each entry is an object with the ID type’s name and attributes. Each ID type is linked to only a single entity.
            type: list
            properties:
                name
                    Name of the ID type.
                    type: string
                filters
                    Filters include or exclude certain values of an ID type; whenever the current ID type appears, it may need to be filtered or transformed. The filters section contains the type of the filter along with the value or regex to match.
                    type: list
                    constraint: only one of the regex or value fields may be set in a single filter.
                    properties:
                        type
                            Valid filter types are ‘include’ and ‘exclude’. Include filters accept the IDs on match; exclude filters reject the IDs on match.
                            type: string
                        value
                            A value to match. Usually used to reject invalid ID values such as ‘NaN’, ‘unknown’, or ‘test@domain.com’.
                            type: string
                        regex
                            A regex to match values against. May be used to either accept or reject matching IDs, depending on the requirement.
                            type: regex string
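To make the include/exclude behavior concrete, here is a small standalone query (Snowflake-flavored SQL, illustrative only; this is not the SQL that wht generates, and the sample values are made up) showing the effect of the exclude filters above:
-- Illustrative only; not the SQL that wht generates. The exclude filters above
-- conceptually behave like this predicate: empty, 'unknown' and 'NaN' values are
-- dropped, so only 'anon_42' survives.
select id_value
from (values (''), ('unknown'), ('NaN'), ('anon_42')) as t(id_value)
where id_value not in ('', 'unknown', 'NaN');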
Inputs¶
An inputs file outlines how models obtain their values from various tables. For each input we specify the table, along with the column names or SQL expressions used to retrieve values from it:
inputs:
  - name: salesforceTasks
    table: salesforce.task
    timestamp_col: activitydate
    ids:
      # column name or sql expression
      - sql: "whoid"
        type: salesforce_id
  - name: salesforceContact
    table: salesforce.contact
    timestamp_col: createddate
    ids:
      - sql: "id"
        type: salesforce_id
      - sql: "case when lower(email) like any ('%gmail%', '%yahoo%') then lower(email) else split_part(lower(email),'@',2) end"
        type: email
  - name: websitePageVisits
    table: autotrack.pages
    timestamp_col: timestamp
    ids:
      - sql: "anonymous_id"
        type: rudder_anon_id
      # The sql expression below checks the email: for gmail/yahoo addresses it returns the
      # full email, otherwise it splits the email and returns only its domain.
      - sql: "case when lower(coalesce(context_traits_email, user_id)) like any ('%gmail%', '%yahoo%') then lower(coalesce(context_traits_email, user_id)) \
          else split_part(lower(coalesce(context_traits_email, user_id)),'@',2) end"
        type: email
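To see what the email expression above produces, you can run a standalone query like the following (Snowflake-flavored SQL; the sample values are made up for illustration):
-- Illustrative only: evaluate the email expression from the inputs above against sample values.
select
    email,
    case
        when lower(email) like any ('%gmail%', '%yahoo%') then lower(email)
        else split_part(lower(email), '@', 2)
    end as id_value
from (values
    ('Jane.Doe@gmail.com'),   -- personal address: kept as the full (lowercased) email
    ('jane@acme.com')         -- work address: reduced to its domain, 'acme.com'
) as t(email);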
In the inputs file we specify all the input tables on which the project depends. The input specification may also include metadata, such as constraints on the columns.
Let’s look at the schema of the input YAML to understand it better.

inputs
    Each object under the inputs key specifies an input table on which the project depends. Additional information such as column metadata, including constraints, can also be specified.
    type: object
    properties:
        name
            A friendly name to be given to the input model.
            type: string
        table
            The table in the data warehouse that is being referred to. The table may be prefixed with an external schema or database in the same warehouse, if applicable.
            type: string
        timestamp_col
            The timestamp column of the table, used when restricting input rows by time (see the Timestamp of WHT Model Inputs section below).
            type: string
        ids
            Some of the input columns may contain IDs of associated entities; by their presence, such ID columns associate the row with the entity of the ID. The ID Stitcher may use these ID declarations to automatically discover ID-to-ID edges. Here we list all the IDs present in the source table, along with their column names (or column SQL expressions).
            type: list
            properties:
                sql
                    The column name whose value is to be used as the ID. An SQL expression may also be specified if some transformation is required.
                    type: string
                type
                    The ID type to use. All the ID types of a project are declared in wht_project.yaml, where they can also specify additional filters on the column expression. Each ID type is linked to only a single entity, and so determines the entity associated with the row.
                    type: string
Models¶
A model describes the steps for creating an output table and the input sources whose data is used in those steps. A model file in YAML can describe one or more models. The following YAML describes two models: domain_profile_id_stitcher is an ID stitcher, while domain_profile is a feature table model.
models:
  - name: domain_profile_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      main_id_type: main_id
      edge_sources:
        - input: salesforceTasks
        - input: salesforceContact
        - input: websitePageVisits
        - input: webhookSource
        - input: websiteSource
        - input: blacklistDomains
        - input: domainSummary
        - input: webappIdentifies
        - input: salesForceLeadsTable
        - input: autotrackIdentifies
  - name: domain_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      id_stitching:
        model: domain_profile_id_stitcher
      features:
        - tablefeature:
            name: is_mql
            ref:
              input: salesForceLeadsTable
            value: max(case when salesForceLeadsTable.mql__c = 'True' then 1 else 0 end)
            description: Whether a domain is mql or not
        - tablevar:
            name: blacklistFlag
            ref:
              input: blacklistDomains
            filter: (context_sources_job_run_id = (select top 1 context_sources_job_run_id from blacklistDomains order by timestamp desc))
            value: max(case when exclude_reason is not null then 1 else 0 end)
        - tablevar:
            name: domainSummary_account_type
            ref:
              input: domainSummary
            value: domainSummary.account_type
        - feature:
            name: ignore_domain
            value: max(case when blacklistFlag = 1 or domainSummary_account_type like '%free%' then 1 else 0 end)
            description: Whether a domain should be ignored for the analysis
        - macro:
            name: call_conversion
            value: (type = 'Call' or tasksubtype='Call' or type like '%demo%') and \
              (rlike(lower(subject), '.*\\bcall\\b.*') or rlike(lower(subject), '.*\\bdemo\\b.*'))
        - tablevar:
            name: salesCallEventTime
            ref:
              input: salesforceTasks
            value: min(activitydate)
            filter: macro(call_conversion)
        - feature:
            name: sales_conversion_timestamp
            value: salesCallEventTime
        - tablevar:
            name: salesEvents
            ref:
              input: salesforceTasks
            value: json_agg(activitydate, case when (type='Email' or tasksubtype = 'Email') then case when lower(subject) like '%[in]%' then 'sf_inbound_email' \
              else 'sf_outbound_email' end when macro(call_conversion) then 'sf_call' else null end as event)
            description: Salesforce touches are converted to one of the following events - sf_inbound_email, sf_outbound_email, sf_call, null
        - tablevar:
            name: page_events_var
            ref:
              input: websitePageVisits
            filter: timestamp < sales_conversion_timestamp and timestamp > var('start_date')
            value: json_agg(timestamp, case when split_part(trim(context_page_path, '/'), '/', 1) in var('valid_page_paths') then split_part(trim(context_page_path, '/'), '/', 1) \
              else var('default_page_path') end as event)
        - feature:
            name: page_events
            value: page_events_var
        - tablevar:
            name: webhookFormSubmit
            ref:
              input: webhookSource
            filter: variable_1 is null and timestamp < sales_conversion_timestamp and timestamp > var('start_date')
            value: min(timestamp)
        - feature:
            name: formSubmit
            value: min(webhookFormSubmit, websiteFormSubmit)
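To give a sense of what a tablefeature such as is_mql above amounts to, here is a rough sketch in SQL (illustrative only; salesforce_leads_with_main_id is a hypothetical name, and this is not the SQL that wht generates):
-- Rough sketch only, not generated by wht. 'salesforce_leads_with_main_id' is a
-- hypothetical view of salesForceLeadsTable in which each row already carries the
-- stitched canonical main_id produced by domain_profile_id_stitcher.
select
    main_id,
    max(case when mql__c = 'True' then 1 else 0 end) as is_mql
from salesforce_leads_with_main_id
group by main_id;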
Let’s look at the schema of the model YAML to understand it better.

models
    Each object under the models key specifies the complete recipe for creating a model required by the project.
    type: object
    properties:
        name
            A friendly name to be given to the model.
            type: string
        model_type
            The type determines the broad category of the model to be constructed. Currently, two model types are supported: ‘id_stitcher’ and ‘feature_table_model’.
            type: string
        model_spec
            The detailed configuration spec for the target model. A different schema applies to the model_spec of each model type; check out the next section for the detailed schema of the supported models.
            type: object
Supported model types
The different kinds of models that can be created are:
ID Stitcher - Data comes from different sources, and different sources may assign different IDs. To track a user’s journey uniquely across all these data sources, we need to stitch together all these IDs. ID stitching maps the different IDs of the same user to a single canonical ID. It does this by running connected-component analysis over the ID-to-ID edge graph specified in its configuration (see the illustration after this list). Click here for syntax.
Feature Table - Allows you to create output tables with computations such as aggregations, averages, etc. on selected fields. Click here for syntax.
ML Notebooks - Take ML notebooks in Jupyter and add ML/AI capabilities to the project.
External Feature Tables - Fetch data from your existing tables in the warehouse that are created by sources like DBT and Airbyte.
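As a rough illustration of the edge graph the ID stitcher works over (the identifiers below are made up), each input row that carries two IDs contributes an edge, and connected-component analysis then assigns every ID in a component the same canonical main_id:
-- Illustrative only: a toy ID-to-ID edge list of the kind the ID stitcher builds
-- from its edge_sources. Connected components over these edges would map
-- 'rudder_anon_id:a1', 'email:jane@acme.com' and 'salesforce_id:s1' to one main_id.
select *
from (values
    ('rudder_anon_id:a1',   'email:jane@acme.com'),   -- e.g. from websitePageVisits
    ('email:jane@acme.com', 'salesforce_id:s1')       -- e.g. from salesforceContact
) as edges(id_a, id_b);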
Access Control¶
Access control privileges determine which users can access which schemas/materials/registries and run which models in a warehouse. Since access control is directly tied to the data available in the warehouse, and every warehouse has its own access control mechanism, the model used for access control (DAC/RBAC) also depends on the warehouse.
Snowflake
Snowflake uses a combination of the DAC and RBAC models for access control. Because multiple users can launch the WHT CLI, we have gone with the RBAC-based access control mechanism. Apart from this, we do not want the result of an individual user's run to be tied to that user. So we suggest creating a role (say RS360_RUNNER_ROLE) with the privileges explained in this document.
The role RS360_RUNNER_ROLE should have read access to all the inputs to the models; if multiple schemas/tables are involved, the same role can be shared. A run creates some materials, so the role RS360_RUNNER_ROLE must also have write access to those schemas and to the common tables, e.g. the registry.
If these privileges are not assigned, the run will fail. If you want to access any material created during the run, or as a result of the run, the role RS360_RUNNER_ROLE should have read access to all of those schemas.
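As a minimal sketch of those grants in Snowflake (the database, schema, and user names below are placeholders; adapt them to your own warehouse layout):
-- Minimal sketch only; object names are placeholders, adjust to your setup.
create role if not exists RS360_RUNNER_ROLE;

-- Read access to the schemas holding the input tables.
grant usage on database ANALYTICS to role RS360_RUNNER_ROLE;
grant usage on schema ANALYTICS.SALESFORCE to role RS360_RUNNER_ROLE;
grant select on all tables in schema ANALYTICS.SALESFORCE to role RS360_RUNNER_ROLE;

-- Write access to the schema where materials and common tables (e.g. the registry) are created.
grant usage, create table, create view on schema ANALYTICS.RS_PROFILES to role RS360_RUNNER_ROLE;

-- Let the user who runs the WHT CLI assume the role.
grant role RS360_RUNNER_ROLE to user WHT_CLI_USER;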
wht validate access checks whether the role being used has read access to all the input tables, and write access to the schema and common tables, so that the output materials and other information can be stored. If the privileges assigned to the role are sufficient, the project can be executed. For more details, please refer to validate.
Timestamp of WHT Model Inputs¶
Whenever RudderStack Events or Cloud Extract loads data into a warehouse, it records a timestamp in a column specifying the date and time the data was loaded. Say you want your models to fetch only the data that was loaded in the last 24 hours; then specify validity_time: 24h in the model YAML file.
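Conceptually, that restricts each input by its timestamp_col, roughly like the following standalone query (illustrative only, using the salesforceTasks input from above; this is not the SQL that wht generates):
-- Illustrative only: with validity_time: 24h, rows of salesforce.task older than
-- 24 hours (by its timestamp_col, activitydate) would conceptually be out of scope.
select *
from salesforce.task
where activitydate >= dateadd(hour, -24, current_timestamp());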
If you wish to fetch all the data from the source tables irrespective of the timestamp, you can run the project with the timeless parameter:
wht run -t timeless