Tutorials

Migrate your existing project

You can use the migrate command to migrate your project to the latest schema.

Auto migrate

Here, pb will automatically migrate your project to the newest schema version, without requiring any manual efforts by the user.

To do so: Navigate to the folder where your project files are stored. Then execute one of the following:

pb migrate auto --inplace (to replace contents of existing folder with the migrated one)

OR

pb migrate auto -d NewFolder (store migrated project files in a folder named NewFolder)

You will see a message on screen such as “Project migrated from version 30 to version 44”, confirming that migration is now complete.

Note

migrate auto works on schema 18 or above.

Manual migrate

Here, pb will list all the steps you can follow to migrate your project.

To do so: Execute command pb migrate manual and follow the suggested steps.

To manually migrate your project from Schema 42 to 43:

Kindly use {{entity-name.Var(var-name)}} to refer to an entity-var or an input-var. For example, for entity_var user_lifespan in HelloPbProject, change select: last_seen - first_seen to select: '{{user.Var("last_seen")}} - {{user.Var("first_seen")}}'.

Note: 1. Two curly brackets are a must. 2. If anything starts with double curly brackets then has to be inside double quotes “”. If we use single quotes inside double quotes then we have to use escape character () which comes when using macros.

Linear dependency - This parameter is to be specified when entity as vars migration is not done (till version 43). After that is done, it’s not necessary to mention these, and can be removed.

compatibility_mode:
  linear_dependency_of_vars: true

SQL Template Models

SQL Template Models allow users to create models with custom sql. It aims at increasing the capability of end user to build complex models with custom SQL syntax. SQL Template models can be used as an input to an entity-var/ input-var in feature table or as an edge-source in id-stitcher.

Example UseCases:

  1. Provide capability to construct more complex models which is difficult otherwise. For example, we can create a model using FLATTEN json data, JOIN tables etc. on any existing table or set of tables in the warehouse. Or we can perform joins over multiple tables to get some output which would not be possible otherwise.

  2. In a PB project, customers can use models from imported library packages. In that case, there might be deviation in table names and column names present in warehouse vs the table name and column name assumed in package models. SQL Models can be used to remap fields to handle deviations.

A SQL Template Model can be of 2 types:

  • single_sql: It should be a single select statement.

  • multi_sql: It can have multiple sql statements. But eventually, it should have a create statement which creates the required table.

Below, there are examples of both of them.

single_sql:

models:
- name: test_sql
  model_type: sql_template
  model_spec:
    validity_time: 24h# 1 day
    materialization:                 // optional
      output_type: ephemeral         // optional [table, view, ephemeral]
      run_type: discrete             // optional [discrete, incremental]
    single_sql: |
        {%- with input1 = this.DeRef("inputs/tbl_a") -%}
          select id1 as new_id1, id2 as new_id2, {{input1}}.*
            from {{input1}}
        {%- endwith -%}
    occurred_at_col: insert_ts        // optional
    ids:
      - select: "new_id1"
        type: test_id
        entity: user
      - select: "new_id2"
        type: test_id
        entity: user
      - select: "id3"
        type: test_id
        entity: user

multi_sql:

models:
- name: test_sql2
    model_type: sql_template
    model_spec:
      validity_time: 24h # 1 day
      materialization:
        output_type: table
        run_type: discrete
      multi_sql: |
        {% with input_material1 = this.DeRef("models/test_sql1") input_material2 = this.DeRef("inputs/tbl_a") input_material3 = this.DeRef("inputs/tbl_c") %}
          create {{this.GetMaterialization().OutputType.ToSql()}} {{this}} as (
            select b.id1, b.id2, b.id3, b.insert_ts, a.new_id1, a.num_a, c.num_b, c.num_c
            from {{ input_material1 }} a
            full outer join {{ input_material2 }} b
            on a.id2 = b.id2
            full outer join {{ input_material3 }} c
            on c.id2 = a.id2
          );
        {% endwith  %}

A SQL Template has following fields:

  1. name: Name with which the model is to be referred. You can later refer this table as “models/test_sql”

  2. model_type: sql_template

  3. model_spec: Contains the specifications of the model definition.

    1. validity_time

    2. materialization: It is explained in Materialisation Support section below.

    3. single_sql: This field will have the SQL template text added to serve the purpose for which this model was added. This SQL must evaluate to a single SELECT sql statement. After execution, it should produce a dataset which will materialize based on the provided materialization.

    4. multi_sql: This field will contain a sql template which can evaluate into multiple sql statements. The last statement must be a CREATE statement which is responsible for materializing the model into a table. Currently output_type of only table is supported.

    5. occurred_at_col: Adding this field means that this model is an event-stream model. The value should point to the column which contains timestamp value in the output of sql template.

    6. ids: Defining Ids is also supported like in input models. It is required in case we want to use sql model as an input to an input_var or entity_var.

Materialisation Support

materialization is the strategy of creating and persisting models into warehouse. It is composed of two things:

  1. output_type: Possible values are table, view, ephemeral.

  2. run_type: Possible values are discrete, incremental.

For SQL Template Models, we support all output types. Only discrete RunType is supported as of now.

  • A multi_sql Sql model can only be a table.

  • A single_sql Sql model supports all output_type.

For Sql template models, ephemeral is the default output type. Also, discrete is the default RunType.

Writing Sql Templates

  • Sql Templates can be written following the syntax of pongo2 template engine.

  • Sql Template model provides the flexibility to write custom sql and create outputs using the listed inputs. We can refer to any of the input models listed in the inputs.yaml or any of the models created in profiles.yaml.

  • While referencing models, circular referencing should be avoided. For example sql_model_a references sql_model_b and sql_model_b references sql_model_a.

  • The syntax {{ this.DeRef("path/to/model") }} can be used to reference any model.

Variables available in the sql_template:

  • this is the current model’s material which exposes a set of functionality to the template writer. this.DeRef("reference/to/model") lets us refer a particular input model inside the template. A complete set of available methods available on material (this) is listed in section below.

  • timestamp: timestamp at which the current run was started. My be used to filter new events.

Material Properties Accessible in SQL Template Models

  1. DeRef("path/to/model"):

Here, we refer to an input model through this.DeRef. It returns the DB Object corresponding to tbl_a. The DB object for any model is gives the actual name of the table/view in the warehouse. On the output, any of methods listed here is applicable. For example:

{% with input_table = this.DeRef("inputs/tbl_a") %}
    select a as new_a, b as new_b, {{input_table}}.*
      from {{input_table}}
{% endwith %}
  1. GetMaterialization():

GetMaterialization returns a struct with two fields: MaterializationSpec{OutputType, RunType}. It can be used as per the two examples given. OutputType should always be used with ToSql() method.

Examples:
  1. CREATE OR REPLACE {{this.GetMaterialization().OutputType.ToSql()}} {{this.GetSelectTargetSql()}} AS ...

  2. this.GetMaterialization().RunType

Other Helper Methods:

GetTimeFilteringColSql():

It gets the timestamp column of any material. Helps us in filtering out rows based on timestamp. It may return the occurred_at_col in case of an event_stream table or valid_at in case the material has that column. In absense of both of these, it will return an empty string. occurred_at column exists only when the model is an event stream. Example:

SELECT * FROM {{<from_material>}}
  WHERE
    <from_material>.GetTimeFilteringColSql() > <some_timestamp>;

Models enabling themselves

In the context of a dependency graph of models, an option has been introduced to set the required enable_status for each model using YAML. This allows us to specify the necessary materialization for the models, ensuring that only the required models are executed. Models that are deemed unnecessary can be excluded from the execution process.

Use Case:

Consider the scenario where an idStitcher model depends on tbl_a and tbl_b, and a Feature Table (ft1) depends on IdStitcher (ids) and an input model tbl_a.

If the ids model is marked as NotNeeded, and ft1 is also not required, there is no need to execute either of them. However, if ft1 is marked as GoodToHave or MustHave, we must run ids as well, resulting in the execution of all the models.

If tbl_a is disabled, the creation of ids becomes impossible, and consequently, ft1 will not run either. If ft1 or ids is marked as MustHave, the project will error out saying configuration is not possible at the time of load only. So the run wont start in such a case.

How it Works:

To implement the required materialization status for each model, in YAML writer you can include the option to specify the desired status. The materialization field in YAML can the following options:

  • MustHave: This indicates that the run should fail if it is not possible to execute the model.

  • GoodToHave: If execution is not possible for this model, it will not cause a failure.

  • NotNeeded: This status marks the model as not enabled if it does not fall within the dependency graph.

  • Disabled: The model is explicitly disabled and will not be executed.

The default value for the materialization field is GoodToHave. However, specific default values are set for certain cases:

The default_id_stitcher model is set to NotNeeded by default. This ensures that if the model is not required, it will not be executed.

Example project file

name: app_project
schema_version: 49
connection: test
include_untimed: true
model_folders:
  - models
entities:
  - name: user
    id_types:
      - user_id
      - anonymous_id
    default_id_stitcher:
      validity_time: 24h # 1 day
      materialization:
        run_type: incremental
        enable_status: goodToHave
      incremental_timedelta: 12h # half a day

id_types:
  - name: user_id
    filters:
      - type: include
        regex: "([0-9a-z])*"
      - type: exclude
        value: ""
  - name: anonymous_id

Model Contracts

Model contract allow us to declare constraints that the model should adheres. A model having dependency on other model would need to declare a contract specifying what all columns and entities input model must have, for contract validation those columns should be present in the referenced model.

For an input of a project, e.g. a library project, the model contract is used to enforce constraints on tables/views that get wired to it downstream.

# inputs.yaml
  - name: rsIdentifies
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
        - name: email

In SQL model, the contract contains all the columns from ID’s and features. Each internal model also publishes the contract it promises to adhere to. Suppose rsSessionTable has an input shopify_session_features. Model contracts enable rsSessionTable to specify the constraints that shopify_session_features must adhere to.

models:
- name: rsSessionTable
  model_type: sql_template
  model_spec:
    ... # model specifications
    single_sql: |
            {% set contract = BuildContract('{"with_columns":[{"name":"user_id"}, {"name":"anonymous_id"}]}') %}
      {% with SessionFeature = this.DeRef("models/shopify_session_features",contract)%}
          select user_id as id1, anonymous_id as id2 from {{SessionFeature}}
      {% endwith %}
    contract:
      with_entity_ids:
        - user
      with_columns:
        - name: id1
          type: string
        - name: id2
          type: string

Here, rsSessionTable is declaring that its input shopify_session_features must have columns user_id and anonymous_id. This will help in improving data quality and error handling. Internally, this requested contract is validated against shopify_session_features’s actual contract. For validation to pass, shopify_session_features_contract must be a subset of shopify_session_features’s published contract.

This enables more comprehensive static and dynamic validations of our projects.

Best Schema Version (tags)

We recommend using git-tags instead of the latest commit on main branch of library projects. That is, https://github.com/org-name/lib-name/tag/schema_{{best_schema_version}}. The selection of compatible git tags is done by PB, that is it will figure out the best compatible version for the lib package.

A sample project file:

packages:
  - name: shopify_features
    url: https://github.com/org-name/lib-names/tag/schema_{{best_schema_version}}
    inputsMap:
      rsCartUpdate: inputs/rsCartUpdate
      rsIdentifies: inputs/rsIdentifies

Using this will make our tool use the best compatible version of the library project, in case of any schema updates.

Please note: You don’t have to replace the placeholder {{best_schema_version}}. For instance, if https://github.com/org-name/lib-names/tags/ has a tag for schema_44, then https://github.com/org-name/lib-names/tag/schema_44 will be automatically used. Still, if you do replace the placeholder with actual tag name, then also the project will work.

Id Collator Model

Id Stitching is one of the most important features of profiles. Being able to do id stitching to determine which all accounts belong to the same customer is very important to get a 360 degree view of customers.

However many a times, we may not require id stitching for a particular entity. The reason may be that there are no edges in the id graph of an entity. If we want to build a feature table on such an entity, we still have to run id stitching. This although not wrong, is computationally redundant.

The idea of idCollator is to get all ids of that particular entity from various input tables and create one collated list of ids.

Example

Let’s take a case where we have defined two entities in our project. One is user and the other is session.

If user entity has multiple ids defined, there are basically edges which make the use of an id-stitcher logical. On the other hand, session may have only one id, ssn_id, there won’t be any possibility of edges. In such a case, all we need is a complete list of ssn_id’s.

Here is corresponding inputs and entities definition.

entities:
  - name: user
    id_column_name: user_rud_id
    id_types:
      - user_id
      - anonymous_id
  - name: session
    id_column_name: session_id
    id_types:
      - ssn_id

Project file:

inputs:
  - name: user_accounts
    table: tbl_user_accounts
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
  - name: sign_in
    table: tbl_sign_in
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "ssn_id"
        type: ssn_id
        entity: session
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
  - name: sign_up
    table: tbl_sign_up
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "ssn_id"
        type: ssn_id
        entity: session
      - select: "anonymous_id"
        type: anonymous_id
        entity: user

Here, the entity: session has only one Id type. Creating an idStitcher for such an entity is possible but unnecessary and an overkill.

Using all the models having ssn_id, we can just make a union of all ssn_id and get all distinct values of it and be clear that this is the final list of sessions without having to stitch anything.

The underlying sql may look like:

SELECT ssn_id as session_id from sign_in
        UNION DISTINCT
    SELECT ssn_id as session_id from sign_up
;

YAML Changes

Yaml writer cannot define a custom IdCollator the way they define a custom IdStitcher. If an entity has no edges, the pb project will automatically figure out if an IdCollator is needed. To exclude certain inputs (having the required id) from being used in the collation, we can just set to_id_stitcher: false in the input.

entities:
  - name: session
    id_column_name: session_id
    id_types:
      - ssn_id

The id_column_name is a new field added in the entity definition which will be the name of the id column and it applies to both id-stitcher and id-collator.

One thing to note here is that in idCollator we won’t generate a uuid like we do in idStitcher.

Comparing IdCollator & IdStitcher

ID Stitcher

ID Collator

Uses edges to converge the id graph.

Collates all distinct ids as there is only one Id Type and no edges are present.

Higher cost of computation.

Lower cost of computation.

A uuid is generated and used as the unique identifier for the entity.

Collates the existing ids only.

The generated id is always of the type: rudder_id

The id column of the generated idCollated table/view will be of the idType of the corresponding id.

User may override the default IdStitcher with custom one.

Can’t override the default idCollator, though user may define a custom idStitcher to override default IdCollator.

Associating SSH Key to Git Project

Follow the below steps to add public SSH key to your git project.

  1. Open your profile’s Git Repository in web browser and click on “Settings” tab.

  2. Then click “SSH Keys”.

  3. Give any name, say “Sample Profiles Key”, and paste the key generated by RudderStack web app OR public-key generated using CLI.

  4. You need not check “Allow write access”. Click on “Add Key”.

For additional reference, you may check out these links:

Note

RudderStack is only getting access to read the Git project to which the generated key was added. RudderStack would not have access to any other project. Moreover, we store all secrets securely.

Warning

On the web app, some legacy Profiles have issues, in case SSH key wasn’t generated for them. We recommend creating a new Profile in such a case.

Using private Git repos via CLI

It can be done using the following steps:

Step 1: Generating SSH key:

Refer to this.

Step 2: Associating SSH Key to Git Project:

Step 3: Adding private keys as credentials:

Given below needs to be added in siteconfig.yaml:

gitcreds:
  - reporegex: git@<provider-host>:<org-name>/*
    key: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEb..........
    -----END OPENSSH PRIVATE KEY-----

Custom ID Stitcher

You can define a custom ID stitcher in certain cases. To achieve this, you define the id stitching model in models/profiles.yaml.

models:
- name: sample_id_stitcher # This would be name of view in WH generated by id stitching run.
  model_type: id_stitcher # This must be as in this example. Two types of models are supported as of now. One for id stitching and another for feature table.
  model_spec:
    validity_time: 24h # 1 day.  Specifies how long a model is considered valid for, with respect to its timestamp.
    entity_key: user
    materialization:
      run_type: incremental # This allows you to specify the materialization run type (discrete or incremental).  If not specified, defaults to discrete
    incremental_timedelta: 12h # buffer time window to fetch new data from checkpoint time of previous model run. This is to ensure no data loss. For example, if first run read data till timestamp T, next run will fetch new data from timestamp `T-incremental_timedelta`, which is 12 hours in this case. Note: applicable for materialization run_type: incremental
    main_id_type: main_id # This is how you want to name the stitched canonical id. Do not add this key unless it's explicitly required, like if you want your ID stitcher table's main_id column to be called ``main_id``.
    edge_sources: # data sources which has various identifiers so that mapping can be used.
      - inputs/rsIdentifies
      - inputs/rsTracks

Now that above sample yaml files have briefly explained how an ID stitching model can be configured, lets define each field in detail. This should help you write your own model considering multiple data sources for identifiers.

Entities

Here we specify entities which models can be defined for.

type

object

properties

  • entity_key

A friendly name to be given to the entity. e.g.: user, company, prospect, etc.

type

string

  • id_types

This defines list of identifiers for the entity.

These may have a many-to-1 relationship to an entity (a user may have many session_ids over time).

But each id must belong to a single entity.

For example, a user entity might have id types for Salesforce ID, RudderStack anonymous ID, Email and Session Id.

It should not include something like IP Address though,

as a single IP could have been used by different users at different times,

and this is not an identifier for a user.

type

list

  • filters

Optional, in case you want to apply any include or exclude conditions for fetching data.

(same as value as in where clause in SQL or a regex).

type

string

Note

id_types is not the same as database column type, such as char or int.

Model Input

In each object under the inputs key, we specify the input tables on which the project depends.

type

object

properties

  • name

A friendly name to be given to the input model.

type

string

  • contract

A contract has information about that model such as the columns it should have and the entity ID’s in it. This is crucial for other models that depend on it, as it helps find errors early and closer to the point of their origin.

type: list

properties

  • is_optional

Whether the model’s existence in the warehouse is mandatory..

type

boolean

  • is_event_stream

Whether the table/view is a series/stream of events or not. or not. A model is an event stream model if and only if it has a timestamp column.

type

boolean

  • with_entity_ids

List of all entities with which this model is related. A model M1 is considered related to model M2 if there is an ID of model M2 among M1’s output columns.

type

list

  • with_columns

List of all columns that the model should have.

type

list

  • app_defaults

Values the input defaults to, when the project is run directly. For library projects, inputs can be remapped and appdefaults overridden, when library projects are imported.

type: list

properties

  • table

The table in warehouse that holds the data.

The table may be prefixed with external schema or database in the same warehouse, if applicable.

Either table/view/csv must be set, for each input source.

type

string

  • view

The view in warehouse that holds the data.

The view may be prefixed with external schema or database in the same warehouse, if applicable.

Either table/view/csv must be set, for each input source.

type

string

  • csv

The csv file on your local storage that holds the data.

The file path should be relative to the project folder.

Either table/view/csv must be set, for each input source.

type

string

  • occurred_at_col

The timestamp col.

If a model is run for a particular timestamp t1, it will use this column to filter the data to exclude data after t1.

I.e., ... WHERE <timestamp_col> < t1.

Optional, but strongly recommended.

type

string

  • ids

Some of the input columns may contain IDs of associated entities.

By their presence, such ID columns associate the row with the entity of the ID.

The Id Stitcher may use these ID declarations to automatically discover ID to ID Edges.

Here, we specify list of all IDs from the source table that are present,

along with their column names(or column SQL expressions).

type: list

properties

  • select

Use column name to use the value in the column as ID.

An SQL expression may also be specified, if some transformation is required.

type

string

  • type

The ID type to use.

All the ID types of a project are declared in pb_project.yaml

and can specify additional filters on the column expression.

Each ID type is linked only with a single entity and determines the entity associated.

type

string

  • entity

The entity, as defined in project file, to which the ID belongs to.

type

string

  • to_default_stitcher

Optional, it needs to be set to true for an ID to be picked in the default ID stitcher.

By default it is set to false, without impacting if the project is using custom ID stitcher.

In your project file, if you remove the key id_stitcher: models/<name of ID stitcher model>, then it’ll use the default ID stitcher

and create a material view of the name <entity_name>_default_id_stitcher.

type

boolean

Warning

In a model, an input can’t use columns named “MAIN_ID”, “OTHER_ID”, “OTHER_ID_TYPE”, or “VALID_AT” in its ID SQL.

Model Definition

Give specifications of ID stitching.

type: object

properties

  • name

Name of the ID stitched table created on the DW.

Say if you define this as final_id_stitcher then the output table will be named something like

Material_final_id_stitcher_<rest of generated hash>_<timestamp number>.

type

string

  • model_type

Define the types of models.

For ID stitching, set it to id_stitcher.

type

string

  • model_spec

Specifications for creating the model

type

object

properties

  • validity_time

Specifies how long a model is considered valid for, with respect to its timestamp.

For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with

validity_time: 24h would still be considered potentially valid and usable for any run requests,

which do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC.

type

time

  • entity_key

Specify relevant entity from your input.yaml.

For example, here it should be set to user.

type

string

  • main_id_type

Optional, an id type reserved for the output of the id stitcher.

It must not be used in any of the inputs and must be listed as an id type for the entity being stitched.

If you do not set then it’s the default rudder_id.

Often set it to main_id.

Do not add this key unless it’s explicitly required,

like if you want your ID stitcher table’s main_id column to be called main_id.

type

ProjectRef

  • materialization

Add the key run_type: incremental to run in incremental mode.

This mode will consider inserts and any updated rows from edge sources input.

These are inferred by checking timestamp column for next run.

One can provide buffer time to consider any lag in data in WH for next incremental run.

Like if new rows get added in the time it’s being run.

If you do not specify this key then it’ll default to run_type: discrete.

type

list

  • incremental_timedelta

If you set the materialization key to run_type: incremental then it sets

how far back prior to the previous material for a model data should be fetched (to handle data lag, for example).

It is optional, if not set then it defaults to 4 days.

type

list

  • edge_sources

In input yaml file, we specify the different sources of input tables/views, that are used to create the ID Stitcher view.

Here we specify a list having names of all such inputs.

type

list

Warning

Incremental run ignores deleted rows from edge sources. To consider those rows, you’ll have to run in full refresh mode again i.e. run_type: discrete.

Partial feature tables

If a library package is imported and some of the input models assumed in the package are not present in the warehouse, still you can run the project. What happens is when you remap some of the input models to nil, those inputs get disabled and correspondingly the features directly or indirectly dependent upon them get disabled. So a partial feature table will still be created from whatever inputs are available. Similarly, IdStitcher also runs if few of the edge sources are not present in the warehouse and remapped to nil.