Tutorials
Migrate your existing project
You can use the migrate command to migrate your project to the latest schema.
Auto migrate
Here, pb automatically migrates your project to the newest schema version, without requiring any manual effort.
To do so, navigate to the folder where your project files are stored, then execute one of the following:
pb migrate auto --inplace
(replaces the contents of the existing folder with the migrated project)
OR
pb migrate auto -d NewFolder
(stores the migrated project files in a folder named NewFolder)
You will see a message on screen such as “Project migrated from version 30 to version 44”, confirming that the migration is complete.
Note
migrate auto
works on schema versions 18 and above.
Manual migrate
Here, pb lists all the steps you can follow to migrate your project.
To do so, execute the command pb migrate manual and follow the suggested steps.
To manually migrate your project from schema 42 to 43:
Use {{entity-name.Var(var-name)}} to refer to an entity-var or an input-var.
For example, for the entity_var user_lifespan in HelloPbProject, change select: last_seen - first_seen to select: '{{user.Var("last_seen")}} - {{user.Var("first_seen")}}'.
Note: 1. The double curly brackets are mandatory. 2. Anything that starts with double curly brackets must be inside quotes “”. 3. If single quotes are used inside double quotes, you must use the escape character (\), a situation that typically arises when using macros.
Linear dependency: this parameter is to be specified when the entity-vars migration has not yet been done (up to version 43). Once that migration is done, it is no longer necessary and can be removed.
compatibility_mode:
  linear_dependency_of_vars: true
SQL Template Models
SQL Template Models allow users to create models with custom SQL, increasing the end user's ability to build complex models with custom SQL syntax. A SQL Template model can be used as an input to an entity-var/input-var in a feature table, or as an edge-source in an id-stitcher.
Example use cases:
Construct more complex models that would be difficult to build otherwise. For example, you can create a model that FLATTENs JSON data, JOINs tables, etc., on any existing table or set of tables in the warehouse, or perform joins over multiple tables to produce output that would not be possible otherwise.
In a PB project, customers can use models from imported library packages. In that case, the table and column names present in the warehouse may deviate from those assumed in the package models. SQL models can be used to remap fields and handle such deviations.
A SQL Template Model can be of 2 types:
single_sql: must be a single SELECT statement.
multi_sql: can contain multiple SQL statements, but must end with a CREATE statement that creates the required table.
Below are examples of both.
single_sql:
models:
  - name: test_sql
    model_type: sql_template
    model_spec:
      validity_time: 24h # 1 day
      materialization: # optional
        output_type: ephemeral # optional [table, view, ephemeral]
        run_type: discrete # optional [discrete, incremental]
      single_sql: |
        {%- with input1 = this.DeRef("inputs/tbl_a") -%}
          select id1 as new_id1, id2 as new_id2, {{input1}}.*
          from {{input1}}
        {%- endwith -%}
      occurred_at_col: insert_ts # optional
      ids:
        - select: "new_id1"
          type: test_id
          entity: user
        - select: "new_id2"
          type: test_id
          entity: user
        - select: "id3"
          type: test_id
          entity: user
multi_sql:
models:
  - name: test_sql2
    model_type: sql_template
    model_spec:
      validity_time: 24h # 1 day
      materialization:
        output_type: table
        run_type: discrete
      multi_sql: |
        {% with input_material1 = this.DeRef("models/test_sql1") input_material2 = this.DeRef("inputs/tbl_a") input_material3 = this.DeRef("inputs/tbl_c") %}
          create {{this.GetMaterialization().OutputType.ToSql()}} {{this}} as (
            select b.id1, b.id2, b.id3, b.insert_ts, a.new_id1, a.num_a, c.num_b, c.num_c
            from {{ input_material1 }} a
              full outer join {{ input_material2 }} b
              on a.id2 = b.id2
              full outer join {{ input_material3 }} c
              on c.id2 = a.id2
          );
        {% endwith %}
A SQL Template model has the following fields:
name: Name with which the model is referred. You can later refer to this table as “models/test_sql”.
model_type: sql_template.
model_spec: Contains the specifications of the model definition.
validity_time: Specifies how long a model run is considered valid for, with respect to its timestamp.
materialization: Explained in the Materialisation Support section below.
single_sql: The SQL template text that serves the purpose for which this model was added. This SQL must evaluate to a single SELECT statement. After execution, it should produce a dataset which materializes based on the provided materialization.
multi_sql: A SQL template which can evaluate to multiple SQL statements. The last statement must be a CREATE statement, which is responsible for materializing the model into a table. Currently, only an output_type of table is supported.
occurred_at_col: Adding this field means that this model is an event-stream model. The value should point to the column which contains the timestamp value in the output of the SQL template.
ids: Defining IDs is also supported, as in input models. It is required if you want to use the SQL model as an input to an input_var or entity_var.
Materialisation Support
materialization is the strategy of creating and persisting models into the warehouse. It is composed of two things:
output_type: possible values are table, view, ephemeral.
run_type: possible values are discrete, incremental.
For SQL Template models, all output types are supported, but only the discrete run type is supported as of now.
A multi_sql model can only be a table. A single_sql model supports all output types.
For SQL Template models, ephemeral is the default output type and discrete is the default run type.
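For instance, a single_sql model that should persist as a view instead of the default ephemeral output might declare (a minimal sketch):
materialization:
  output_type: view # table, view, or ephemeral; defaults to ephemeral for SQL models
  run_type: discrete # the only run type currently supported for SQL models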
Writing Sql Templates
SQL templates can be written following the syntax of the pongo2 template engine.
The SQL Template model provides the flexibility to write custom SQL and create outputs using the listed inputs. You can refer to any of the input models listed in inputs.yaml, or to any of the models created in profiles.yaml.
While referencing models, circular references must be avoided, e.g., sql_model_a references sql_model_b and sql_model_b references sql_model_a.
The syntax {{ this.DeRef("path/to/model") }} can be used to reference any model.
Variables available in the sql_template:
this: the current model's material, which exposes a set of functionality to the template writer.
this.DeRef("reference/to/model"): lets you refer to a particular input model inside the template. The complete set of methods available on a material (this) is listed in the section below.
timestamp: the timestamp at which the current run was started. May be used to filter new events, as in the sketch below.
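For example, a template might restrict an event-stream input to rows that occurred up to the start of the run. This is a sketch only; it assumes timestamp renders as a literal the warehouse can compare against the time-filtering column:
{%- with tbl = this.DeRef("inputs/tbl_a") -%}
select * from {{tbl}}
where {{tbl.GetTimeFilteringColSql()}} <= '{{timestamp}}'
{%- endwith -%}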
Material Properties Accessible in SQL Template Models
DeRef("path/to/model")
:
Here, we refer to an input model through this.DeRef. It returns the DB Object corresponding to tbl_a. The DB object for any model is gives the actual name of the table/view in the warehouse. On the output, any of methods listed here is applicable. For example:
{% with input_table = this.DeRef("inputs/tbl_a") %}
select a as new_a, b as new_b, {{input_table}}.*
from {{input_table}}
{% endwith %}
GetMaterialization():
GetMaterialization returns a struct with two fields: MaterializationSpec{OutputType, RunType}. It can be used as in the two examples below. OutputType should always be used with the ToSql() method.
Examples:
CREATE OR REPLACE {{this.GetMaterialization().OutputType.ToSql()}} {{this.GetSelectTargetSql()}} AS ...
this.GetMaterialization().RunType
Other Helper Methods:
GetTimeFilteringColSql():
It gets the timestamp column of any material, which helps in filtering rows based on timestamp. It may return the occurred_at_col in the case of an event-stream table, or valid_at in case the material has that column. In the absence of both, it returns an empty string.
The occurred_at column exists only when the model is an event stream.
Example:
SELECT * FROM {{ <from_material> }}
WHERE {{ <from_material>.GetTimeFilteringColSql() }} > <some_timestamp>;
Models enabling themselves
In the context of a dependency graph of models, an option has been introduced to set the required enable_status
for each model using YAML.
This allows us to specify the necessary materialization for the models, ensuring that only the required models are executed.
Models that are deemed unnecessary can be excluded from the execution process.
Use Case:
Consider a scenario where an IdStitcher model (ids) depends on tbl_a and tbl_b, and a feature table (ft1) depends on the IdStitcher (ids) and an input model tbl_a.
If the ids model is marked as NotNeeded and ft1 is also not required, there is no need to execute either of them. However, if ft1 is marked as GoodToHave or MustHave, we must run ids as well, resulting in the execution of all the models.
If tbl_a is disabled, the creation of ids becomes impossible, and consequently ft1 will not run either. If ft1 or ids is marked as MustHave, the project errors out at load time saying the configuration is not possible, so the run won't start in such a case.
How it Works:
To implement the required materialization status for each model, the YAML writer can include an option specifying the desired status. The enable_status field under materialization can have the following options:
MustHave: indicates that the run should fail if it is not possible to execute the model.
GoodToHave: if execution is not possible for this model, it will not cause a failure.
NotNeeded: marks the model as not enabled if it does not fall within the dependency graph.
Disabled: the model is explicitly disabled and will not be executed.
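For example, to fail the run whenever a feature table cannot be built, you might declare the following (a sketch; the model name and feature-table model type are illustrative, and the camelCase value follows the example project file below):
models:
  - name: ft1
    model_type: feature_table_model
    model_spec:
      materialization:
        enable_status: mustHave # fail the run if this model cannot be executed
      ... # other model specifications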
The default value for the enable_status field is GoodToHave. However, specific default values are set for certain cases: the default_id_stitcher model is set to NotNeeded by default, which ensures that if the model is not required, it will not be executed.
Example project file
name: app_project
schema_version: 49
connection: test
include_untimed: true
model_folders:
  - models
entities:
  - name: user
    id_types:
      - user_id
      - anonymous_id
    default_id_stitcher:
      validity_time: 24h # 1 day
      materialization:
        run_type: incremental
        enable_status: goodToHave
      incremental_timedelta: 12h # half a day
id_types:
  - name: user_id
    filters:
      - type: include
        regex: "([0-9a-z])*"
      - type: exclude
        value: ""
  - name: anonymous_id
Model Contracts
Model contracts allow us to declare constraints that a model should adhere to. A model that depends on another model needs to declare a contract specifying the columns and entity IDs the input model must have; for contract validation to pass, those columns must be present in the referenced model.
For an input of a project, e.g. a library project, the model contract is used to enforce constraints on the tables/views that get wired to it downstream.
# inputs.yaml
- name: rsIdentifies
  contract:
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: timestamp
      - name: user_id
      - name: anonymous_id
      - name: email
In a SQL model, the contract contains all the columns from IDs and features.
Each internal model also publishes the contract it promises to adhere to. Suppose rsSessionTable has an input shopify_session_features. Model contracts enable rsSessionTable to specify the constraints that shopify_session_features must adhere to.
models:
  - name: rsSessionTable
    model_type: sql_template
    model_spec:
      ... # model specifications
      single_sql: |
        {% set contract = BuildContract('{"with_columns":[{"name":"user_id"}, {"name":"anonymous_id"}]}') %}
        {% with SessionFeature = this.DeRef("models/shopify_session_features", contract) %}
          select user_id as id1, anonymous_id as id2 from {{SessionFeature}}
        {% endwith %}
      contract:
        with_entity_ids:
          - user
        with_columns:
          - name: id1
            type: string
          - name: id2
            type: string
Here, rsSessionTable is declaring that its input shopify_session_features must have the columns user_id and anonymous_id. This helps improve data quality and error handling.
Internally, this requested contract is validated against shopify_session_features's actual contract. For validation to pass, the requested contract must be a subset of shopify_session_features's published contract.
This enables more comprehensive static and dynamic validations of our projects.
Best Schema Version (tags)
We recommend using git tags instead of the latest commit on the main branch of library projects; that is, https://github.com/org-name/lib-name/tag/schema_{{best_schema_version}}.
PB selects the compatible git tag, i.e., it figures out the best compatible version for the lib package.
A sample project file:
packages:
  - name: shopify_features
    url: https://github.com/org-name/lib-names/tag/schema_{{best_schema_version}}
    inputsMap:
      rsCartUpdate: inputs/rsCartUpdate
      rsIdentifies: inputs/rsIdentifies
Using this makes the tool use the best compatible version of the library project in case of any schema updates.
Please note: you don't have to replace the placeholder {{best_schema_version}}. For instance, if https://github.com/org-name/lib-names/tags/ has a tag for schema_44, then https://github.com/org-name/lib-names/tag/schema_44 will be used automatically. That said, if you do replace the placeholder with an actual tag name, the project will still work.
Id Collator Model
ID stitching is one of the most important features of profiles. Determining which accounts belong to the same customer is essential for a 360-degree view of customers.
However, in many cases we may not require ID stitching for a particular entity, for instance when there are no edges in the entity's ID graph. If we want to build a feature table on such an entity, we still have to run ID stitching; while not wrong, this is computationally redundant.
The idea of the IdCollator is to gather all IDs of that particular entity from the various input tables and create one collated list of IDs.
Example
Let's take a case where we have defined two entities in our project: user and session.
If the user entity has multiple ID types defined, there can be edges between IDs, which makes using an ID stitcher logical. On the other hand, session may have only one ID type, ssn_id, so there is no possibility of edges. In such a case, all we need is a complete list of ssn_ids.
Here are the corresponding entities and inputs definitions.
entities:
  - name: user
    id_column_name: user_rud_id
    id_types:
      - user_id
      - anonymous_id
  - name: session
    id_column_name: session_id
    id_types:
      - ssn_id
Inputs file:
inputs:
  - name: user_accounts
    table: tbl_user_accounts
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
  - name: sign_in
    table: tbl_sign_in
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "ssn_id"
        type: ssn_id
        entity: session
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
  - name: sign_up
    table: tbl_sign_up
    occurred_at_col: insert_ts
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "ssn_id"
        type: ssn_id
        entity: session
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
Here, the session entity has only one ID type. Creating an IdStitcher for such an entity is possible but unnecessary overkill. Using all the models having ssn_id, we can simply take the union of all ssn_id values, keep the distinct values, and be certain that this is the final list of sessions, without having to stitch anything.
The underlying SQL may look like:
SELECT ssn_id as session_id from sign_in
UNION DISTINCT
SELECT ssn_id as session_id from sign_up;
YAML Changes
A YAML writer cannot define a custom IdCollator the way they can define a custom IdStitcher. If an entity has no edges, the pb project automatically figures out whether an IdCollator is needed.
To exclude certain inputs (having the required ID) from being used in the collation, simply set to_id_stitcher: false in the input, as sketched below.
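A sketch of excluding an input's ssn_id from collation; placing the flag on the ID entry is an assumption based on the model-input reference later on this page:
- name: sign_in
  table: tbl_sign_in
  occurred_at_col: insert_ts
  ids:
    - select: "ssn_id"
      type: ssn_id
      entity: session
      to_id_stitcher: false # exclude this ID from the session collation
The corresponding entity definition remains minimal: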
entities:
  - name: session
    id_column_name: session_id
    id_types:
      - ssn_id
id_column_name is a new field in the entity definition; it sets the name of the ID column and applies to both the id-stitcher and the id-collator. One thing to note is that the IdCollator does not generate a UUID the way the IdStitcher does.
| ID Stitcher | ID Collator |
|---|---|
| Uses edges to converge the ID graph. | Collates all distinct IDs, as there is only one ID type and no edges are present. |
| Higher cost of computation. | Lower cost of computation. |
| A UUID is generated and used as the unique identifier for the entity. | Collates the existing IDs only. |
| The generated ID is always of the type: | The ID column of the generated idCollated table/view will be of the idType of the corresponding ID. |
| User may override the default IdStitcher with a custom one. | Can't override the default IdCollator, though the user may define a custom IdStitcher to override the default IdCollator. |
Associating SSH Key to Git Project
Follow the steps below to add a public SSH key to your Git project:
1. Open your profile's Git repository in a web browser and click the "Settings" tab.
2. Click "SSH Keys".
3. Give any name, say "Sample Profiles Key", and paste the key generated by the RudderStack web app OR the public key generated using the CLI.
4. You need not check "Allow write access". Click "Add Key".
Note
RudderStack only gets read access to the Git project to which the generated key was added; it does not have access to any other project. Moreover, we store all secrets securely.
Warning
On the web app, some legacy Profiles have issues in case an SSH key wasn't generated for them. We recommend creating a new Profile in such a case.
Using private Git repos via CLI
This can be done using the following steps:
Step 1: Generating SSH key:
Refer to this.
Step 2: Associating SSH Key to Git Project:
For this, refer to Associating SSH Key to Git Project above.
Step 3: Adding private keys as credentials:
Add the following in siteconfig.yaml:
gitcreds:
  - reporegex: git@<provider-host>:<org-name>/*
    key: |
      -----BEGIN OPENSSH PRIVATE KEY-----
      b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEb..........
      -----END OPENSSH PRIVATE KEY-----
Custom ID Stitcher
You can define a custom ID stitcher in certain cases. To do so, define the ID stitching model in models/profiles.yaml.
models:
  - name: sample_id_stitcher # Name of the view generated in the warehouse by the ID stitching run.
    model_type: id_stitcher # Must be as in this example. Two types of models are supported as of now: one for ID stitching and another for feature tables.
    model_spec:
      validity_time: 24h # 1 day. Specifies how long a model run is considered valid for, with respect to its timestamp.
      entity_key: user
      materialization:
        run_type: incremental # Materialization run type (discrete or incremental). If not specified, defaults to discrete.
      incremental_timedelta: 12h # Buffer time window to fetch new data from the checkpoint time of the previous model run, to ensure no data loss. For example, if the first run read data till timestamp T, the next run fetches new data from timestamp `T - incremental_timedelta`, which is 12 hours in this case. Note: applicable for materialization run_type: incremental.
      main_id_type: main_id # How you want to name the stitched canonical ID. Do not add this key unless it's explicitly required, e.g., if you want your ID stitcher table's main_id column to be called `main_id`.
      edge_sources: # Data sources which hold the various identifiers, so that the mapping can be used.
        - inputs/rsIdentifies
        - inputs/rsTracks
Now that the sample YAML files above have briefly shown how an ID stitching model can be configured, let's define each field in detail. This should help you write your own model considering multiple data sources for identifiers.
Entities
Here we specify the entities for which models can be defined (type: object).

| Property | Type | Description |
|---|---|---|
| name | string | A friendly name to be given to the entity, e.g.: user, company, prospect, etc. |
| id_types | list | This defines the list of identifiers for the entity. These may have a many-to-1 relationship to an entity (a user may have many session_ids over time), but each ID must belong to a single entity. For example, a user entity might have ID types for Salesforce ID, RudderStack anonymous ID, Email and Session ID. It should not include something like IP Address though, as a single IP could have been used by different users at different times, so it is not an identifier for a user. |
| … | string | Optional, in case you want to apply any (same as … |
Note
id_types is not the same as a database column type, such as char or int.
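For reference, a minimal entity definition covering the properties above (the ID types are illustrative):
entities:
  - name: user
    id_types:
      - salesforce_id
      - anonymous_id
      - email
      - session_id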
Model Input
In each object under the inputs key, we specify the input tables on which the project depends (type: object).

| Property | Type | Description |
|---|---|---|
| name | string | A friendly name to be given to the input model. |
| contract | list | A contract carries information about the model, such as the columns it should have and the entity IDs in it. This is crucial for other models that depend on it, as it helps find errors early and closer to their point of origin. |
| app_defaults | list | Values the input defaults to when the project is run directly. For library projects, inputs can be remapped and app defaults overridden when the library project is imported. |

Properties of contract:

| Property | Type | Description |
|---|---|---|
| is_optional | boolean | Whether the model's existence in the warehouse is mandatory. |
| is_event_stream | boolean | Whether the table/view is a series/stream of events. A model is an event stream model if and only if it has a timestamp column. |
| with_entity_ids | list | List of all entities with which this model is related. A model M1 is considered related to model M2 if there is an ID of model M2 among M1's output columns. |
| with_columns | list | List of all columns that the model should have. |

Properties of app_defaults:

| Property | Type | Description |
|---|---|---|
| table | string | The table in the warehouse that holds the data. The table may be prefixed with an external schema or database in the same warehouse, if applicable. One of table/view/csv must be set for each input source. |
| view | string | The view in the warehouse that holds the data. The view may be prefixed with an external schema or database in the same warehouse, if applicable. One of table/view/csv must be set for each input source. |
| csv | string | The CSV file on your local storage that holds the data. The file path should be relative to the project folder. One of table/view/csv must be set for each input source. |
| occurred_at_col | string | The timestamp column. If a model is run for a particular timestamp t1, it uses this column to filter out data after t1. Optional, but strongly recommended. |
| ids | list | Some of the input columns may contain IDs of associated entities; by their presence, such ID columns associate the row with the entity of the ID. The ID stitcher may use these ID declarations to automatically discover ID-to-ID edges. Here we specify the list of all IDs from the source table that are present, along with their column names (or column SQL expressions). |

Properties of ids:

| Property | Type | Description |
|---|---|---|
| select | string | Use a column name to use the value in that column as the ID. An SQL expression may also be specified, if some transformation is required. |
| type | string | The ID type to use. All the ID types of a project are declared in pb_project.yaml and can specify additional filters on the column expression. Each ID type is linked to a single entity and determines the entity associated. |
| entity | string | The entity, as defined in the project file, to which the ID belongs. |
| to_id_stitcher | boolean | Optional, it needs to be set to … By default it is set to … In your project file, if you remove the key and create a material view of the name … |
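Putting these together, an input definition might look like the following sketch (the table name and the lower(email) expression are illustrative):
inputs:
  - name: rsIdentifies
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
    app_defaults:
      table: rudder_events.identifies # table prefixed with an external schema (assumed name)
      occurred_at_col: timestamp
      ids:
        - select: "user_id"
          type: user_id
          entity: user
        - select: "lower(email)" # an SQL expression instead of a plain column name
          type: email
          entity: user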
Warning
In a model, an input can’t use columns named “MAIN_ID”, “OTHER_ID”, “OTHER_ID_TYPE”, or “VALID_AT” in its ID SQL.
Model Definition
Give the specifications for ID stitching (type: object).

| Property | Type | Description |
|---|---|---|
| name | string | Name of the ID-stitched table created in the warehouse. Say, if you define this as … |
| model_type | string | Defines the type of the model. For ID stitching, set it to id_stitcher. |
| model_spec | object | Specifications for creating the model. |

Properties of model_spec:

| Property | Type | Description |
|---|---|---|
| validity_time | time | Specifies how long a model run is considered valid for, with respect to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h can serve requests which do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC. |
| entity_key | string | Specify the relevant entity from your project file. For example, here it should be set to user. |
| main_id_type | ProjectRef | Optional; an ID type reserved for the output of the ID stitcher. It must not be used in any of the inputs and must be listed as an ID type for the entity being stitched. If you do not set it, the default is used. Often it is set to main_id, e.g., if you want your ID stitcher table's main_id column to be called main_id. |
| materialization | list | Add the key run_type: incremental to enable incremental runs. This mode considers inserts and any updated rows from the edge-source inputs, inferred by checking the timestamp column for the next run. You can provide a buffer time to account for any lag of data in the warehouse for the next incremental run, e.g., if new rows get added while it's running. If you do not specify this key, it defaults to discrete. |
| incremental_timedelta | list | If you set run_type: incremental, this specifies how far back prior to the previous material data should be fetched for a model (to handle data lag, for example). Optional; if not set, it defaults to 4 days. |
| edge_sources | list | In the inputs YAML file, we specify the different sources of input tables/views that are used to create the ID stitcher view. Here we specify a list of the names of all such inputs. |
Warning
An incremental run ignores deleted rows from edge sources. To consider those rows, you'll have to run in full refresh mode again, i.e., run_type: discrete.
Partial feature tables
If a library package is imported and some of the input models assumed in the package are not present in the warehouse, you can still run the project. When you remap some of the input models to nil, those inputs get disabled, and the features directly or indirectly dependent on them get disabled correspondingly, so a partial feature table is still created from whatever inputs are available. Similarly, the IdStitcher also runs if a few of the edge sources are not present in the warehouse and are remapped to nil.
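For example, an imported package might disable an absent input by remapping it to nil (a sketch reusing the packages example from earlier):
packages:
  - name: shopify_features
    url: https://github.com/org-name/lib-names/tag/schema_{{best_schema_version}}
    inputsMap:
      rsCartUpdate: nil # input absent in the warehouse; features depending on it get disabled
      rsIdentifies: inputs/rsIdentifies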
If a library package is imported and some of the input models assumed in the package are not present in the warehouse, still you can run the project. What happens is when you remap some of the input models to nil, those inputs get disabled and correspondingly the features directly or indirectly dependent upon them get disabled. So a partial feature table will still be created from whatever inputs are available. Similarly, IdStitcher also runs if few of the edge sources are not present in the warehouse and remapped to nil.