Project Layout
Folder Structure
README.md
Contains a quick overview of how to use the project, along with SQL queries for data analysis.
pb_project.yaml
The project YAML file contains the project name, schema version, connection, and entities with their ID types.
Here is an example of the yaml file:
# Project name
name: sample_attribution
# Project's yaml schema version
schema_version: 49
# WH Connection to use.
connection: test
# Whether to allow inputs having no timestamps;
# if true, data without timestamps are included when running models.
include_untimed: true
# Model folders to use.
model_folders:
  - models
# Entities in this project and their IDs.
entities:
  - name: user
    # Uncomment the following to set a custom ID stitcher (optional).
    # id_stitcher: models/user_id_stitcher
    id_types:
      - main_id
      - user_id
      - anonymous_id
      - email
    # Entity Traits 360, to get all features/traits of an entity into a single view (optional).
    serve_traits:
      - id_served: user_id
      - id_served: anonymous_id
      - id_served: email
      - id_served: user_main_id
# Lib packages can be imported into a project, signifying that this project inherits their properties.
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/profiles-corelib/tag/schema_{{best_schema_version}}"
    # If required, you can extend the package definition, e.g. for ID types.
# Profiles can also use certain model types defined in Python.
# Examples include ML models. Those dependencies are specified here.
python_requirements:
  - profiles-pycorelib==0.1.0
Let’s look at each field to know more about it.
name
The name of the project.
Type: string
schema_version
The project's YAML schema version. Each new schema version comes with improvements and added functionality.
Type: int
connection
Name of the connection from the siteconfig.yaml file to be used for connecting to the warehouse. For more details, refer to siteconfig.yaml.
Type: string
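For reference, a connection entry in siteconfig.yaml has roughly the following shape. This is a hedged sketch for a Snowflake connection; the exact output fields depend on your warehouse, and the placeholder values are illustrative — see the siteconfig.yaml docs for the authoritative format:
connections:
  - name: test
    target: dev
    outputs:
      dev:
        type: snowflake
        account: <account>
        dbname: <database>
        schema: <schema>
        user: <user>
        password: <password>
        role: <role>
        warehouse: <warehouse>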
include_untimed
Whether to allow inputs having no timestamps. If true, data without timestamps is included when running models.
Type: boolean
model_folders
Name(s) of the folder(s) where model files are stored.
Type: list
entities
Lists all entities used in the project. An entity, in DBMS terms, is a real-world object that is distinguishable from other such objects. Each entity has at least one ID. Each entry here is an object specifying the entity's name and attributes.
Type: list
Properties:
- name: Name of the entity used in the project. Type: string.
- id_stitcher: (Optional) modelRef of the ID stitcher linked to the entity. Define it to use a custom ID stitcher, and prefix it with the relative path of the models folder, e.g. models/user_id_stitcher. To customize the ID stitcher, refer to Custom ID Stitcher. Type: string.
- id_types: An entity may be associated with multiple ID types; list all ID types associated with the entity here. Type: list.
- serve_traits: (Optional) List of all the IDs being served. This is for Entity Traits 360 models, and is also required for the Activation API to work. Type: list.
packages
Library packages can be imported into a project, signifying that the project inherits their properties.
Type: list
Properties:
- name: A friendly name for the package. Type: string.
- url: HTTPS URL of the lib package, with a tag for the best schema version. Type: string.
serve_traits
Please refer to Entity Traits 360 Models.
id_types
Each entry is an object with the ID type's name and attributes. Optionally, you can also define filters for including or excluding certain values.
Type: list
Properties:
- name: Name of the ID type. Type: string.
- extends: (Optional) Used to extend the ID type specifications in a package. Type: string.
- filters: Filters to include or exclude certain values of an ID type; whenever the current ID type appears, it may need to be filtered or transformed. The filters section contains the type, value, and regex expression of the filter (an example follows this list). Type: list. Constraint: only one of the regex or value fields may be set at a time.
  Filter properties:
  - type: Valid filter types are 'include' and 'exclude'. Include filters accept the IDs on match; exclude filters reject the IDs on match. Type: string.
  - value: A value to match. Usually used to reject certain invalid ID values like 'NaN', 'unknown', 'test@domain.com', etc. Type: string.
  - regex: Regex to match the values against. May be used to either accept or reject matching IDs, depending on the requirement. Type: regex string.
  - sql: SQL statement with keys referencing other models, for example: inputs/my_csv_input, models/my_sql_filters_model, etc. Type: sql list.
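For example, filters on an ID type might look like this (assembled from the filter values shown later on this page, under Custom list of ID types):
id_types:
  - name: email
    filters:
      - type: include
        regex: "[A-Za-z0-9+_.-]+@(.+)"
      - type: exclude
        value: "test@domain.com"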
Inputs
Inputs act as a mechanism for a project to access pre-existing or externally sourced data, potentially originating from outside the current project. The project can define specific constraints on inputs, outlining these criteria within the contract section. When a project is imported, its inputs can be mapped to any model already existing within the importer project, as long as it matches the input contract.
The source of the input data can be specified by using the exact name in the table/view key. The exact source of data can also be an S3 bucket (specified using s3 key) or a local CSV file (specified using csv key).
In addition to the source, we can also declaratively specify the “layout” of the data available in the data source. This layout is in the form of a specified set of valid column expressions and what they represent. For example, column expressions for linked entity IDs, timestamp column, etc. can be specified.
inputs:
  - name: salesforceTasks
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: activitydate
        - name: whoid
    app_defaults:
      table: salesforce.task
      occurred_at_col: activitydate
      ids:
        # column name or sql expression
        - select: "whoid"
          type: salesforce_id
          entity: user
          to_default_stitcher: true
  - name: salesforceContact
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: createddate
        - name: id
        - name: email
    app_defaults:
      table: salesforce.contact
      occurred_at_col: createddate
      ids:
        - select: "id"
          type: salesforce_id
          entity: user
          to_default_stitcher: true
        - select: "case when lower(email) like any ('%gmail%', '%yahoo%') then lower(email) else split_part(lower(email),'@',2) end"
          type: email
          entity: user
          to_default_stitcher: true
  - name: websitePageVisits
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: anonymous_id
        - name: context_traits_email
        - name: user_id
    app_defaults:
      table: autotrack.pages
      occurred_at_col: timestamp
      ids:
        - select: "anonymous_id"
          type: rudder_anon_id
          entity: user
          to_default_stitcher: true
        # The SQL expression below checks the email: for Gmail and Yahoo addresses it returns the full email, otherwise it splits the email and returns only its domain.
        - select: "case when lower(coalesce(context_traits_email, user_id)) like any ('%gmail%', '%yahoo%') then lower(coalesce(context_traits_email, user_id)) \
            else split_part(lower(coalesce(context_traits_email, user_id)),'@',2) end"
          type: email
          entity: user
          to_default_stitcher: true
In the inputs file, we specify all the input tables on which the project depends. An input specification may also include metadata, such as constraints on the columns.
Let's look at the schema of input YAML files to know more about them.
In each object under the inputs key, we specify an input table on which the project depends.
Type: object
Properties:
- name: A friendly name to be given to the input model. Type: string.
- contract: A contract carries information about the model, such as the columns it should have and the entity IDs in it. This is crucial for other models that depend on it, as it helps catch errors early, close to their point of origin. Type: list.
  Contract properties:
  - is_optional: Whether the model's presence in the warehouse is optional. Type: boolean.
  - is_event_stream: Whether the table/view is a series/stream of events. A model is an event stream model if and only if it has a timestamp column. Type: boolean.
  - with_entity_ids: List of all entities with which this model is related. A model M1 is considered related to model M2 if there is an ID of model M2 among M1's output columns. Type: list.
  - with_columns: List of all columns that the model should have. Type: list.
- app_defaults: Values the input defaults to when the project is run directly. For library projects, inputs can be remapped and app_defaults overridden when the library project is imported. Type: list.
  app_defaults properties:
  - table: The warehouse table that holds the data. The table may be prefixed with an external schema or database in the same warehouse, if applicable. One of table/view/s3/csv must be set for each input source. Type: string.
  - view: The warehouse view that holds the data. The view may be prefixed with an external schema or database in the same warehouse, if applicable. One of table/view/s3/csv must be set for each input source. Type: string.
  - s3: The CSV file in your S3 bucket that holds the data. The file path should be an S3 URI. One of table/view/s3/csv must be set for each input source. Type: string.
  - csv: The CSV file on your local storage that holds the data. The file path should be relative to the project folder. One of table/view/s3/csv must be set for each input source. Type: string.
  - occurred_at_col: The timestamp column. If a model is run for a particular timestamp t1, it uses this column to filter out data after t1. Optional, but strongly recommended. Type: string.
  - ids: Some input columns may contain IDs of associated entities; by their presence, such ID columns associate the row with the entity of the ID. The ID stitcher may use these ID declarations to automatically discover ID-to-ID edges. Here we list all the IDs present in the source table, along with their column names (or column SQL expressions). Type: list.
    ID properties:
    - select: The column name whose value is used as the ID. An SQL expression may also be specified, if some transformation is required. Type: string.
    - type: The ID type to use. All the ID types of a project are declared in pb_project.yaml and can specify additional filters on the column expression. Each ID type is linked to a single entity and determines the associated entity. Type: string.
    - entity: The entity, as defined in the project file, to which the ID belongs. Type: string.
    - to_default_stitcher: Optional. Set it to true for the ID to be picked up by the default ID stitcher. Type: boolean.
Note
You can also refer to a table from another database/schema in the same warehouse. For example: table: <database_name>.<schema_name>.<table_name>
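Besides a warehouse table, app_defaults can point at a view, an S3 file, or a local CSV, as described above. A minimal sketch, with placeholder names and paths:
inputs:
  - name: warehouseViewInput
    app_defaults:
      view: YOUR_DB.YOUR_SCHEMA.MY_VIEW   # a view instead of a table
  - name: s3Input
    app_defaults:
      s3: "s3://my-bucket/path/file.csv"  # S3 URI of a CSV file
  - name: localCsvInput
    app_defaults:
      csv: "data/sample.csv"              # path relative to the project folder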
Models
A model describes the steps towards creating an output table and the input sources whose data is used during those steps. A model file in YAML describes one or more models. The different kinds of models that can be created are:
ID Stitcher - Data comes from different sources, and different sources may assign different IDs. To track a user's journey (or any other entity) uniquely across all these data sources, we need to stitch all these IDs together. ID stitching maps different IDs of the same user (or any other entity) to a single canonical ID. It does this by performing connected-component analysis over the ID-to-ID edge graph specified in its configuration. Click here for syntax.
ID Collator - The ID Collator is similar to the ID Stitcher. It is used when an entity has only a single associated ID type (e.g. session IDs). In such cases, connected-component analysis is not required, and we use a simpler model type called ID Collator. It consolidates all entity IDs from multiple input tables into a single collated list.
Entity Vars - These are the various attributes related to the entity whose profile you're trying to create (Name, City, LastVisitTimestamp, etc.). Together, all the attributes can create a complete picture of the entity. An attribute is a single value derived by performing a calculation or aggregation on a set of values. By default, every entity-var gets stored as a feature, such as days active, last seen, etc.
Input Vars - An input-var is similar to an entity-var. However, instead of a single value per entity ID, it represents a single value per row of an input model. Think of it as adding an extra column to an input model. It can be used to define entity features; however, it is not itself an entity feature, because it doesn't represent a single value per entity ID.
Custom Models (Python) - You can build custom Python models for ML by downloading pre-defined Python templates. The results are usually saved as attributes of related entities (e.g. churnProbability).
Entity Traits 360 - The features/traits of an entity may be spread across multiple entity vars and ML models. Entity Traits 360 is a means to bring them together into a single view. These models are usually defined in the pb_project.yaml file, by creating entries under the serve_traits key with the corresponding entity.
SQL Template Models - Sometimes the standard model types provided by Profiles are insufficient to capture complex use cases. In those cases, we support SQL template models, which require the user to write explicit templatized SQL. SQL template models can be used as an input to an entity-var/input-var, or as an edge source in an ID stitcher.
Feature Table (Legacy) - Entity vars specified here are unified into a view. Click here for syntax.
Note
Even if you do not define any specs for creating an ID stitcher, one is still created: Profiles creates a "default ID stitcher" that takes as input all the sources and ID types defined in the inputs.yaml file. When you define specs, it creates a "custom ID stitcher".
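For reference, a custom ID stitcher spec generally has the shape sketched below. This is a sketch only: the model name matches the id_stitcher comment in the project file above, and the edge sources reuse this project's inputs; see the Identity Stitching section for the authoritative schema:
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h
      entity_key: user
      edge_sources:
        - from: inputs/salesforceTasks
        - from: inputs/salesforceContact
        - from: inputs/websitePageVisits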
The following YAML describes a group of vars named vars_list. It also has two models: user_profile is a feature table model, while user_python_model is a custom model created using Python.
var_groups:
  name: vars_list
  entity_key: user # This is the entity name defined in the project file. If we change it there, we need to change it here too.
  vars:
    - entity_var:
        name: is_mql
        select: max(case when salesForceLeadsTable.mql__c = 'True' then 1 else 0 end)
        from: inputs/salesForceLeadsTable
        description: Whether a domain is mql or not
    - entity_var:
        name: blacklistFlag
        select: max(case when exclude_reason is not null then 1 else 0 end)
        from: inputs/blacklistDomains
        where: (context_sources_job_run_id = (select top 1 context_sources_job_run_id from blacklistDomains order by timestamp desc))
        is_feature: false
    - entity_var:
        name: domainSummary_account_type
        select: domainSummary.account_type
        from: inputs/domainSummary
        is_feature: false
    - entity_var:
        name: ignore_domain
        select: case when {{user.Var("blacklistFlag")}} = 1 or {{user.Var("domainSummary_account_type")}} like '%free%' then 1 else 0 end
        description: Whether a domain should be ignored for the analysis
    - entity_var:
        name: salesEvents
        select: json_agg(activitydate, case when (type='Email' or tasksubtype = 'Email') then case when lower(subject) like '%[in]%' then 'sf_inbound_email' \
          else 'sf_outbound_email' end when macro(call_conversion) then 'sf_call' else null end as event)
        from: inputs/salesforceTasks
        description: Salesforce touches are converted to one of following events - sf_inbound_email, sf_outbound_email, sf_call, null
        is_feature: false
    - entity_var:
        name: page_events_var
        select: json_agg(timestamp, case when split_part(trim(context_page_path, '/'), '/', 1) in var('valid_page_paths') then split_part(trim(context_page_path, '/'), '/', 1) \
          else var('default_page_path') end as event)
        from: inputs/websitePageVisits
        where: timestamp < sales_conversion_timestamp and timestamp > var('start_date')
        is_feature: false
    - entity_var:
        name: page_events
        select: '{{user.Var("page_events_var")}}'
    - entity_var:
        name: webhookFormSubmit
        select: min(timestamp)
        from: inputs/webhookSource
        where: variable_1 is null and timestamp < sales_conversion_timestamp and timestamp > var('start_date')
        is_feature: false
    - entity_var:
        name: formSubmit
        select: min({{user.Var("webhookFormSubmit")}}, websiteFormSubmit)
models:
  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      features:
        - is_mql
        - ignore_domain
        - sales_conversion_timestamp
        - page_events
        - formSubmit
  - name: user_python_model
    model_type: py_sql_model
    model_spec:
      occurred_at_col: insert_ts
      validity_time: 24h
      train_config:
        prop1: "prop1"
        prop2: "prop2"
Let's look at the schema of model YAML files to know more about them.
var_groups
Type: object
Properties:
- name: Name of the var_group, used to uniquely identify it. Type: string.
- entity_key: Specify the entity to be used. Type: string.
- vars: This section specifies variables, with the help of entity_var and input_var (see the input_var sketch below). Aggregation on the stitched ID type is done by default and is implicit. Type: list.
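An input_var accepts fields similar to an entity_var (name, select, from, where), but its value is computed per row of the input model rather than per entity. A hypothetical sketch, reusing the websitePageVisits input from the example above:
vars:
  - input_var:
      name: is_gmail_visit                # hypothetical row-level flag
      select: case when lower(context_traits_email) like '%gmail%' then 1 else 0 end
      from: inputs/websitePageVisits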
models
In each object under the models key, we specify the complete recipe to create a model required by the project.
Type: object
Properties:
- name: A friendly name to be given to the model. Type: string.
- model_type: Determines the broad category of the model to be constructed. Type: string.
- model_spec: A detailed configuration spec for the target model. A different schema applies to the model_spec of each model type; check out the sections on Identity Stitching and Entity Traits 360 for the detailed schemas of supported models. Type: object.
Entity Traits 360 Models
Once traits are defined on an entity, we need a means to use them. A primary application is sending them to downstream destinations, either via the Activation API or via any of the rETL destinations that RudderStack supports. Each such destination requires data in the form of a table with an ID column and one or more feature columns. This is possible using Entity Traits 360.
An Entity Traits 360 model provides a way to access entity features using any of the given id_types, including the entity main ID, as the identifier column. It creates a view containing all (or a specified set of) features on that entity from across the project.
To configure the creation of a specific set of traits-360 models, add a serve_traits section to the entity:
- Provide a list of id_types as id_served.
- Optionally, you can also give a name, which specifies the name of the generated model. If you don't specify a name, the model is created with a default name.
- By default, all available features on the entity are added to the view. If you want finer control, you can include or exclude features from any models by defining a custom entity-traits-360 model and adding a reference to it in the serve_traits section, like model_served: models/name_of_custom_traits_360_model.
Examples of both are given below.
Default entity-traits-360 model:
pb_project.yaml:
...
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        name: user_id_stitched_features
# This adds an entity-traits-360 model with user_id as the identifier, named user_id_stitched_features.
# It will contain all the available features.
Custom entity-traits-360 model:
This is an example of a custom entity-traits-360 model, where we include/exclude features from models of our choice.
pb_project.yaml:
...
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        model_served: models/cart_entity_traits_360
models/profiles.yaml:
models:
  - name: cart_entity_traits_360
    model_type: entity_traits_360
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      id_served: user_id
      feature_list:
        - from: packages/pkg/models/cart_table # a table created by a package
          include: ["*"] # include all the traits
        - from: models/user_var_table
          include: ["*"]
          exclude: [cart_quantity, purchase_status] # all traits except these two will be included
        - from: models/sql_model
          include: [lifetime_value] # include only one trait
Python Models
You can use Python to implement a new model type yourself. Check out this library for some officially supported model types with their Python source. Note that this is an experimental feature, so the implementation and interfaces can change significantly in upcoming versions.
Python package requirements can be specified in the project, akin to how they're listed in Python's requirements.txt file.
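The Python package dependencies themselves go under the python_requirements key in pb_project.yaml, as in the project file example earlier:
python_requirements:
  - profiles-pycorelib==0.1.0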
The code snippet below shows how model configuration, such as training properties, can be specified in the model spec:
---profiles.yaml---
models:
  - name: test_py_native_model
    model_type: py_sql_model
    model_spec:
      occurred_at_col: insert_ts
      validity_time: 24h
      train_config:
        prop1: "prop1"
        prop2: "prop2"
Macros (optional)
Optionally, you can create reusable functions for performing computations and use them across multiple models. Examples of macros are shown below.
macros:
  - name: subtract_range
    inputs:
      - first_date
      - second_date
    value: "{{first_date}} - {{second_date}}"
  - name: multiplyBy10_add
    inputs:
      - first_number
      - second_number
    value: "{{first_number}} * 10 + {{second_number}}"
Macros encapsulate complex processing logic directly within SQL so that it can be reused across models; you can think of them as functions.
Type: list
Properties:
- name: Name of the macro, used to uniquely identify and reference it. Type: string.
- inputs: If the logic you are encapsulating works on input variables, define them here. Type: list.
- value: The actual logic; this must be a SQL expression. Any reference to the inputs within it should be surrounded by double curly brackets. Type: string.
Packages
Customers have the flexibility to use models from existing library projects, while also being able to define their own models and inputs within the PB project.
This approach allows seamless integration of a library of pre-existing features, which are readily available and can be applied directly to data streamed into the warehouse.
In the absence of any explicitly defined models, the PB project can compile and run models from the library package, provided the inputs are present in the warehouse as assumed by the lib package.
Generally, there will be deviations in the database and schema names of the input models, which can be handled easily by remapping inputs.
A sample pb_project.yaml file may appear as:
name: app_project
schema_version: 49
connection: test
packages:
  - name: test_ft
    url: "https://github.com/rudderlabs/librs360-shopify-features/tree/main"
In this case, the PB project imports a single package. It does not require a separate “models” folder or entities, as the input and output models will be sourced from the imported packages.
If non-mandatory inputs required by the model are not present in the warehouse, you can still run the model. If there is a deviation in the table/view name for input models, i.e., if the inputs assumed in the library package are present under some other name, remapping should be done. If some of the assumed inputs are not present at all, they should be remapped to nil. This way, you can run imported packages with a minimal set of inputs present.
For example, here we import a library package with the name shopify_features.
packages:
  - name: shopify_features
    url: https://github.com/rudderlabs/librs360-shopify-features/tree/main
    inputsMap:
      rsCartCreate: inputs/rsWarehouseCartCreate
      rsCartUpdate: inputs/rsCartUpdate
      rsIdentifies: inputs/rsIdentifies
      rsOrderCancelled: inputs/rsOrderCancelled
      rsOrderCreated: inputs/rsOrderCreated
      rsPages: nil
      rsTracks: nil
In models/inputs.yaml, these inputs need to be defined with table names present in the warehouse.
inputs:
  - name: rsWarehouseCartCreate
    table: YOUR_DB.YOUR_SCHEMA.CART_CREATE_TABLE_NAME_IN_YOUR_WH
    occurred_at_col: timestamp
    ids:
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
    source_metadata:
      role: shopify
      category: webhook
  - name: rsIdentifies
    table: YOUR_DB.YOUR_SCHEMA.IDENTIFIES
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
      - select: "lower(email)"
        type: email
        entity: user
    source_metadata:
      role: shopify
      category: webhook
...
Note that the name of the table/view is changed to what is present in your warehouse.
If tables are present with the same name (including db name and schema name) then no remapping is required.
Available Packages
Following is a list of the packages that are currently available:
profiles-shopify-features
profiles-ecommerce-features
profiles-stripe-features
profiles-multieventstream-features
Contact RudderStack for access to these packages.
Note
Packages currently work on Snowflake.
Modify ID types
Extend existing package
In case you wish to add custom ID types to the default list, or modify an existing one, you may extend the package to include your specifications.
For the corresponding id_type, add the key extends: followed by the name of the same/different id_type that you wish to extend, and the corresponding filters with include/exclude values.
Below is an example:
---pb_project.yaml---
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/rudderstack-profiles-corelib/tag/schema_{{best_schema_version}}"
id_types:
  - name: user_id
    extends: user_id
    filters:
      - type: exclude
        value: 123456
  - name: customer_id
    extends: user_id
    filters:
      - type: include
        regex: sample
id_types - Lists the types of data used for creating the ID stitcher / EntityVar / InputVar, such as anonymous IDs that must not include the value "undefined", or email addresses in a proper format.
extends - Name of the ID type that you wish to extend.
name - The type of data to be fetched, such as email, user id, etc. This is distinct from the datatype of the table column, such as int or varchar.
filters - Filter(s) that the type should pass through before being included. Filters are processed in order. The current filters let you include or exclude specific values or regex matches.
Custom list of ID types
If you wish to have a custom list of ID types, other than what's provided in the default package, you may remove it and add your own list as below:
entities:
  - name: user
    id_types:
      - user_id
      - anonymous_id
      - email
id_types:
  - name: user_id
  - name: anonymous_id
    filters:
      - type: exclude
        value: ""
      - type: exclude
        value: "unknown"
      - type: exclude
        value: "NaN"
  - name: email
    filters:
      - type: include
        regex: "[A-Za-z0-9+_.-]+@(.+)"
Make sure that the ID types are also defined in the entity definition, as above.
Supported Git URLs
Profiles supports Git URLs for packages and for scheduling via the UI. The repos can be hosted at:
GitHub
GitLab
Bitbucket
Please contact us if your preferred host isn't included, and we will test it and get back to you.
For private repos, we only support SSH Git URLs. You need to add credentials to siteconfig.yaml, and the public SSH key needs to be manually added to the Git platform.
The steps to add credentials are given here: Using private Git repos via CLI.
The URL scheme doesn't depend on the individual Git provider host. There are four kinds of Git URLs that can be used.
1. URL for the default branch of a repository
Syntax:
https://<provider-host>/<org-name>/<repo-name>/path/to/project
Example:
https://github.com/rudderlabs/librs360-shopify-features/shopify-features
https://gitlab.com/rudderlabs/librs360-shopify-features/shopify-features
https://bitbucket.org/rudderlabs/librs360-shopify-features/shopify-features
For private repos, we support ssh urls:
Syntax:
git@<provider-host>:<org-name>/<repo-name>/path/to/project
Example:
git@github.com:rudderlabs/librs360-shopify-features/shopify-features
git@gitlab.com:rudderlabs/librs360-shopify-features/shopify-features
git@bitbucket.org:rudderlabs/librs360-shopify-features/shopify-features
2. URL for any specific branch of a repository
Syntax:
https://<provider-host>/<org-name>/<repo-name>/tree/<branch-name>/path/to/project
Example:
https://github.com/rudderlabs/librs360-shopify-features/tree/main/shopify-features
https://gitlab.com/rudderlabs/librs360-shopify-features/tree/main/shopify-features
https://bitbucket.org/rudderlabs/librs360-shopify-features/tree/main/shopify-features
3. URL for any specific tag within the repository
Syntax:
https://<provider-host>/<org-name>/<repo-name>/tag/<tag-name>/path/to/project
Example:
https://github.com/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features
https://gitlab.com/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features
https://bitbucket.org/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features
4. URL for any specific commit within the repository
Syntax:
https://<provider-host>/<org-name>/<repo-name>/commit/<commit-hash>/path/to/project
Example:
https://github.com/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features
https://gitlab.com/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features
https://bitbucket.org/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features
Warning
In the web app, we support Git SSH URLs with the following patterns only:
git@<provider-host>:<org-name>/<repo-name>/tree/<branch-name>
git@<provider-host>:<org-name>/<repo-name>
git@<provider-host>:<org-name>/<repo-name>/tree/main/path/to/project
Any subfolder in the Git project is supported, without the .git extension.
Access Control
Access control privileges determine which users can access which schemas/materials/registries and run which models in a warehouse. Since access control is directly tied to the data available in the warehouse, each warehouse has its own access control mechanism, and the model used (DAC/RBAC) also depends on the warehouse.
Snowflake
Snowflake uses a combination of the DAC and RBAC models for access control. Because multiple users can launch the PB CLI, we have gone with the RBAC-based access control mechanism. Apart from this, we do not want the result of an individual user's run to be tied to that user, so we suggest creating a role (say PROFILES_ROLE) with the privileges explained in this document.
The PROFILES_ROLE role should have read access to all the inputs to the model; if multiple schemas/tables are involved, the same role can be shared. A run creates some materials, so PROFILES_ROLE must also have write access to those schemas and to the common tables, e.g. the registry.
If these privileges are not assigned, the run will fail. If you want to access any material created during the run, or as a result of it, PROFILES_ROLE should have read access to all of those schemas.
Sample commands:
-- Create role
CREATE ROLE PROFILES_ROLE;
SHOW ROLES; -- To validate
-- Create user
CREATE USER PROFILES_TEST_USER PASSWORD='AlwaysKeepStrongPassword' DEFAULT_ROLE='PROFILES_ROLE';
SHOW USERS; -- To validate
-- Grant role to user and database
GRANT ROLE PROFILES_ROLE TO USER PROFILES_TEST_USER;
GRANT USAGE ON DATABASE YOUR_RUDDERSTACK_DB TO ROLE PROFILES_ROLE;
-- Create separate schema for Profiles and grant privileges to role
CREATE SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES;
GRANT ALL PRIVILEGES ON SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO ROLE PROFILES_ROLE;
GRANT USAGE ON WAREHOUSE RUDDER_WAREHOUSE TO ROLE PROFILES_ROLE;
GRANT USAGE ON SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO ROLE PROFILES_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
-- You can individually grant select on tables/views, or give blanket grant to all in schema.
GRANT SELECT ON ALL TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
Redshift
When working with Redshift, the required privileges differ slightly from Snowflake. This guide helps you understand which privileges are required and how to grant them.
Say our inputs/edge sources are all in one schema (the process can be repeated if they span multiple schemas) named website_eventstream, and the customer has created a new user for PB named rudderstack_admin.
Requirements:
- A separate schema rs_profiles (to store all the common and output tables).
- The rudderstack_admin user should have all privileges on this schema and on its tables.
- The rudderstack_admin user should have the USAGE privilege on the schemas containing edge sources (input tables), and read (SELECT) privileges on the specific tables as well. This means website_eventstream for now, extending later to the migration schema and any other schemas where data from warehouses comes in.
- The rudderstack_admin user should have privileges to use plpythonu, to create some UDFs.
Sample commands:
CREATE USER rudderstack_admin WITH PASSWORD '<strong_unique_password>';
CREATE SCHEMA rs_profiles;
GRANT ALL ON SCHEMA "rs_profiles" TO rudderstack_admin;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA "rs_profiles" TO rudderstack_admin;
GRANT USAGE ON SCHEMA "website_eventstream" TO rudderstack_admin;
GRANT SELECT ON ALL TABLES IN SCHEMA "website_eventstream" TO rudderstack_admin;
GRANT USAGE ON LANGUAGE plpythonu TO rudderstack_admin;
In case you want to give access to only specific input tables/views referred in your Profiles project, use below command:
GRANT SELECT ON TABLE "<YOUR_SCHEMA>"."<YOUR_TABLE>" TO rudderstack_admin;
Databricks
Open the Databricks UI. For this purpose, we create a new user rudderstack_admin.
You can either reuse an existing catalog or create a new one by clicking on "Create Catalog". The USE CATALOG privilege is required on the catalog.
We recommend creating a separate schema, RS_PROFILES, into which Profiles can write the objects it creates; all privileges are required on this schema.
SELECT privileges are required on the relevant schemas for the input tables, and on the input tables themselves. Suppose one input table is input_table in the schema public.
Note: if the input tables are in a different catalog, the USE CATALOG privilege is required on that catalog as well.
We create Python UDFs in our code, so ensure that they are supported. By default, Databricks supports Python UDFs.
Sample Commands
GRANT USE CATALOG ON CATALOG <catalog name> TO rudderstack_admin;
GRANT USE SCHEMA ON CATALOG <catalog name> TO rudderstack_admin;
CREATE SCHEMA RS_PROFILES;
GRANT ALL PRIVILEGES ON SCHEMA RS_PROFILES TO rudderstack_admin;
GRANT SELECT ON SCHEMA public TO rudderstack_admin;
GRANT SELECT ON TABLE public.input_table TO rudderstack_admin;
Note
pb validate access checks whether the specified role can read from all the input tables and can write to the schema that stores the output material tables. If the privileges assigned to the role are sufficient, the project can be executed.
Please see the CLI Reference for more information.
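For example, you can run the check from the CLI before executing a full run:
$ pb validate access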
Timestamp of PB Model Inputs
Whenever RudderStack Events or Cloud Extract loads data into a warehouse, it records a timestamp in a column specifying the date and time the data was loaded.
If you wish to execute time-bound Profiles models, you can add the begin_time and end_time parameters to the command. For example, to run Profiles models on roughly the last 6 months of data:
$ pb run --begin_time 2023-01-02T12:00:00.0Z
Say you want to run Profiles models on data between 2 May 2022 and 30 April 2023:
$ pb run --begin_time 2022-05-01T12:00:00.0Z --end_time 2023-04-30T12:00:00.0Z
If you are running incrementally and wish to run Profiles models irrespective of timestamp, you can add the --rebase-incremental flag:
$ pb run --rebase-incremental
Warning
Setting include_untimed: false raises an error if any inputs for the model cannot support timestamp filtering and the --begin_time and --end_time flags were specified.