Project Layout

Folder Structure

HelloPbProject

├── README.md

├── pb_project.yaml

└── models

├── inputs.yaml
├── profiles.yaml
└── macros.yaml (optional)

README.md

Contains quick overview of how to use along with SQL queries for data analysis.

pb_project.yaml

The project yaml file contains name, schema version, profile and entities name along with id types.

Here is an example of the yaml file:

# Project name
name: sample_attribution

# Project's yaml schema version
schema_version: 49

# WH Connection to use.
connection: test

# Whether to allow inputs having no timestamps,
# if true, data without timestamps are included when running models
include_untimed: true

# Model folders to use.
model_folders:
  - models

# Entities in this project and their ids.
entities:
  - name: user
    # Change the following to set a custom ID stitcher(optional).
    # id_stitcher: models/user_id_stitcher
    id_types:
      - main_id
      - user_id
      - anonymous_id
      - email
    # Entity traits 360, to get all features/traits of an entity into a single view (optional)
    serve_traits:
      - id_served: user_id
      - id_served: anonymous_id
      - id_served: email
      - id_served: user_main_id

# lib packages can be imported in project signifying that this project inherits its properties from there
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/profiles-corelib/tag/schema_{{best_schema_version}}"
    # if required then you can extend the package definition such as for ID types.

# Profiles can also use certain model types defined in Python.
# Examples include ML models. Those dependencies are specified here.
python_requirements:
  - profiles-pycorelib==0.1.0

Let’s look at each field to know more about it.

name

It is the name of the project that you want to give.
type	string

schema_version

It is the project’s yaml version. Each new version of schema comes with improvements and added functionalities.
type	int

connection

Name of the connection from siteconfig.yaml file to be used for connecting to warehouse.

For more details, refer siteconfig.yaml.

type

string

include_untimed

Whether to allow inputs having no timestamps.

If true, data without timestamps are included when running models.

type

boolean

model_folders

Enter name of folder(s) where model files are stored.
type	string

entities

Lists all entities used in the project. An entity in DBMS is a real-world thing or a real-world object which is distinguishable from other such objects. Each entity has atleast one ID. Each entry for an entity here is a JSON object specifying entity’s name and attributes.
type: list
properties
name	Name of the entity used in the project
	type	string
id_stitcher	(Optional) modelRef of the ID stitcher, linked to the entity. Defined in case of custom ID stitcher. Prefix it with the relative path of models folder, e.g. `models/name_of_id_stitcher` To customize the ID stitcher, please refer Custom ID Stitcher.
	type	string
id_types	An entity may be associated with multiple ID types. Here we list all ID types associated with the current entity.
	type	list
serve_traits	(optional) List of all the ID’s being served. This is for Entity Traits 360 Models. Also, it’s required for Activation API to work.
	type	list

packages

Library packages can be imported in a project, signifying that this project inherits its properties from there.
type: list
properties
name	Specify a friendly name.
	type	string
url	https url of the lib package, with tag for best schema version.
	type	string

serve_traits

Please refer Entity Traits 360 Models.

id_types

Each entry is an object with ID type’s name and attributes. Optionally, you can also define filters for including or excluding certain values.
type: list
properties
name	Name of the ID type.
	type	string
extends	Optional, it is used to extend ID types specifications in a package.
	type	string
filters	We use filters to exclude or include some values from id types. Whenever the current Id type appears, it may need to be filtered or transformed. filters section contains type, value, and regex expression of the filter.
	type: list constraint: At a time, only one of regex or value field should be set.
	properties
	type	Valid filter types are ‘include’ and ‘exclude’. Include filters accept the IDs on match. Exclude reject the IDs on match.
		type	string
	value	A value to match. Usually used to reject certain invalid ID values like ‘NaN’, ‘unknown’, ‘test@domain.com’ etc.
		type	string
	regex	Regex to match the values with. May be used to both accept or reject matching IDs depending upon requirement.
		type	regex string
	sql	SQL statement with keys for `select:` and `from:` `select` has ID expression as SQL. `from:` specify path reference of the model to be used. For example: inputs/my_csv_input, models/my_sql_filters_model, etc.
		type	sql list

Inputs

A input file outlines how to obtain values for models from various tables. We specify the table/view along with column name and SQL expression for retrieving values from various tables/views/.:

inputs:
  - name: salesforceTasks
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: activitydate
        - name: whoid
    app_defaults:
      table: salesforce.task
      occurred_at_col: activitydate
      ids:
        # column name or sql expression
        - select: "whoid"
          type: salesforce_id
          entity: user
          to_default_stitcher: true
  - name: salesforceContact
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: createddate
        - name: id
        - name: email
    app_defaults:
      table: salesforce.contact
      occurred_at_col: createddate
      ids:
        - select: "id"
          type: salesforce_id
          entity: user
          to_default_stitcher: true
        - select: "case when lower(email) like any ('%gmail%', '%yahoo%') then lower(email)  else split_part(lower(email),'@',2) end"
          type: email
          entity: user
          to_default_stitcher: true
  - name: websitePageVisits
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: anonymous_id
        - name: context_traits_email
        - name: user_id
    app_defaults:
      table: autotrack.pages
      occurred_at_col: timestamp
      ids:
        - select: "anonymous_id"
          type: rudder_anon_id
          entity: user
          to_default_stitcher: true
        # below sql expression check the email type, if it is gmail and yahoo return email otherwise spilt email return domain of email.
        - select: "case when lower(coalesce(context_traits_email, user_id)) like any ('%gmail%', '%yahoo%') then lower(coalesce(context_traits_email, user_id))  \
              else split_part(lower(coalesce(context_traits_email, user_id)),'@',2) end"
          type: email
          entity: user
          to_default_stitcher: true

In the inputs file we specify all the input tables on which the project depends. Input specification may also include metadata, including constraints on those columns.

Let’s look at schema of input YAML files to know more about them.

In each object under the inputs key, we specify the input tables on which the project depends.
type	object
properties
name	A friendly name to be given to the input model.
	type	string
contract	A contract has information about that model such as the columns it should have and the entity ID’s in it. This is crucial for other models that depend on it, as it helps find errors early and closer to the point of their origin.
	type: list
	properties
	is_optional	Whether the model’s existence in the warehouse is mandatory..
		type	boolean
	is_event_stream	Whether the table/view is a series/stream of events or not. or not. A model is an event stream model if and only if it has a timestamp column.
		type	boolean
	with_entity_ids	List of all entities with which this model is related. A model M1 is considered related to model M2 if there is an ID of model M2 among M1’s output columns.
		type	list
	with_columns	List of all columns that the model should have.
		type	list
app_defaults	Values the input defaults to, when the project is run directly. For library projects, inputs can be remapped and appdefaults overridden, when library projects are imported.
	type: list
	properties
	table	The table in warehouse that holds the data. The table may be prefixed with external schema or database in the same warehouse, if applicable. Either table/view/csv must be set, for each input source.
		type	string
	view	The view in warehouse that holds the data. The view may be prefixed with external schema or database in the same warehouse, if applicable. Either table/view/csv must be set, for each input source.
		type	string
	csv	The csv file on your local storage that holds the data. The file path should be relative to the project folder. Either table/view/csv must be set, for each input source.
		type	string
	occurred_at_col	The timestamp col. If a model is run for a particular timestamp t1, it will use this column to filter the data to exclude data after t1. I.e., `... WHERE <timestamp_col> < t1`. Optional, but strongly recommended.
		type	string
	ids	Some of the input columns may contain IDs of associated entities. By their presence, such ID columns associate the row with the entity of the ID. The Id Stitcher may use these ID declarations to automatically discover ID to ID Edges. Here, we specify list of all IDs from the source table that are present, along with their column names(or column SQL expressions).
		type: list
		properties
		select	Use column name to use the value in the column as ID. An SQL expression may also be specified, if some transformation is required.
			type	string
		type	The ID type to use. All the ID types of a project are declared in pb_project.yaml and can specify additional filters on the column expression. Each ID type is linked only with a single entity and determines the entity associated.
			type	string
		entity	The entity, as defined in project file, to which the ID belongs to.
			type	string
		to_default_stitcher	Optional, it needs to be set to `true` for an ID to be picked in the default ID stitcher. By default it is set to `false`, without impacting if the project is using custom ID stitcher. In your project file, if you remove the key `id_stitcher: models/<name of ID stitcher model>`, then it’ll use the default ID stitcher and create a material view of the name `<entity_name>_default_id_stitcher`.
			type	boolean

Note

You can also refer table from another Database/Schema in the same DW. For example, table: <database_name>.<schema_name>.<table_name>.

Models

A model describes the steps towards creating an output table and the input sources whose data is used during those steps. A model file in YAML describes one or more models. The different kinds of models that can be created are:

ID Stitcher - Data comes from different sources and different sources may assign different IDs. To be able to track a user’s journey (or any other entity) uniquely across all these data sources, we need to stitch together all these IDs. Id stitching helps map different IDs of the same user (or any other entity) to a single canonical ID. It does this by doing connected component analysis over the Id-Id edge graph specified in its configuration. Click here for syntax.
ID Collator - ID Collator is similar to ID Stitcher. It is used when entity has only a single ID type associated (e.g. session IDs). In these cases, connected component analysis is not required and we use a simpler model type called ID Collator. It consolidates all entity IDs from multiple input tables into a single collated list.
Entity Vars - These are various attirbutes, related to the entity whose profile you’re trying to create (Name, City, LastVisitTimestamp, etc). Together, all the attributes can create a complete picture of the entity. Attribute is a single value derived by performing calculation or aggregation on a set of values. By default, every entity-var gets stored as a feature, such as days active, last seen, etc.
Input Vars - An input-var is similar to an entity-var. However, instead of a single value per entity ID, it represents a single value per row of an input model. Think of it as representing addition of an additional column to an input model. It can be used to define entity features. However, it is not itself an entity feature because it doesn’t represent a single value per entity ID.
Custom Models (Python) - One can build custom Python models for ML by downloading pre-defined Python templates. The results are usually saved as attributes of related entities (e.g. churnProbability).
Entity Traits 360 - Consider that features/traits of an entity are spread across multiple entity vars and ML models. Entity Traits 360 is a means to get them together into a single view. These models are usually defined in pb_project.yaml file, by creating entries under serve_traits key with corrosponding entity.
SQL Template Models - Sometimes the standard model types provided by Profiles are insufficient to capture complex use cases. In those cases, we support the use of SQL template models. These models require user to explicit templatised SQL. SQL template models can be used as an input to an entity-var/ input-var or as an edge-source in id-stitcher.
Feature Table (Legacy) - Entity Vars specified here are unified into a view. Click here for syntax.

Note

If you do not define any specs for creating ID stitcher, then also it will be created. Profiles will create “default ID stitcher” which will take as input, all the sources and ID types defined in the file inputs.yaml. When you define specs, as below, it creates a “custom ID stitcher”.

The following YAML describes a group of vars named vars_list. Also, it has two models: user_profile is a feature table model, while user_python_model is a custom model created using Python.

var_groups:
  name: vars_list
  entity_key: user # This is the name defined in project file. If we change that, we need to change the name here too.
  vars:
    - entity_var:
        name: is_mql
        select: max(case when salesForceLeadsTable.mql__c == 'True' then 1 else 0 end)
        from: inputs/salesForceLeadsTable
        description: Whether a domain is mql or not
    - entity_var:
        name: blacklistFlag
        select: max(case when exclude_reason is not null then 1 else 0 end)
        from: inputs/blacklistDomains
        where: (context_sources_job_run_id = (select top 1 context_sources_job_run_id from blacklistDomains order by timestamp desc))
        is_feature: false
    - entity_var:
        name: domainSummary_account_type
        select: domainSummary.account_type
        from: inputs/domainSummary
        is_feature: false
    - entity_var:
        name: ignore_domain
        select: case when {{user.Var("blacklistFlag")}} = 1 or {{user.Var("domainSummary_account_type")}} like '%free%' then 1 else 0 end
        description: Whether a domain should be ignored for the analysis
    - entity_var:
        name: salesEvents
        select: json_agg(activitydate, case when (type='Email' or tasksubtype = 'Email') then case when lower(subject) like '%[in]%' then 'sf_inbound_email' \
              else 'sf_outbound_email' end when macro(call_conversion) then 'sf_call' else null end as event)
        from: inputs/salesforceTasks
        description: Salesforce touches are converted to one of following events - sf_inbound_email, sf_outbound_email, sf_call, null
        is_feature: false
    - entity_var:
        name: page_events_var
        select: json_agg(timestamp, case when split_part(trim(context_page_path, '/'), '/', 1) in var('valid_page_paths') then split_part(trim(context_page_path, '/'), '/', 1) \
              else var('default_page_path') end as event)
        from: inputs/websitePageVisits
        where: timestamp < sales_conversion_timestamp and timestamp > var('start_date')
        is_feature: false
    - entity_var:
        name: page_events
        select: {{user.Var("page_events_var")}}
    - entity_var:
        name: webhookFormSubmit
        select:  min(timestamp)
        from: inputs/webhookSource
        where: variable_1 is null and timestamp < sales_conversion_timestamp and timestamp > var('start_date')
        is_feature: false
    - entity_var:
        name: formSubmit
        select: min({{user.Var("webhookFormSubmit")}}, websiteFormSubmit)
models:
  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      features:
        - is_mql
        - ignore_domain
        - sales_conversion_timestamp
        - page_events
        - formSubmit

  - name: user_python_model
    model_type: py_sql_model
    model_spec:
      occurred_at_col: insert_ts
      validity_time: 24h
      train_config:
        prop1: "prop1"
        prop2: "prop2"

Let’s look at schema of model YAML files to know more about them.

var_groups

type	object
properties
name	Name of var_group, that will be used to uniquely identify it.
	type	string
entity_key	Specify the entity to be used.
	type	time
vars	This section is used to specify variables, with the help of entity_var and input_var. Aggregation on stitched ID type is done by default and is implicit.
	type	list

models

In each object under the models key, we specify the complete recipe to create the model required by the project.
type	object
properties
name	A friendly name to be given to the input model.
	type	string
model_type	Type determines the broad category of the model to be constructed.
	type	string
model_spec	In the model spec, one can create a detailed configuration spec for the target model. Different schema is applicable for model_spec of different models. Check out the sections on Identity Stitching and Entity Traits 360 for detailed schema of supported models.
	type	object

Entity Traits 360 Models

Once traits are defined on an entity, we need means to use them. A primary application is for them to be sent to downstream destinations. The destination could either be the activation API or via any of the rETL destinations that RudderStack supports. Each such destination requires data in the form of a table with an ID column and 1 or more feature columns. This is possible using EntityTraits360.

EntityTraits360 model provides a way to access entity features based on any of the given id_types including the entity main id as the identifier column. It creates a view which will have all (or a specified set of) features on that entity from across the project.

To configure creation of a specific set of traits360 models, add serve_traits section in the entity. You need to provide a list of id_types as id_served. Optionally, you can also give a name which specifies the name of the generated model. If you don’t specify a name, it will create the model with a default name. By default, it will add all available features on the entity into the view.

In case you want a finer control, you can also include or exclude any features from any models by defining a custom entity-traits-360 model and add the reference to that the serve_traits section like model_served: models/name_of_custom_traits_360_model.

Example of both are added below.

Default entity-traits-360 model:

pb_project.yaml:

...
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        name: user_id_stitched_features
        # This will add an entity-traits-360 model with user_id as the identifier with model name user_id_stitched_features.
        # It will contain all the available features.

Custom entity-traits-360 model:

This is an example of custom entity-traits-360 model. Here we are including / excluding features from models of choice.

pb_project.yaml:

...
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        model_served: models/cart_entity_traits_360

models/profiles.yaml:

models:
  - name: cart_entity_traits_360
    model_type: entity_traits_360
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      id_served: user_id
      feature_list:
        - from: packages/pkg/models/cart_table # a table created by package
          include: ["*"] # will include all the traits
        - from: models/user_var_table
          include: ["*"]
          exclude: [cart_quantity, purchase_status] # except two, all the other traits will be included
        - from: models/sql_model
          include: [lifetime_value] # will include only one trait

Python Models

You can use Python to implement a new model type yourself. Check out this library for some officially supported model types with their Python source. Note that this is an experimental feature, so the implementation and interfaces can change significantly in the upcoming versions.

The requirements can be specified in model specs, akin to how they’re done in Python’s requirements.txt file. Below code snippet shows how the requirements such as for training and config can be specified in the project:

---profiles.yaml---
models:
  - name: test_py_native_model
    model_type: py_sql_model
    model_spec:
      occurred_at_col: insert_ts
      validity_time: 24h
      train_config:
        prop1: "prop1"
        prop2: "prop2"

Macros (optional)

Optional, you can create reusable functions for performing computatons and use in multiple models. Example of macros are shown below.

macros:
  - name: subtract_range
    inputs:
      - first_date
      - second_date
    value: "{{first_date}} - {{second_date}}"
  - name: multiplyBy10_add
    inputs:
      - first_number
      - second_number
    value: "{{first_number}} * 10 + {{second_number}}"

Here you can specify the macros that can be used later in `features` section as helpers. Macros can be used to encapsulate complex processing logic directly within SQL that can be reused. You can consider these as functions.
type: list
properties
name	Name of macro, this will be used to uniquely identify and reference this macro in `features` section.
	type	string
inputs	If the logic you are trying to encapsulate works on some input variable, then you need to define those here.
	type	list
value	Actual logic, this must be a SQL expression. Any reference to the inputs within it should be surrounded by double curly brackets.
	type	string

Packages

Customers have the flexibility to utilize models from existing library projects, while also having the ability to define their own models and inputs within the PB project.

This approach allows for the seamless integration of library of pre-existing features, which are readily available and can be applied directly to data streamed into the warehouse.

In the absence of any explicitly defined models, the PB project is capable of compiling and running models from the library package given that inputs are present in the warehouse as assumed in the lib package.

Generally there will be deviations in terms of db name and schema name of input models which can be handled very easily by remapping inputs.

A sample pb_project.yaml file may appear as:

name: app_project
schema_version: 49
profile: test

packages:
  - name: test_ft
    gitUrl: "https://github.com/rudderlabs/librs360-shopify-features/tree/main"

In this case, the PB project imports a single package. It does not require a separate “models” folder or entities, as the input and output models will be sourced from the imported packages.

If non-mandatory inputs required by the model are not present in the warehouse, you can still run the model. If there is a deviation in the table/view name for input models, ie: if the inputs assumed in library package are present under some other name, remapping should be done. If some of the assumed inputs are not present at all, they should be remapped to nil. This way you can create and run imported packages with minimal set of inputs present.

For example, here we are importing a library package with the name of shopify_features.

packages:
  - name: shopify_features
    url: https://github.com/rudderlabs/librs360-shopify-features/tree/main
    inputsMap:
      rsCartCreate: inputs/rsWarehouseCartCreate
      rsCartUpdate: inputs/rsCartUpdate
      rsIdentifies: inputs/rsIdentifies
      rsOrderCancelled: inputs/rsOrderCancelled
      rsOrderCreated: inputs/rsOrderCreated
      rsPages: nil
      rsTracks: nil

In models/inputs.yaml, these inputs need to be defined with table names present in the warehouse.

inputs:
  - name: rsWarehouseCartCreate
    table: YOUR_DB.YOUR_SCHEMA.CART_CREATE_TABLE_NAME_IN_YOUR_WH
    occurred_at_col: timestamp
    ids:
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
    source_metadata:
      role: shopify
      category: webhook
  - name: rsIdentifies
    table: YOUR_DB.YOUR_SCHEMA.IDENTIFIES
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
      - select: "lower(email)"
        type: email
        entity: user
    source_metadata:
      role: shopify
      category: webhook
  ...

Note that the name of the table/view is changed to what is present in your warehouse.

If tables are present with the same name (including db name and schema name) then no remapping is required.

Available Packages

Following is list of packages that are currently available:

profiles-corelib
profiles-base-features
profiles-shopify-features
profiles-ecommerce-features
profiles-stripe-features
profiles-multieventstream-features

Contact RudderStack for access to these packages.

Note

Packages currently work on Snowflake.

Modify ID types

Extend existing package

In case you wish to add custom ID types to the default list or modify an existing one, then you may extend the package to include your specifications. For the corresponding id_type, add the key extends: followed by name of the same/different id_type that you wish to extend, and corresponding filters with include/exclude values. Below is an example of the same:

---|SampleProjectFile|.yaml---
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/rudderstack-profiles-corelib/tag/schema_{{best_schema_version}}"
id_types:
  - name: user_id
    extends: user_id
    filters:
      - type: exclude
        value: 123456
id_types:
  - name: customer_id
    extends: user_id
    filters:
      - type: include
        regex: sample

id_types - Enlist the type of data to be used for creating ID stitcher / EntityVar / InputVar. Such as anonymous id’s that do not include the value “undefined” or email addresses in proper format.
- extends - Name of the ID type that you wish to extend.
- name - The type of data that will be fetched, such as email, user id, etc. It is different from what’s in the table column such as int or varchar.
- filters - Filter(s) that the type should go through before being included. Filters are processed in order. Current filters enable one to include and exclude specific values or regexes.

Custom list of ID types

If you wish to have custom list of ID types other than what’s provided in default package, then you may remove it and add your list as below:

entities:
  - name: user
    id_types:
      - user_id
      - anonymous_id
      - email

id_types:
  - name: user_id
  - name: anonymous_id
    filters:
      - type: exclude
        value: ""
      - type: exclude
        value: "unknown"
      - type: exclude
        value: "NaN"
  - name: email
    filters:
    - type: include
      regex: "[A-Za-z0-9+_.-]+@(.+)"

Make sure that the ID types are also defined in the entity definition, as above.

Supported Git URLs

Profiles supports Git URLs for packages and scheduling via UI. The repos can be hosted at:

GitHub

GitLab

BitBucket

Please contact us if your preferred host isn’t included, and we will test and revert.

For private repos, we only support SSH Git URLs. One needs to add credentials to siteconfig.yaml. And, public ssh key needs to be manually added to the platforms. The steps to add credentials are given here: Using private Git repos via CLI.

The URL scheme doesn’t depend on individual Git provider host. There are four kinds of Git URLs that can be used.

1. URL for the default branch of a repository

Syntax:

https://<provider-host>/<org-name>/<repo-name>/path/to/project

Example:

https://github.com/rudderlabs/librs360-shopify-features/shopify-features https://gitlab.com/rudderlabs/librs360-shopify-features/shopify-features https://bitbucket.org/rudderlabs/librs360-shopify-features/shopify-features

For private repos, we support ssh urls:

Syntax:

git@<provider-host>:<org-name>/<repo-name>/path/to/project

Example:

git@github.com:rudderlabs/librs360-shopify-features/shopify-features git@gitlab.com:rudderlabs/librs360-shopify-features/shopify-features git@gbitbucket.org:rudderlabs/librs360-shopify-features/shopify-features

2. URL for any specific branch of a repository

Syntax:

https://<provider-host>/<org-name>/<repo-name>/tree/<branch-name>/path/to/project
Example:

https://github.com/rudderlabs/librs360-shopify-features/tree/main/shopify-features https://gitlab.com/rudderlabs/librs360-shopify-features/tree/main/shopify-features https://bitbucket.org/rudderlabs/librs360-shopify-features/tree/main/shopify-features

3. URL for any specific tag within the repository

Syntax:

https://<provider-host>/<org-name>/<repo-name>/tag/<tag-name>/path/to/project
Example:

https://github.com/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features https://gitlab.com/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features https://bitbucket.org/rudderlabs/librs360-shopify-features/tag/wht_test/shopify-features

4. URL for any specific commit within the repository

Syntax:

https://<provider-host>/<org-name>/<repo-name>/commit/<commit-hash>/path/to/project
Example:

https://github.com/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features https://gitlab.com/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features https://bitbucket.org/rudderlabs/librs360-shopify-features/commit/b8d49/shopify-features

Warning

In web-app, we are supporting git ssh url with following pattern only:

git@<provider-host>:<org-name>/<repo-name>/tree/<branch-name>
git@<provider-host>:<org-name>/<repo-name>
git@<provider-host>:<org-name>/<repo-name>/tree/main/path/to/project

We are supporting any subfolder in git project without .git extension.

Access Control

Access control privileges determine which user can access which schemas/materials/registries and run which models in a warehouse. As it is directly related to the data available in the warehouse, all the warehouses have their own access control mechanisms, and the model(DAC/RBAC) used for access control depends on the warehouse also.

Snowflake uses a combination of DAC and RBAC models for access control. Due to the use case of PB where multiple users can launch the PB CLI, we have gone with the RBAC based access control mechanism. Apart from this, we do not want the result of an Individual user run to be tied with that user. So we suggest creating a role (say PROFILES_ROLE) with the privileges explained in this document.

The role PROFILES_ROLE should have read access to all the inputs to the model, if multiple schemas/tables are involved then the same role can be shared. A run will create some materials so the role PROFILES_ROLE must have write access to those schemas and the common tables also e.g. Registry.

If these privileges are not assigned then the run will fail. If you want to access any material created during run or as a result of the run, the role PROFILES_ROLE should have read access to all of those schemas.

Sample commands:

-- Create role
CREATE ROLE PROFILES_ROLE;
SHOW ROLES; -- To validate

-- Create user
CREATE USER PROFILES_TEST_USER PASSWORD='AlwaysKeepStrongPassword' DEFAULT_ROLE='PROFILES_ROLE';
SHOW USERS; -- To validate

-- Grant role to user and database
GRANT ROLE PROFILES_ROLE TO USER PROFILES_TEST_USER;
GRANT USAGE ON DATABASE YOUR_RUDDERSTACK_DB TO ROLE PROFILES_ROLE;

-- Create separate schema for Profiles and grant privileges to role
CREATE SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES;
GRANT ALL PRIVILEGES ON SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO ROLE PROFILES_ROLE;
GRANT USAGE ON WAREHOUSE RUDDER_WAREHOUSE TO ROLE PROFILES_ROLE;
GRANT USAGE ON SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO ROLE PROFILES_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.RS_PROFILES TO PROFILES_ROLE;

-- You can individually grant select on tables/views, or give blanket grant to all in schema.
GRANT SELECT ON ALL TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA YOUR_RUDDERSTACK_DB.EVENTSSCHEMA TO PROFILES_ROLE;

When working with Redshift, privileges can be slightly different as compared to that of snowflake. This guide will help understanding the privileges required and how to get those.

Say our inputs/edge sources are only in one schema (though process can be repeated if they are in multiple schemas) and the name of schema is website_eventstream.
Customer have created a new user for PB and the name is rudderstack_admin.

Requirements:

We will need a separate schema rs_profiles (to store all the common and output tables).
rudderstack_admin user should have all the privileges on this schema and the tables of this schema.
rudderstack_admin user should have USAGE privilege on the schemas which have edge sources (input tables) and read (SELECT) privileges on the specific tables as well. This would mean website_eventstream for now and then extend to the migration schema and other schema where data from warehouses would come in.
rudderstack_admin user should have privileges to use plpythonu to create some UDFs.

Sample commands:

CREATE USER rudderstack_admin WITH PASSWORD '<strong_unique_password>';
CREATE SCHEMA rs_profiles;
GRANT ALL ON SCHEMA "rs_profiles" TO rudderstack_admin;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA "rs_profiles" TO rudderstack_admin;
GRANT USAGE ON SCHEMA "website_eventstream" TO rudderstack_admin;
GRANT SELECT ON ALL TABLES IN SCHEMA "website_eventstream" TO rudderstack_admin;
GRANT USAGE ON LANGUAGE plpythonu TO rudderstack_admin;

In case you want to give access to only specific input tables/views referred in your Profiles project, use below command:

GRANT SELECT ON TABLE "<YOUR_SCHEMA>"."<YOUR_TABLE>" TO rudderstack_admin;

Open the Databricks UI.

For this purpose, we create a new user rudderstack_admin.

You can either reuse an existing catalog or create a new one by clicking on “Create Catalog”. And require USE CATALOG privilege on the catalog.

We recommend creating a separate schema that will be able to write objects created by our software and require all privileges on this RS_PROFILES schema.

We require select privileges to access relevant schemas for input tables and require select privileges to read input tables. Suppose, one input table is input_table in schema public. Note: If the input tables are in a different catalog, USE CATALOG privilege is required on that catalog as well.

We create Python UDFs in our code, so it is recommended to ensure that they are supported. By default, Databricks supports Python UDFs.

Sample Commands

GRANT USE SCHEMA ON CATALOG <catalog name> TO rudderstack_admin;
CREATE SCHEMA RS_PROFILES;
GRANT ALL PRIVILEGES ON SCHEMA RS_PROFILES TO rudderstack_admin;
GRANT SELECT ON SCHEMA public TO rudderstack_admin;
GRANT SELECT ON TABLE public.input_table TO rudderstack_admin;

Note

pb validate access does a check of whether the specified role has access to read from all the input tables and can write to the schema for storing the output material tables. If the privileges assigned to the role are sufficient, then the project can be executed. Please see: CLI Reference for more information.

Timestamp of PB Model Inputs

Whenever RudderStack Events or Cloud Extract load data on a warehouse, then it mentions timestamp in a column specifying the date and time the data was loaded.

In case you wish to execute time-bound Profiles models, then you can add begin_time and end_time parameters to the command. For example you want to run Profiles models on last 6 months of data:

$ pb run --begin_time 2023-01-02T12:00:00.0Z

Say you want to run Profiles models for data between 2 May 2022 and 30 April 2023:

$ pb run --begin_time 2022-05-01T12:00:00.0Z --end_time 2023-04-30T12:00:00.0Z

In case you are running incrementally and wish to run Profiles models irrespective of timestamp, then you can add –rebase-incremental flag to the query.

$ pb run --rebase-incremental

Warning

Setting include_untimed: false raises error if any inputs for the model cannot support timestamp filtering and the --begin_time and --end_time flags were specified.