Quickstart

Installation

Execute this command in your macOS Terminal, Linux shell, or Windows PowerShell:

$ pip3 install profiles-rudderstack

As a prerequisite, you should have Python 3 installed and its environment variables set.

To verify that you’ve installed the latest version (0.10.6), execute this command after installation is complete:

$ pb version

If you face any issues, check out the Setup & Installation section.

Note

If you are an existing user, you will have to migrate your project to the new schema. Refer to the section Migrate your existing project.

Warning

To install PB, the Python version must be between 3.8 and 3.10.

Set WH Connection

Data warehouse details and credentials are specified in the connection file. You can create it using the command:

$ pb init connection

This will prompt you for connection details and then store them in the file siteconfig.yaml inside the .pb folder of your home directory (~/.pb), under the connections YAML node (a sample entry is shown after the list below).

First, select the warehouse you’re creating the connection for (‘s’ for Snowflake, ‘r’ for Redshift). Then, enter the following details:

  • Connection Name: Assign any name here, such as test, prodprofile, etc. It will be referenced in the PB project file, as mentioned below.

  • Target: This refers to the environment, such as dev, prod, or test. You can use any name for a target (e.g. dev1, dev2, dev3) and create separate connections for each.

  • User: Data warehouse user you will create to provide the necessary privileges.

  • Password: Data warehouse password for the user.

  • Role: Role you will have to create with the necessary privileges. This is the role that should be associated with the user.

  • Account: The account identifier, for example abx182.us-east-2. To learn more about account identifiers, click here.

  • Warehouse: Name of the warehouse.

  • Database name: Name of the database inside the warehouse where model outputs will be written.

  • Schema: Name of the schema inside the database where the ID stitcher and entity features will be stored.

To establish a successful connection, ensure that the information you enter is accurate.
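
As a rough, hedged sketch of what ends up in ~/.pb/siteconfig.yaml after the prompts (the exact keys can vary by warehouse type and pb version, and all values below are placeholders), a Snowflake entry may look like this:

connections:
  test:              # Connection Name entered above
    target: dev      # Target entered above
    outputs:
      dev:
        type: snowflake
        account: abx182.us-east-2
        user: PROFILES_USER
        password: <password>
        role: PROFILES_ROLE
        warehouse: PROFILES_WH
        dbname: PROFILES_DB
        schema: RS_PROFILES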

Note

To enable access, make sure you are an ACCOUNTADMIN, or have an account that has READ and WRITE privileges. For more information, refer to the section Access Control.

Create Your First Project

Instantiate the Hello World project using:

$ pb init pb-project

This will create a seed project in the folder HelloPbProject. Navigate to that folder in the terminal:

$ cd HelloPbProject

Now, edit the file pb_project.yaml. A sample project file is shown below:

# Project name

name: sample-attribution


# Project's yaml schema version.

schema_version: 49


# WH Connection to use.

connection: test

# Whether to allow inputs having no timestamps,
# if true, data without timestamps are included when running models

include_untimed: true


# Model folders to use.

model_folders:
  - models # Name of subfolder inside your project folder, that contains YAML model files


# Entities in this project and their ids.

entities:
  - name: user
    id_types:
      - user_id
      - anonymous_id
      - email
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/rudderstack-profiles-corelib/tag/schema_{{best_schema_version}}"

  • name: The name of your project.

  • schema_version: No need to change it right now (we’ll come to that later).

  • connection: The warehouse connection you are going to use (see the Set WH Connection section above).

  • include_untimed: Set it to true to include source data that does not have any timestamp values. If you set it to false, all rows with null values in occurred_at_col will be excluded.

  • model_folders: Relative path where your model files are stored. You can leave it unchanged.

  • entities: All the entities that will be used in this project. Different entities can reuse the same ID types.

    • name - Set it to user.

    • id_types - List the types of data to be used for creating the ID stitcher / entity features, such as anonymous IDs that do not include the value “undefined”, or email addresses in a valid format (see the sketch after this list).

  • packages - Library packages can be imported into the project, meaning the project inherits their properties.

    • name - Give any friendly name.

    • url - If required, you can extend the package definition, for example to customize ID types.
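
For illustration, here is a hedged sketch of how ID types with filters might be declared in pb_project.yaml; the regex and excluded value below are assumptions, so adjust them to your own data:

id_types:
  - name: user_id
  - name: email
    filters:
      - type: include
        regex: ".+@.+"         # keep only values that look like an email address (illustrative)
  - name: anonymous_id
    filters:
      - type: exclude
        value: "undefined"     # drop the literal string "undefined"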

Note

For debugging, you may refer to the section YAML.

Inputs Model

The file models/inputs.yaml contains details of the input sources (such as tables or views) from which data is fetched. Each input source is assigned a name, which can then be referenced when creating specifications (see the next file). Here’s an example of an inputs model file:

inputs:
  - name: rsIdentifies
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
        - name: email
    app_defaults:
      table: rudder_events_production.web.identifies
      occurred_at_col: timestamp
      ids:
        - select: "user_id"
          type: user_id
          entity: user
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true
        - select: "lower(email)"
          type: email
          entity: user
          to_default_stitcher: true
  - name: rsTracks
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
    app_defaults:
      table: rudder_events_production.web.tracks
      occurred_at_col: timestamp
      ids:
        - select: "user_id"
          type: user_id
          entity: user
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true

  • inputs - Lists all the input sources, along with how the data is fetched from them.

    • name - The name under which the fetched data will be referenced.

    • contract - A model contract provides essential information about the model, including the necessary columns and entity IDs it must contain. This is crucial for other models that depend on it, as it helps catch errors early, closer to their point of origin.

      • is_optional - This flag indicates whether the model’s existence in the warehouse is mandatory or not.

      • is_event_stream - This identifies whether the table or view represents a continuous series of events. A model is an event stream model if and only if it has a timestamp column.

      • with_entity_ids - The list of entities to which this model is related. A model is considered related to an entity if one of that entity’s IDs appears among the model’s output columns.

      • with_columns - A list of all the columns that the model should have.

    • app_defaults - Values the input defaults to when the project is run directly. For library projects, inputs can be remapped and app_defaults overridden when the library project is imported.

      • table - Source table in the warehouse that you are fetching the data from.

      • occurred_at_col - Timestamp column in the source table.

      • ids - List of all the IDs that will be fetched.

        • select - Column name or SQL expression that computes the ID value for each row in the table; these serve as inputs to ID stitching.

        • type - The ID type (IdEdgeSignal), i.e. fields in different tables that are of the same type, so they can be stitched together.

        • entity - Entity to which the ID belongs.

        • to_default_stitcher - Optional. It must be set to true explicitly for an ID to be picked up by the default ID stitcher (i.e., when the project file doesn’t have the key id_stitcher: models/<name of ID stitcher model>).

Note

The names of the source tables mentioned in the inputs file are for reference purposes only. Change them to the actual table names in your warehouse.

What is ID Stitching?

ID stitching solves the problem of tying together the different identities of the same user across sessions and devices.

[Image: ID-Stitching.png]

Let’s consider that in your data warehouse, you have data coming in from various sources. These sources may include your public website, mobile app, and Salesforce CRM. Now let’s assume you want to know the IDs of users that have browsed your website in private mode. You can obtain these missing links by stitching the same fields together.

When you create a new project, the ID stitcher section is left unspecified. This is because RudderStack creates a “default ID stitcher” behind the scenes, which reads from all input sources and ID types to create the stitched table. You can view all the models in your project by executing the pb show models or pb show dependencies commands.
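
For illustration only, here is a hedged sketch of what an explicit ID stitcher model could look like if you chose to define one in a models file. The model name is hypothetical, the edge sources assume the inputs defined earlier, and the exact edge_sources syntax may vary with your schema version:

models:
  - name: user_id_stitcher        # hypothetical model name
    model_type: id_stitcher
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      edge_sources:
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks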

Test that it’s valid

After saving the edited file, execute the pb compile command. If there is a compilation error, it will be shown on screen; you can also find any errors in the logs/logfile.log file. If there are no errors, SQL files will be generated in the output folder.

Test Connection Access

Execute the following command:

$ pb validate access

If you don’t get any errors, it means you can read from the warehouse and create material tables on it.

Create ID Stitched Table

Creating ID Stitched Table

Once your project compiles successfully, you can run it. To do this, simply execute the pb run command to materialize the ID stitcher table. In the terminal, you’ll be able to see the name of the table created in the data warehouse.

Note

For detailed reference, check out the page Identity Stitching.

Viewing Generated Tables

Please use the view name user_default_id_stitcher (assuming you didn’t change the entity_key from user to something else).

Even if there are multiple material tables, the view will always point to the latest generated table.

Follow these steps to log in to the Snowflake console and check out the generated tables:

  1. Open https://<account>.snowflakecomputing.com/console.

  2. Enter your username and password.

  3. Click on Worksheets in the top navigation bar of the webpage.

  4. In the left sidebar, click on the database and the corresponding schema to view the list of all tables.

  5. Hovering over a table name shows the full table name along with the date and time it was created.

  6. Double-click the appropriate view name to paste it into an empty worksheet.

  7. Then, prefix the pasted view name with select * from and append limit 10; at the end, for example: select * from user_default_id_stitcher limit 10;

  8. Press Cmd+Enter, or click the Run button, to execute the query.

[Image: SnowflakeConsole.png]

Here’s what the columns imply:

  • <entity_name>_main_id: rudderId generated by Profile Builder. If entity_name is user, then it’ll be named user_main_id. Think of a 1-to-many relationship, with one rudderId connected to the different IDs belonging to the same user, such as userId, anonymousId, email, phone, etc.

  • other_id: ID from the input source tables that is stitched to a rudderId.

  • other_id_type: Type of the other ID being stitched (userId, anonymousId, email, etc.).

  • valid_at: Date at which the corresponding ID value occurred in the source tables. For example, the date at which a customer was first browsing anonymously, or when they logged into the CRM with their email ID, etc.

Create Entity Vars

What are Entity Vars?

An entity var is a single value per entity instance, derived by performing a calculation or aggregation over a set of values. An entity var can also be defined as a feature, a piece of useful information about the entity. The Entity Traits 360 table is a collection of all such features.

[Image: Features-Table.png]

Modify ID Stitched Project

Let’s extend the previous example, where the input data was stitched. Now, on top of that table, you want a var table created with these features:

  • Date a customer was last active.

  • Life Time Value of a customer.

Similar to ID stitching, the seed project created by pb init pb-project already includes a sample vars file (profiles.yaml), like the one shown below:

var_groups:
  - name: default_vars
    entity_key: user
    vars:
      - entity_var:
          name: first_seen
          select: min(timestamp::date)
          from: inputs/rsTracks
      - entity_var:
          name: last_seen
          select: max(timestamp::date)
          from: inputs/rsTracks
      - entity_var:
          name: user_lifespan
          description: Life Time Value of a customer
          select: '{{user.Var("last_seen")}} - {{user.Var("first_seen")}}'
      - entity_var:
          name: days_active
          description: No. of days a customer was active
          select: count(distinct timestamp::date)
          from: inputs/rsTracks
models:
  - name: user_profile
    model_type: feature_table_model
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      features:
        - user_lifespan
        - days_active

Here are the fields you can customise:

var_groups:

  • name - A friendly name for the group of entity_vars.

  • entity_key - Entity to which the model belongs, as defined in the project file. In our case, we set it to user.

  • vars - List of all the vars that will be created.

    • entity_var - Vars computed per entity, from which you can select the ones to be added as features in the final output table. Each entity_var has a select expression that is computed over the source table given in from, and the result is stored under name.

      • You can also optionally apply a where filter, as a SQL-like clause (see the sketch after this list).

      • To apply a regular expression when computing a value, you can refer to functions from a macro defined in the file macros.yaml.

      • description - Optional; a description of the computed entity_var.

    • input_var - These have a similar structure to entity_var, except that they’re helper inputs, so they are not stored in the final output table.
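
As a minimal sketch of the optional where filter, assuming the inputs/rsTracks source defined earlier (the var name and the 90-day window are illustrative), an entity_var under vars could look like this:

      - entity_var:
          name: days_active_last_90   # hypothetical var name
          description: No. of days a customer was active in the last 90 days
          select: count(distinct timestamp::date)
          from: inputs/rsTracks
          where: timestamp > dateadd(day, -90, current_date)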

Note

The implicit typecasting behavior of Snowflake and Redshift is different.

models:

  • name - Set this to the name of the feature table you wish to create in the warehouse.

  • model_type - This should be set to feature_table_model.

  • entity_key - Entity to which the model belongs, as defined in the project file. In our case, we set it to user.

  • features - From the above list of entity_vars, these are the final ones that will be added as columns to the output entity traits 360 table.

Note

Any entity_var that is not part of this features list is not stored in the final table and is deleted after the job is run.

Creating Features Table

After saving the edited file, execute the pb run command to materialize the features table. On screen, you’ll see the name of the table created in the data warehouse, which you can view in its console as described in the section Viewing Generated Tables above.

Note

For detailed reference, check out the page Entity Traits 360.

SQL Template Models

You can also create models with custom SQL. Please refer to the SQL Template Models section.

Generated Table

As illustrated in Viewing Generated Tables, you can log in to the warehouse and view the generated table(s). Use the same view name as the feature table model defined in the project’s profiles.yaml file. Even if there are multiple material tables, the view will always point to the latest generated table. Here’s what the columns imply:

  • <entity_name>_main_id: rudderId generated by PB. If entity_name is user, then it’ll be named user_main_id.

  • valid_at: Date when the entry was created for this record.

  • first_seen, last_seen, country, first_name, etc. - All features for which values are computed.

Automate Scheduling

Check out the page on Scheduling.

Conclusion

On this page, we have shown you how to install pb and use it to create your first project. We hope this has been helpful and that you can now start using pb to quickly stitch together your data and create a 360-degree table of your customers. This is just a brief overview; there are many more features and commands available, so be sure to check out the complete documentation for more information.