Entity Traits 360

You may have variety of data about your customers or other entities in your warehouse. This document details how you can extract various features from all the data in your warehouse via RudderStack.

PB entity features is envisioned to have all the computed traits you want to have derived from your warehouse data tables. Each column in the table gives a trait for an entity.

Some of the advantages of using the PB YAML to generate the entity features view:

Identity Stitching is done behind the scenes. Output of ID stitcher is incorporated transparently.
Hides the complexity of the query execution plan from business logic, making it easier to maintain. Especially, as the number of entity features increase.
Generates performant SQL using well researched SQL idioms. As the feature table becomes wider, dependencies become deeper and feature sources come from many teams.

Designed to meaningfully reflect the SQL that you want to generate by abstracting out query complexity. If you have some experience with SQL, its easy to understand how to define the YAML for desired traits. This document will walk you through details how you can define your project YAML file(s).

Now that you have a unified user trait table(s), below are some possible ways you can use:

Build various models like churn prediction (RudderStack ML models can be used for this).
Analytics queries like demographic view, user activity view etc.
Activate this data using RudderStack ReverseETL to send to various cloud destinations [Doc].
Build and use RudderStack Audiences (available for beta customers).

Creating Entity Features

A basic structure of any entity feature YAML would look like this:

var_groups:
  - name: <string>
    entity_key: <string>
    vars: [entity_vars, input_vars]

Each feature is defined by using entity_var. Vars are grouped under var_groups, so that some config keys can be shared across vars.

Example Entity Feature YAML:

var_groups:
  - name: default_vars
    entity_key: user
    vars:
      # Latest Country
        - entity_var:
            name: max_timestamp_for_country
            select: max(timestamp)
            from: inputs/identifies_1
            where: properties_country is not null and properties_country != ''
            is_feature: false

        - entity_var:
            name: latest_country
            select: max(properties_country)
            from: inputs/identifies_1
            where: timestamp = max_timestamp_for_country and main_id is not null
        # If you have more granular demographic data, then you can get more accurate
        # information about the customer in the similar manner using feature YAML.

        # Active Time
        - entity_var:
            name: first_active
            select: min(timestamp)
            from: inputs/identifies_1
            is_feature: false

        - entity_var:
            name: last_active
            select: max(timestamp)
            from: inputs/identifies_1
            is_feature: false

        - entity_var:
            name: active_days
            select:  "{{ subtract_range('last_active', 'first_active')}}"
            # subtract_range is a macro, defined below in a separate YAML file

This will create an entity var table. As shown below, this table brings together all features defined, across all vars in var_groups.

Var groups

type	object
properties
name	Name of var_group, that will be used to uniquely identify it.
	type	string
entity_key	Specify the entity to be used.
	type	time
vars	This section is used to specify variables, with the help of entity_var and input_var. Aggregation on stitched ID type is done by default and is implicit.
	type	list

entity_var

This is a special type of variable that can be used to store some data temporarily. To store data permanently, they can be specified as a ‘feature’ in the next key. While creating this, grouping on main ID type of your ID Stitcher model is done by default.
properties
name	Name of the entity_var, this will be used to uniquely identify it.
	type	string
select	The column name/value you want to select from the table. This defines the actual value that will be stored in the variable. You can use simple SQL expressions as well here. You can also select an entity var as {{entityName.Var(“entity_var”)}}. It has to be an aggregate operation that ensures the output is a unique value for a given main_id. Ex: min(timestamp), count(*), sum(amount) etc. This holds true even when a window function (optional) is used. Ex: first_value(), last_value() etc are valid while rank(), row_number() etc are not valid and give unpredictable results.
	type	string
from	Here you have to specify the info of the table to be used. You can either refer to another model from the same YAML or some other table specified in input YAML.
	type	list
where	Any filters you want to apply on the input table before selecting any value. This needs to be SQL compatible and must be respecting the data type of the table.
	type	string
default	The default value that needs to be stored in case no data matches the filter.
	type	string
description	Description of the entity_var, and it is always good to have a good description for your entity_var. This helps in understanding the logic without overhead of understanding the value SQL.
	type	string
is_feature	Whether the entity_var is a feature. By default it is set to true.
	type	boolean
window	If you want to use window function, that needs to be specified with the help of this key. Window function usually operates over a window of rows so we need a subgroup of rows for this to work. Apart from some rank-related functions, the rows in window need at-least one column in it. Window functions in SQL usually have both partition by and order by properties. But for entity_var, partition_by is added with main_id as default; So, adding partition_by manually is not supported here. If you need partitioning on other columns too, check out input_vars where partition_by on arbitrary and multiple columns is supported.
	type	object
	properties
	order_by	This orders rows within the window.
		type	string

Note

As this does a group by main_id, the output would be one row per main_id.

input_var

With syntax almost identical to entity_var, the difference is that instead of each value being associated a row of the feature table, it’s associated with a row of the specified input. Effectively, while a entity_var can be thought of as adding a helper column to the feature table, an input_var can be thought of as adding a helper column to the input.
properties
name	This is the name in which you’ll store the retrieved data.
	type	string
select	The data that will be stored in the `name`.
	type	string
from	A reference to the source table from which data is to be fetched.
	type	list
where	Optional, in case you want to apply any conditions for fetching data (same as where clause in SQL).
	type	string
default	Optional, a default value for any entities for which the calculated value would otherwise be NULL.
	type	string
description	Optional, a textual description of what it does.
	type	string
window	Optional, for specifying a window over which the value should be calculated.
	type	object
	properties
	partition_by	Optional, list of sql expressions to use in partitioning the data.
		type	string
	order_by	Optional, list of sql expressions to use in ordering the data.
		type	string

Note

In window option, main_id is not added by default unlike tablevar; it can be any arbitrary list of columns from the input table. So if a feature should be partitioned by main_id, that should be added with the partition_by key.

Warning

If an entity_var is expected to be derived from more than one input_var - then all the input_var need to be on the same table.

Note

There won’t be any group by main_id operator in this, and it is used only to apply some transformations on already calculated tablevars/featurevars.

Default values

When defining default values for an entity_var or input_var, enclose string values in single quotes followed by double quotes to avoid SQL failure. Non-string values can be used without quotes. For example:

entity_var:
    name: campaign_source_first_touch
    default: "'organic'"

entity_var:
    name: user_rank
    default: -1

Window Functions

In PB, Profiles YAML have support to use window functions in entity var and input var definitions.

What is a window function: A window function operates on a group (“window”) of related rows. Each time a window function is called, it is passed a row (the current row in the window) and the window of rows that contain the current row. The window function returns one output row for each input row. The values returned are calculated by using values from the sets of rows in that window. For each row in the table, the window defines a set of rows that is used to compute additional attributes. A window is defined using a window specification (the OVER clause), and is based on three main concepts:

Window partitioning, which forms groups of rows (PARTITION clause)
Window ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)
Window frames, which are defined relative to each row to further restrict the set of rows (ROWS specification)

We already had the support of specifying PARTITION and ORDER BY while using the window function to write the features in YAML. Now we can specify the ROWS as well which is known as frame clause. Most of the warehouses supports two types of window functions: aggregate and ranking.

Why this is important: Snowflake does not enforce user to define the cumulative or sliding frames, and it considers ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as default cumulative window frame. But user can override this by defining the frame manually as well. But on the other hand, in Redshift if an ORDER BY clause is used for an aggregate function, an explicit frame clause is required but if a ranking window functions is specified then frame clause should not be specified. So if there is any use of window function then we have to be aware of using frame_clause . It is not very critical for Snowflake but it is very critical for Redshift as this will result in errors.

On Redshift aggregate window function, list given below, while using any function from the list, please specify the frame_clause:

AVG
COUNT
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAG
LAST_VALUE
LEAD
LISTAGG
MAX
MEDIAN
MIN
NTH_VALUE
PERCENTILE_CONT
PERCENTILE_DISC
RATIO_TO_REPORT
STDDEV_POP
STDDEV_SAMP (synonym for STDDEV)
SUM
VAR_POP
VAR_SAMP (synonym for VARIANCE)

On Redshift ranking window functions given below, while using any function from the list, please DO NOT specify the frame_clause:

DENSE_RANK
NTILE
PERCENT_RANK
RANK
ROW_NUMBER

How to specify frame_clause:

- entity_var:
    name: first_num_b_order_num_b
    select: first_value(tbl_c.num_b)
    from: inputs/tbl_c
    default: -1
    where: tbl_c.num_b >= 10
    window:
        order_by:
        - tbl_c.num_b desc
        frame_clause: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- entity_var:
    name: first_num_b_order_num_b_rank
    select: rank()
    window:
        partition_by:
        - first_num_b_order_num_b > 0
        order_by:
        - first_num_b_order_num_b asc

Please pay attention how frame_clause is specified in first entity_var and not in the second one.

Defining ML Features

PythonML

More information on ProfilesML at RudderStack Docs.

Models from ML Notebooks

You can take ML Notebooks in Jupyter and add ML/AI capabilities to it.

Complete Traits 360 degree view

Traits defined using entity_vars and ML models can all be unified using traits360 models. The corresponding entries are made in pb_project.yaml. Please check out the section Entity Traits 360 Models for details.

Feature Table (legacy / deprecated)

PB Entity Feature YAML allows you to define what traits needs to be evaluated from provided data sources to create and keep feature table(s) updated. The section on Identity Stitching detailed how you can leverage RudderStack to do ID stitching to identify your users. Each identified user now can have multiple traits that you want to evaluate and maintain.

Basic things to have before starting up with the feature YAML:

name: This is going to be used in the naming of final feature table in warehouse. For example if name is WACampaignFeatures then the name of the feature table and final view in the warehouse will be Material_WACampaignFeatures_<some_hash>.
model_type: should be feature_table_model.

A basic structure of any feature YAML would look like this:

models:
  - name: <string>
    model_type: feature_table_model
    model_spec:
        validity_time: <time>
        entity_key: <string>
        features: <list>

Example:

models:
  - name: e_comm_feature
    model_type: feature_table_model
    model_spec:
        validity_time: 24h # 1 day
        entity_key: user
        features:
          - latest_country
          - active_days

model_spec:

Here you have to give all the specifications for your expected model type. In the case of feature_table_model the specifications are as follows:

name

Name of the feature table created on the DW. Say if you define this as feature_table_model

then the output table will be named something like

Material_final_feature_table_<rest of generated table name>.

type

string

model_type

Define the types of models. As we are creating a Feature Table, set this to `feature_table_model`.
type	string

model_spec

Here you have to give all the specifications for your expected model type. In the case of `feature_table_model` the specifications are as follows:
type: list
properties
validity_time	This specifies the validity of generated feature table. Once the validity is expired, scheduling takes care of generating new tables. for example: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days
	type	time
entity_key	Here you have to give the reference of entity to be used. An entity can either refer to another ID stitcher model specified in the same project or an ID stitcher model from same YAML.
	type	string
features	This section is used to specify list of``entity_var``, that act as a feature. An entity_var is by default defined as a feature.
	type	list

Defining Macros (optional)

Example Macros YAML:

macros:
    - name: subtract_range
        inputs:
        - a
        - b
        value: "{{a}} - {{b}}"

macros

Here you can specify the macros that can be used later in `features` section as helpers. Macros can be used to encapsulate complex processing logic directly within SQL that can be reused. You can consider these as functions.
type: list
properties
name	Name of macro, this will be used to uniquely identify and reference this macro in `features` section.
	type	string
inputs	If the logic you are trying to encapsulate works on some input variable, then you need to define those here.
	type	list
value	Actual logic, this must be a SQL expression. Any reference to the inputs within it should be surrounded by double curly brackets.
	type	string

Feature Table Code Examples

Please contact our team for access to example projects.