Entity Traits 360

You may have variety of data about your customers or other entities in your warehouse. This document details how you can extract various features from all the data in your warehouse via RudderStack.

PB entity features is envisioned to have all the computed traits you want to have derived from your warehouse data tables. Each column in the table gives a trait for an entity.

Some of the advantages of using the PB YAML to generate the entity features view:

  • Identity Stitching is done behind the scenes. Output of ID stitcher is incorporated transparently.

  • Hides the complexity of the query execution plan from business logic, making it easier to maintain. Especially, as the number of entity features increase.

  • Generates performant SQL using well researched SQL idioms. As the feature table becomes wider, dependencies become deeper and feature sources come from many teams.

Designed to meaningfully reflect the SQL that you want to generate by abstracting out query complexity. If you have some experience with SQL, its easy to understand how to define the YAML for desired traits. This document will walk you through details how you can define your project YAML file(s).

Now that you have a unified user trait table(s), below are some possible ways you can use:

  • Build various models like churn prediction (RudderStack ML models can be used for this).

  • Analytics queries like demographic view, user activity view etc.

  • Activate this data using RudderStack ReverseETL to send to various cloud destinations [Doc].

  • Build and use RudderStack Audiences (available for beta customers).

Creating Entity Features

A basic structure of any entity feature YAML would look like this:

var_groups:
  - name: <string>
    entity_key: <string>
    vars: [entity_vars, input_vars]

Each feature is defined by using entity_var. Vars are grouped under var_groups, so that some config keys can be shared across vars.

Example Entity Feature YAML:

var_groups:
  - name: default_vars
    entity_key: user
    vars:
      # Latest Country
        - entity_var:
            name: max_timestamp_for_country
            select: max(timestamp)
            from: inputs/identifies_1
            where: properties_country is not null and properties_country != ''
            is_feature: false

        - entity_var:
            name: latest_country
            select: max(properties_country)
            from: inputs/identifies_1
            where: timestamp = max_timestamp_for_country and main_id is not null
        # If you have more granular demographic data, then you can get more accurate
        # information about the customer in the similar manner using feature YAML.

        # Active Time
        - entity_var:
            name: first_active
            select: min(timestamp)
            from: inputs/identifies_1
            is_feature: false

        - entity_var:
            name: last_active
            select: max(timestamp)
            from: inputs/identifies_1
            is_feature: false

        - entity_var:
            name: active_days
            select:  "{{ subtract_range('last_active', 'first_active')}}"
            # subtract_range is a macro, defined below in a separate YAML file

This will create an entity var table. As shown below, this table brings together all features defined, across all vars in var_groups.

../_images/entity-var-table.png

Var groups

type

object

properties

  • name

Name of var_group, that will be used to uniquely identify it.

type

string

  • entity_key

Specify the entity to be used.

type

time

  • vars

This section is used to specify variables, with the

help of entity_var and input_var.

Aggregation on stitched ID type is done by default and is implicit.

type

list

entity_var

This is a special type of variable that can be used to store some data temporarily.

To store data permanently, they can be specified as a ‘feature’ in the next key.

While creating this, grouping on main ID type of your ID Stitcher model is done by default.

properties

  • name

Name of the entity_var, this will be used to uniquely identify it.

type

string

  • select

The column name/value you want to select from the table. This defines the actual

value that will be stored in the variable. You can use simple SQL expressions as well here.

You can also select an entity var as {{entityName.Var(“entity_var”)}}.

It has to be an aggregate operation that ensures the output is a unique value for a given main_id. Ex: min(timestamp), count(*), sum(amount) etc.

This holds true even when a window function (optional) is used. Ex: first_value(), last_value() etc are valid while rank(), row_number() etc are not valid and give unpredictable results.

type

string

  • from

Here you have to specify the info of the table to be used. You can either refer to

another model from the same YAML or some other table specified in input YAML.

type

list

  • where

Any filters you want to apply on the input table before selecting any value.

This needs to be SQL compatible and must be respecting the data type of the table.

type

string

  • default

The default value that needs to be stored in case no data matches the filter.

type

string

  • description

Description of the entity_var, and it is always good to have a good description for your

entity_var. This helps in understanding the logic without overhead of understanding the

value SQL.

type

string

  • is_feature

Whether the entity_var is a feature.

By default it is set to true.

type

boolean

  • window

If you want to use window function, that needs to be specified with the help of this key.

Window function usually operates over a window of rows so we need a subgroup of

rows for this to work. Apart from some rank-related functions, the rows in window

need at-least one column in it. Window functions in SQL usually have both partition by and order by properties.

But for entity_var, partition_by is added with main_id as default; So, adding partition_by manually

is not supported here. If you need partitioning on other columns too,

check out input_vars where partition_by on arbitrary and multiple columns is supported.

type

object

properties

  • order_by

This orders rows within the window.

type

string

Note

As this does a group by main_id, the output would be one row per main_id.

input_var

With syntax almost identical to entity_var, the difference is that instead of each value being associated a row of the feature table,

it’s associated with a row of the specified input. Effectively, while a entity_var can be thought of as adding a helper column to the feature table,

an input_var can be thought of as adding a helper column to the input.

properties

  • name

This is the name in which you’ll store the retrieved data.

type

string

  • select

The data that will be stored in the name.

type

string

  • from

A reference to the source table from which data is to be fetched.

type

list

  • where

Optional, in case you want to apply any conditions for fetching data (same as where clause in SQL).

type

string

  • default

Optional, a default value for any entities for which the calculated value would otherwise be NULL.

type

string

  • description

Optional, a textual description of what it does.

type

string

  • window

Optional, for specifying a window over which the value should be calculated.

type

object

properties

  • partition_by

Optional, list of sql expressions to use in partitioning the data.

type

string

  • order_by

Optional, list of sql expressions to use in ordering the data.

type

string

Note

In window option, main_id is not added by default unlike tablevar; it can be any arbitrary list of columns from the input table. So if a feature should be partitioned by main_id, that should be added with the partition_by key.

Warning

If an entity_var is expected to be derived from more than one input_var - then all the input_var need to be on the same table.

Note

There won’t be any group by main_id operator in this, and it is used only to apply some transformations on already calculated tablevars/featurevars.

Default values

When defining default values for an entity_var or input_var, enclose string values in single quotes followed by double quotes to avoid SQL failure. Non-string values can be used without quotes. For example:

entity_var:
    name: campaign_source_first_touch
    default: "'organic'"

entity_var:
    name: user_rank
    default: -1

Window Functions

In PB, Profiles YAML have support to use window functions in entity var and input var definitions.

What is a window function: A window function operates on a group (“window”) of related rows. Each time a window function is called, it is passed a row (the current row in the window) and the window of rows that contain the current row. The window function returns one output row for each input row. The values returned are calculated by using values from the sets of rows in that window. For each row in the table, the window defines a set of rows that is used to compute additional attributes. A window is defined using a window specification (the OVER clause), and is based on three main concepts:

  • Window partitioning, which forms groups of rows (PARTITION clause)

  • Window ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)

  • Window frames, which are defined relative to each row to further restrict the set of rows (ROWS specification)

We already had the support of specifying PARTITION and ORDER BY while using the window function to write the features in YAML. Now we can specify the ROWS as well which is known as frame clause. Most of the warehouses supports two types of window functions: aggregate and ranking.

Why this is important: Snowflake does not enforce user to define the cumulative or sliding frames, and it considers ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as default cumulative window frame. But user can override this by defining the frame manually as well. But on the other hand, in Redshift if an ORDER BY clause is used for an aggregate function, an explicit frame clause is required but if a ranking window functions is specified then frame clause should not be specified. So if there is any use of window function then we have to be aware of using frame_clause . It is not very critical for Snowflake but it is very critical for Redshift as this will result in errors.

On Redshift aggregate window function, list given below, while using any function from the list, please specify the frame_clause:

  • AVG

  • COUNT

  • CUME_DIST

  • DENSE_RANK

  • FIRST_VALUE

  • LAG

  • LAST_VALUE

  • LEAD

  • LISTAGG

  • MAX

  • MEDIAN

  • MIN

  • NTH_VALUE

  • PERCENTILE_CONT

  • PERCENTILE_DISC

  • RATIO_TO_REPORT

  • STDDEV_POP

  • STDDEV_SAMP (synonym for STDDEV)

  • SUM

  • VAR_POP

  • VAR_SAMP (synonym for VARIANCE)

On Redshift ranking window functions given below, while using any function from the list, please DO NOT specify the frame_clause:

  • DENSE_RANK

  • NTILE

  • PERCENT_RANK

  • RANK

  • ROW_NUMBER

How to specify frame_clause:

- entity_var:
    name: first_num_b_order_num_b
    select: first_value(tbl_c.num_b)
    from: inputs/tbl_c
    default: -1
    where: tbl_c.num_b >= 10
    window:
        order_by:
        - tbl_c.num_b desc
        frame_clause: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- entity_var:
    name: first_num_b_order_num_b_rank
    select: rank()
    window:
        partition_by:
        - first_num_b_order_num_b > 0
        order_by:
        - first_num_b_order_num_b asc

Please pay attention how frame_clause is specified in first entity_var and not in the second one.

Defining ML Features

PythonML

More information on ProfilesML at RudderStack Docs.

Models from ML Notebooks

You can take ML Notebooks in Jupyter and add ML/AI capabilities to it.

Complete Traits 360 degree view

Traits defined using entity_vars and ML models can all be unified using traits360 models. The corresponding entries are made in pb_project.yaml. Please check out the section Entity Traits 360 Models for details.

Feature Table (legacy / deprecated)

PB Entity Feature YAML allows you to define what traits needs to be evaluated from provided data sources to create and keep feature table(s) updated. The section on Identity Stitching detailed how you can leverage RudderStack to do ID stitching to identify your users. Each identified user now can have multiple traits that you want to evaluate and maintain.

Basic things to have before starting up with the feature YAML:

  • name: This is going to be used in the naming of final feature table in warehouse. For example if name is WACampaignFeatures then the name of the feature table and final view in the warehouse will be Material_WACampaignFeatures_<some_hash>.

  • model_type: should be feature_table_model.

A basic structure of any feature YAML would look like this:

models:
  - name: <string>
    model_type: feature_table_model
    model_spec:
        validity_time: <time>
        entity_key: <string>
        features: <list>

Example:

models:
  - name: e_comm_feature
    model_type: feature_table_model
    model_spec:
        validity_time: 24h # 1 day
        entity_key: user
        features:
          - latest_country
          - active_days

model_spec:

Here you have to give all the specifications for your expected model type. In the case of feature_table_model the specifications are as follows:

name

Name of the feature table created on the DW. Say if you define this as feature_table_model

then the output table will be named something like

Material_final_feature_table_<rest of generated table name>.

type

string

model_type

Define the types of models. As we are creating a Feature Table, set this to feature_table_model.

type

string

model_spec

Here you have to give all the specifications for your expected model type.

In the case of feature_table_model the specifications are as follows:

type: list

properties

  • validity_time

This specifies the validity of generated feature table.

Once the validity is expired, scheduling takes care of generating new tables.

for example: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days

type

time

  • entity_key

Here you have to give the reference of entity to be used.

An entity can either refer to another ID stitcher model specified in the same project or an

ID stitcher model from same YAML.

type

string

  • features

This section is used to specify list of``entity_var``,

that act as a feature. An entity_var is by default defined as a feature.

type

list

Defining Macros (optional)

Example Macros YAML:

macros:
    - name: subtract_range
        inputs:
        - a
        - b
        value: "{{a}} - {{b}}"

macros

Here you can specify the macros that can be used later in features section as helpers.

Macros can be used to encapsulate complex processing logic directly within SQL that

can be reused. You can consider these as functions.

type: list

properties

  • name

Name of macro, this will be used to uniquely identify and reference this macro in features section.

type

string

  • inputs

If the logic you are trying to encapsulate works on some input variable, then you need to define those here.

type

list

  • value

Actual logic, this must be a SQL expression.

Any reference to the inputs within it should be surrounded by double curly brackets.

type

string

Feature Table Code Examples

Please contact our team for access to example projects.