Troubleshooting
This section describes how to debug and troubleshoot the execution of data models.
General
Even though my DW username and password are correct, I am getting an ‘Authentication FAILED’ error while executing run/compile commands. Why?
Even if your DW credentials are correct, Snowflake throws this error if your user does not have the rights to read and write data. To resolve it, ask your administrator to change your role or grant these privileges.
After my run query completed, I tried to view the results on the DW using the name displayed on screen. I typed SELECT * FROM "MY_WAREHOUSE"."MY_SCHEMA"."Material_domain_profile_c0635987_6" and am now getting the error “Object does not exist or not authorized”.
Remove the double quotes and run the query again, that is: SELECT * FROM MY_WAREHOUSE.MY_SCHEMA.Material_domain_profile_c0635987_6. Quoted identifiers are case-sensitive in Snowflake, whereas unquoted names resolve to the uppercase identifiers the tool actually creates.
When I try to create the Hello World project by executing the command pb init pb-project, I get the message “Error: mkdir HelloPbProject : file exists”.
That’s because the directory HelloPbProject already exists. You can rename or remove that directory, or create a project with another name by passing it as a parameter.
When I try to execute the command pb init pb-project, I get an error that I don’t have permission for mkdir.
This is a permissions issue on your machine; you’ll have to grant the relevant permissions on the directory.
I created a project and ran the command pb validate access after setting the correct credentials in the connection. Yet, I am getting the error rudder_events_production.web.identifies does not exist or is not accessible.
These are sample table names that do not exist. Change rudder_events_production.web.identifies to an actual table name in your schema.
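For example, a minimal sketch of an inputs.yaml entry (the input name rsIdentifies and the app_defaults layout are illustrative; match them to your own project files):
inputs:
  - name: rsIdentifies
    app_defaults:
      # replace the sample name with a real table in your warehouse
      table: MY_WAREHOUSE.MY_SCHEMA.IDENTIFIES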
While trying to add a feature table, I get an error at line 501, but I do not have that many lines in my YAML.
The line number refers to the generated SQL file in the output folder. Check the console for the exact file name with the sequence number in the path.
I executed the auto migrate command and now I see a bunch of nested original_project_folder entries. Are we migrating through each different version of the tool?
This is a symlink to the original project. Click on it in the Finder (on Mac) to open the original project folder.
How do I migrate the siteconfig file?
If your project is on schema 9: Move the siteconfig from the old home-directory folder to the new one: mv ~/.wht/siteconfig.yaml ~/.pb/siteconfig.yaml
If your project is on schema 18: Migrating your project migrates the default siteconfig as well.
For projects on schemas newer than 18: There is no need to migrate the siteconfig.
I am getting an ssh: handshake failed error when referring to a public project hosted on GitHub. It throws the error for the https:// path and works fine for the ssh: path. I have set up a token in GitHub and added it to the siteconfig.yaml file, but I still get this error.
You need to follow a different format for gitcreds: in the siteconfig. See Site Configuration for the format.
If you still get an error after changing the siteconfig, clear the WhtGitCache folder inside the directory containing the siteconfig file.
If I add filters to id_types in the project file, do all rows that include any of those values get filtered out of the analysis, or is it just the specific value of that ID type that gets filtered?
The PB tool doesn’t extract rows; it extracts ID pairs from rows. So if you had a row with email, user_id, and anonymous_id, and the anonymous_id is excluded, the PB tool still extracts the (email, user_id) edge from that row.
Should there be only one project in a Git repo?
That is not necessary. You can create multiple folders in your project repo and create a different project in each folder. For the scheduler to know which subfolder’s project to run, the Git URL needs to be in one of the following forms:
https://github.com/<org-name>/<repo-name>/tree/<branch-name>/path/to/project
https://github.com/<org-name>/<repo-name>/tag/<tag-name>/path/to/project
https://github.com/<org-name>/<repo-name>/commit/<commit-hash>/path/to/project
Can I specify any Git account, such as CodeCommit or Bitbucket, when configuring a project?
Presently, we only support repos hosted on GitHub.
Command Progress & Lifecycle
I executed a command and it is taking too long. Is there any way I can kill the processes on the DW?
If queries such as discover or run are taking too long to execute, it could be due to other queries running simultaneously on the same warehouse. To clear them up, open the Queries tab on your DW (Snowflake) and manually kill the long-running processes.
Due to the huge size of my data, I am experiencing long execution times. This causes my screen to lock, preventing the process from completing.
You can use the screen command on UNIX/macOS to detach your session and let the process run in the background.
This frees your terminal for other tasks, avoids screen lockouts, and allows the query to complete successfully.
Here are some examples:
To start a new screen session and execute a process in detached mode: screen -L -dmS profiles_rn_1 pb run (the -L flag enables logging, -dmS starts a detached daemon session, and profiles_rn_1 is the name given to the session).
To list all active screen sessions: screen -ls
To reattach to a detached screen session: screen -r [PID or screen name]
Does your tool have logging enabled? We need it for security and compliance purposes.
For nearly all commands executed by the CLI (init, validate access, compile, run, cleanup, etc.), logging is enabled by default. All the output shown on screen is also stored in the file logfile.log inside the logs directory of your project folder.
This includes successful as well as failed runs. Newer entries are appended to the end of the file after each command execution.
Exceptions where logs aren’t stored are:
query: the log file stores “Printing output” but not the actual DB output (the SQL queries themselves are still logged in the warehouse on Snowflake).
help: for any command.
On the warehouse, I see a lot of material_user_id_stitcher_ tables generated in the rs_profiles schema. How do we identify the latest ID-stitched table?
The view user_id_stitcher always points to the latest generated ID stitcher table, so you can check its definition to see the exact table name it refers to.
How can I remove material tables that are no longer needed?
To clean up all materials older than, say, 10 days, execute this command: pb cleanup materials -r 10
The minimum value you can set here is 1. So if you’ve just run the ID stitcher today, you can remove all older materials using pb cleanup materials -r 1.
Which tables and views in the Profiles schema are important and shouldn’t be deleted?
material_registry
material_registry_<number>
pb_ct_version
ptr_to_latest_seqno_cache
wht_seq_no
wht_seq_no_<number>
Views whose names match your models in the YAML files.
Material tables from the latest run (you may use the pb cleanup materials command to remove older ones).
The CLI was running earlier; now it’s not able to access tables. Does it delete the view and create it again?
Yes, each time the project runs, it creates a new material table and replaces the view. So you need to grant select on future views/tables in the respective schema (e.g., Snowflake’s GRANT SELECT ON FUTURE TABLES IN SCHEMA ... and GRANT SELECT ON FUTURE VIEWS IN SCHEMA ...), not just on the existing ones.
Does the CLI support downloading a Git repo using the siteconfig before executing pb run, or do I have to manually clone the repo first?
You can simply pass the Git URL as a parameter instead of the project’s path, that is: pb run -p git@.....
In the material registry table, what does status: 2 mean?
status: 2 means that the material has successfully completed its run; status: 1 means that the material did not complete its run.
I am a Windows user getting this error: Error: while trying to migrate project: applying migrations: symlink <path>: A required privilege is not held by the client.
Your user requires the privilege to create symlinks. Either grant extra privileges to the user, or try using an Admin user on PowerShell. If that doesn’t help, try installing and using the tool via WSL.
When executing the run command, I get the message: Please use sequence number ... to resume this project in future runs. Does it mean that a user can exit using Ctrl+C and, if they later pass this seq_no, the run will continue from where it was cancelled?
Yes. The pb run --seq_no <> flag allows you to provide a sequence number when running the project.
This flag can either resume an existing run or use the same context to run it again.
With the introduction of time grain models, multiple sequence numbers can be assigned and used for a single project run.
What flag should I use if I want to force a run for the same end time, even if a previous run exists?
Please execute pb run --force --model_refs models/my_id_stitcher,entity/user/user_var_1,entity/user/user_var_2,...
Can the hash change even if the schema version didn’t change?
Yes. The hash depends on the project’s implementation, while the schema version reflects the project’s YAML layout.
Compile Command
I was trying to execute the compile command by fetching the repo via a Git URL. Even though I have the public key in the Git repo project and the private key in my local directory, I am getting the error: making git new public keys: ssh: no key found.
Add the OpenSSH private key to your siteconfig file. If you then get the error could not find expected, correct the indentation levels of your siteconfig file.
While trying to segregate the ID stitcher and feature table into separate model files, I am getting the error: mapping values are not allowed in this context.
This is due to spacing errors in the YAML. Correct the spacing; you may create a new project using the pb init pb-project command and compare against the spacing there. Also, ensure you haven’t missed any keys. A common trigger is shown in the sketch below.
ID Stitcher
There are many large connected components in my DW. To increase the accuracy of the stitched data, I want to increase the number of iterations. Is that possible?
The default value of the largest diameter, i.e. the longest path length in a connected component, is 30. To increase it, you can define the key max_iterations in the ID stitcher YAML file and set it to the maximum diameter of your connected components. However, note that with a large number of iterations the algorithm can give incorrect results.
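A minimal sketch of where such a key could sit in the ID stitcher model file (the model name and the exact placement within model_spec are assumptions; check the reference for your schema version):
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      # assumption: max_iterations is a model_spec-level key
      max_iterations: 50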
Will I have to write a different query each time to view the data of the created table?
No. You can use the view name, which always points to the latest created material table. For example, if you’ve defined user_stitching in your model YAML file, execute SELECT * FROM MY_WAREHOUSE.MY_SCHEMA.user_stitching.
Do you have any anomaly detection in ID stitching? For example, if the average number of distinct nodes in your ID graph is 30 and one rudder ID has 500, does it throw out that rudder ID?
Not currently; that threshold is hard to set automatically without some data investigation. For example, a cluster might be big because of a misconfiguration that you need to fix. We might add this feature in the future.
In my model, I have set the key validity_time: 24h. What happens when the validity of the generated tables expires? Will re-running the ID stitcher generate the same hash until the validity expires? If I run the automatic scheduler via the web and it’s within the validity window, will it skip generating a new table?
Firstly, the hash does not depend on the timestamp; it depends on the YAML in the underlying code. That’s why the material name is material_NAME_HASH_SEQNO, where only the SEQNO depends on the timestamp.
Secondly, a material generated for a specific timestamp (aside from the timeless timestamp) won’t be regenerated unless you do a pb run --force. The CLI checks whether the material you are requesting already exists in the DB and, if it does, returns it. The validity_time is an extension of that.
For a model with validity_time: 24h and inputs that all have timestamp columns: if you request a material for now, but one was generated for that model 5 minutes ago, the CLI will return that one instead.
Using the CLI to run a model always generates a material for a certain timestamp; if you don’t specify a timestamp, it uses the current one.
So, for a model with validity_time vt and inputs that all have timestamp columns: if you request a material for t1, but one already exists for t0 where t1 - vt <= t0 <= t1, the CLI returns the existing one. If multiple materials satisfy that requirement, it returns the one with the timestamp closest to t1.
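A minimal sketch of this key in a model file (the model name is illustrative, and the placement of validity_time within model_spec is an assumption):
models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      # materials newer than 24h are reused instead of regenerated
      validity_time: 24h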
I want to use customer_id in place of main_id, so I changed the name in pb_project.yaml. However, I am now getting this error:
Error: validating project sample_attribution: listing models for child source models/: error listing models: error building model domain_profile_id_stitcher: main id type main_id not in project id types
In addition to the changes in pb_project.yaml, you also need to set main_id_type: customer_id in the ID stitcher model YAML file.
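A sketch of that change (the model name domain_profile_id_stitcher comes from the error above; the surrounding keys are illustrative):
models:
  - name: domain_profile_id_stitcher
    model_type: id_stitcher
    model_spec:
      # use customer_id instead of the default main_id
      main_id_type: customer_id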
I ran the ID stitcher but am not able to see the final generated table. I cannot see it in the list of tables in Snowflake.
Check the Views dropdown inside Databases in the left sidebar. If the name defined in the model specification file was domain_profile_id_stitcher, that is the name you should see. If it is still not visible, change the role using the dropdown menu in the upper right section.
I am using a view as an input source. However, when executing, I get an error that the view is not accessible, even though it exists in the DB.
Views have to be refreshed from time to time. On your warehouse, recreate them and also execute a select * on them.
I got this error during execution: processing no result iterator: pq: cannot change number of columns in view.
The output view name already exists because of another project. To fix this, drop the view or change its name, and try again.
I got this error during execution: creating Latest View of model 'model_name': processing no result iterator: pq: cannot change data type of view column "valid_at"
Drop the view domain_profile in your warehouse and execute your command again.
I am getting the error: processing no result iterator: pq: column "rudder_id" does not exist.
This occurs when you first execute a project whose model has main_id in it, and then run another project with the same model name but without main_id. To resolve this, drop the earlier materials using the cleanup materials command.
When are the ID type filters from pb_project.yaml applied? Is it after the UPPER(SUBSTRING(context_page_path,2,2)) || '-' || user_id?
Yes, assuming that the UPPER(SUBSTR...) expression is the SELECT SQL defined in the input, the filtering occurs after that step. The flow is as follows (see the sketch after these steps):
1. The SQL defined in the inputs selects the raw values from the input table into the edge table.
2. The edge table is processed based on the filters specified in the pb_project.yaml file to create the stitching-ready table.
3. The cleaned edge table is used to build the ID graph.
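For reference, a sketch of what such filters can look like in pb_project.yaml (the values are illustrative; see the project-file reference for all supported options):
id_types:
  - name: user_id
    filters:
      # drop obviously bad values before edges enter the ID graph
      - type: exclude
        value: ""
      - type: exclude
        value: "unknown"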
Should we rerun the stitching process from a clean slate once all user_ids have been sorted out with market prefixes? This ensures that these users are captured separately instead of being grouped under one rudder_id.
It is recommended to use the --rebase-incremental flag and re-run the stitching process from scratch.
While it may not be necessary in all cases, doing so ensures a fresh start and avoids any potential pooling of users under a single rudder_id.
Note that if you make any changes to the YAML configuration, such as modifying the entity or model settings, the model’s hash updates automatically. However, some changes may not be captured automatically (e.g., if you didn’t change the YAML but simply edited column values in the input table), so manually rebasing is a good practice.
When running, I get the error: Could not find parent table for alias “<DATABASE OBJECT NAME>”.
This is caused by accessing cross-database objects (views/tables) as inputs, which is only supported on Redshift RA3 node type clusters. See here for reference.
To resolve this, upgrade the cluster to the RA3 node type, or copy the data from the source objects into the DB specified in the siteconfig file.
I have a source table in which email is sometimes stored in the user_id column, so that field has a mix of different ID types. I have to tie it to another table where email is a separate field. When doing so, I end up with two separate entries for email: one as type email and one as user_id.
Add the following to the inputs tables where this occurs:
- select: case when lower(user_id) like '%@%' then lower(user_id) else null end
  type: email
  entity: user
  to_default_stitcher: true
How do I validate the results of ID stitching?
Please contact us if you need help in validating the clusters.
I am an ecommerce company setting up a project. Which identifiers would you recommend that I include in the ID stitcher?
We suggest including identifiers that are unique for every user and can be tracked across different platforms/devices. These might include, but are not limited to:
Email ID
Phone number
Device ID
Anonymous ID
Usernames
These identifiers can be specified in the profiles.yaml file in the identity stitching model (a sketch of the corresponding declarations follows).
Remember, the goal of identity stitching is to create a unified user profile by correlating all the different user identifiers into one canonical identifier, so that all the data related to a particular user or entity can be associated with that user or entity.
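As an illustrative sketch, one common way to declare such identifiers as ID types on the entity in pb_project.yaml (the names are placeholders; the exact layout depends on your schema version):
entities:
  - name: user
    id_types:
      - email
      - phone_number
      - device_id
      - anonymous_id
      - user_name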
Feature Table
How can I run a feature table without running its dependencies?
For instance, suppose you have to re-run the user entity_var days_active and the rsTracks input_var last_seen for a previous run with seq_no 18.
Then you can execute the following command:
$ pb run --force --model_refs entity/user/days_active,inputs/rsTracks/last_seen --seq_no 18
Is it possible to run the feature table model independently, or does it require running alongside the ID stitcher model?
If you provide a specific timestamp for the run, instead of using the default latest time, PB will recognize whether you have previously executed an ID stitcher for that time and will reuse that table instead of generating it again.
Therefore, you can execute a command similar to: pb run --begin_time 2023-06-02T12:00:00.0Z --end_time 2023-06-03T12:00:00.0Z. Please note that:
The timestamp value must exactly match the one used in the earlier run for a specific ID stitcher to be reused.
If you have executed the ID stitcher in incremental mode and do not have an exact timestamp for reusing a specific ID stitcher, you can select any timestamp later than that of a non-deleted run; the subsequent stitching will then take less time.
To perform another ID stitching using PB, pick a timestamp (e.g. 1681542000) and stick to it while running the feature models. For example, the first time you execute pb run --begin_time 2023-06-02T12:00:00.0Z --end_time 2023-06-03T12:00:00.0Z, it will run the ID stitcher along with the feature models. In subsequent runs, it will reuse the ID stitcher and only run the feature models.
I am unable to create a feature table; I get this error: Material needs to be created but could not be: processing no result iterator: 001104 (42601): Uncaught exception of type 'STATEMENT_ERROR': 'SYS_W.FIRSTNAME' in select clause is neither an aggregate nor in the group by clause.
This error occurs when you use a window function (such as last_value or any_value) that requires a window clause, without defining one. Specify the window in the entity_var. For example:
- entity_var:
    name: email
    select: LAST_VALUE(email)
    from: inputs/rsIdentifies
    window:
      order_by:
        - timestamp desc
I have multiple models in my project. Can I run a single model?
Set materialization to disabled for any model you don’t want to run. In your specifications YAML file, for that particular model:
materialization:
  enable_status: disabled
Is it possible to create a feature out of an identifier? For example, I have an RS user_main_id with two of our user_ids stitched to it. Only one of the user_ids has a purchase under it. Is it possible to show that user_id in the feature table for this particular user_main_id?
If you know which input/warehouse table served as the source for that particular ID type, you can create features from any input and also apply a where clause within the entity_var. For example, you could create an aggregate array of user_ids from the purchase history table where total_price > 0 (excluding refunds, say). Or, if you have an LTV table with user_ids, you could exclude LTV < 0. A sketch follows below.
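A minimal sketch of such an entity_var (the input name purchase_history and the column names are illustrative):
- entity_var:
    name: purchasing_user_ids
    # aggregate the user_ids that have at least one real purchase
    select: array_agg(user_id)
    from: inputs/purchase_history
    where: total_price > 0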
API
How can I make Profiles work with the API?
You need a successful run that is not past the retention period. Then, toggle the API on in Settings. After that, a Profiles run will sync automatically.
I am getting an error when trying to enable the API in my instance for a custom project hosted on GitHub.
For Git projects, you need to explicitly add which IDs should be served in the project, in pb_project.yaml:
entities:
  - name: user
    serve_traits:
      - id_served: user_id
      - id_served: anonymous_id
      - id_served: email
      - id_served: cart_token
      - id_served: user_main_id
UI (Web App)
When I navigate to the screen, I do not see any content, only a blank loading screen.
That’s because you do not have sufficient access. Enabling Edit access should help.
When trying to fetch data for a lib project, the data/columns come up blank.
You’ll need to sync data from a source to a destination. If data is synced from the source you are using, rather than read from pre-existing tables in the destination, these missing column/data issues shouldn’t happen.
Despite being an admin, I am not able to see the Unify tab on the web app.
Disable any ad blockers in the web browser.
ML / Python Models
Despite deleting the WhtGitCache folder and adding keys to the siteconfig, I am getting this error:
Error: loading project: populating dependencies for project:base_features, model: churn_30_days_model: getting creator recipe while trying to get ProjectFolder: fetching git folder for git@github.com:rudderlabs/rudderstack-profiles-classifier.git: running git plain clone: repository not found
If your token is valid, replace git@github.com:rudderlabs/rudderstack-profiles-classifier.git with https://github.com/rudderlabs/rudderstack-profiles-classifier.git in the profile-ml file.
YAML
Are there any best practices I should follow when writing YAML?
Keep these points in mind, otherwise you may get an error:
Use spaces instead of tabs.
Always use proper casing, e.g., id_stitching and not id_Stitching.
Make sure that the source table you are referring to exists in the DW and that data has been loaded into it.
If you’re pasting table names from your Snowflake console, remove the double quotes in the inputs.yaml file.
Check that your syntax matches the sample code.
Indentation is meaningful in YAML, so make sure the spacing matches the levels used in the sample files.
You may also check the YAML 101 Guide.
My YAML has many features. How do I debug step by step? How do I run up to a particular feature or feature/macro/tablevar?
The pb run command has a parameter --model_args in the format modelName:argType:argName. It allows you to run up to a specific feature/tablevar. For example:
$ pb run -p samples/attribution --model_args domain_profile:breakpoint:blacklistFlag
Note that this is only applicable to versions prior to v0.9.0.
When referencing another entity_var in a macro, can I use double quotes?
You can use the escape character, e.g.:
- entity_var:
    name: days_since_last_seen
    select: "{{macro_datediff('{{user.Var(\"max_timestamp_bw_tracks_pages\")}}')}}"
Also, if you have a case statement, you may do it like this: select: CASE WHEN {{user.Var("max_timestamp_tracks")}}>={{user.Var("max_timestamp_pages")}} THEN {{user.Var("max_timestamp_tracks")}} ELSE {{user.Var("max_timestamp_pages")}} END
Are there any tools available for editing YAML?
If you are looking for YAML tools, a YAML-aware editor or linter (most code editors offer YAML plugins) helps catch indentation and syntax errors early.
Control Access
I have two separate roles: one to read from the input tables and another to write to the output tables. How should the roles be defined?
You need to create an additional role that is a union of the two. PB runs need to read the input tables and write results back to a schema in the warehouse, and each run is executed using a single role, specified in the matching connection section of the site config. In terms of security, it is best to create a new role that has read access to all relevant inputs and write access to the output schema. The alternative is to reuse an existing role that has at least those permissions.
How do I test whether the role in use has sufficient privileges on the warehouse objects needed to run the project?
You can use the pb validate access command to validate the role’s access privileges on all the input/output objects. See the validate section of the CLI Reference for more information.
Setup & Installation
I installed Python3, yet when I install and execute pb, it doesn’t return anything on screen.
Restart your Terminal/Shell/PowerShell and try again. If the issue persists, restart your machine.
I am an existing user who updated to the new version, and now I am unable to use the tool. On Windows, I am getting the error: 'pb' is not recognized as an internal or external command, operable program or batch file.
Execute the following commands to do a fresh install:
pip3 uninstall profiles-rudderstack-bin (in case you’re on version 0.8.0 or above)
pip3 uninstall profiles-rudderstack
pip3 install profiles-rudderstack --no-cache-dir
I am unable to install; I am getting ERROR: Package 'profiles-rudderstack' requires a different Python: 3.7.10 not in '>=3.8, <=3.10'
Update your Python3 to a version greater than or equal to 3.8 and less than or equal to 3.10.
I am unable to install via pip3 install profiles-rudderstack even though I have Python installed.
First, make sure that Python3 is correctly installed. You can also try substituting pip3 with pip in the install command.
If that doesn’t work, it’s likely that Python3 is only accessible from a local directory; navigate to that directory and try the install command again.
After installation, PB should be accessible from anywhere.
Validate that you’re able to access the path using which pb.
You may also execute echo $PATH to view the current path settings.
If it doesn’t show the path, find out where PB is installed using pip3 show profiles-rudderstack. This command displays the files associated with the application, including the location where it was installed. Navigate to that directory.
Then navigate to the /bin subdirectory and execute ls to confirm that pb is present there.
To add the location where PB was installed via pip3 to your system’s PATH variable, use export PATH=$PATH:<path_to_application>. This makes it accessible from any directory. Note that the path should be absolute, not relative to the current working directory.
If it is still not possible to install the application, you can install it manually. Contact us for the executable file and download it to your machine. Then follow these steps:
On UNIX/macOS:
If the directory does not exist, create it: sudo mkdir /usr/local/rudderstack
Move the downloaded file to that directory: sudo mv <name_of_downloaded_file> /usr/local/rudderstack/pb
Give executable permission to the file: chmod +x /usr/local/rudderstack/pb
(macOS) In Finder, navigate to the directory /usr/local/rudderstack by clicking Go: Go To Folder. Ctrl+Click on pb and select “Open” to allow running it from Terminal.
Symlink it to a file named pb in /usr/local/bin so that the command can be located from the PATH. Create the file if it does not exist: sudo touch /usr/local/bin/pb. Then execute sudo ln -sf /usr/local/rudderstack/pb /usr/local/bin/pb
Verify the installation by running pb in Terminal. If you get the error command not found: pb, check whether /usr/local/bin is in PATH by executing echo $PATH. If not, add /usr/local/bin to PATH.
On Windows:
If the Windows firewall prompts you after downloading, proceed with Run Anyway.
Rename the executable to pb.
Move the file to a safe directory such as C:\Program Files\Rudderstack. Create the directory if it is not present.
Add the path of the pb.exe file to the environment variables.
Verify the installation by running pb in the command prompt.
When I try to install using pip3, I get the error message: Requirement already satisfied.
Uninstall (pip3 uninstall profiles-rudderstack) and then install the app again (pip3 install profiles-rudderstack).
Note that this won’t remove your existing data, such as models and siteconfig files.
Despite successful installation, I am unable to run the program pb.
You will have to add the installation location to your PATH so that pb can be found there.