Schematized Snowflake Pipeline
Customers on an Enterprise or Growth plan can access Data Pipeline as an add-on package. See our pricing page for more details.
This guide describes how Mixpanel data is exported into a Snowflake dataset. Create a pipeline to export your Mixpanel data into Snowflake; once an export job is scheduled, Mixpanel exports data to Snowflake on a recurring basis.
Design
Mixpanel exports data into customer-managed Snowflake. Mixpanel-hosted Snowflake (Data Sharing) is no longer supported for new pipelines. Only password-based authentication is supported via the API; key-pair authentication is not available.
Mixpanel stages exported data in a GCS bucket before loading it into your Snowflake warehouse. You need to set up a role, storage integration, and user in your Snowflake account before creating a pipeline.
Set Export Permissions
Step 1: Create a Role and Grant Permissions
Create a role (for example, MIXPANEL_EXPORT_ROLE) and grant it access to your database, schema, and warehouse. Replace <database name>, <schema name>, and <warehouse name> with the actual names.
CREATE ROLE MIXPANEL_EXPORT_ROLE;
GRANT ALL ON DATABASE <database name> TO ROLE MIXPANEL_EXPORT_ROLE;
GRANT ALL ON SCHEMA <database name>.<schema name> TO ROLE MIXPANEL_EXPORT_ROLE;
GRANT USAGE ON WAREHOUSE <warehouse name> TO ROLE MIXPANEL_EXPORT_ROLE;
GRANT OPERATE ON WAREHOUSE <warehouse name> TO ROLE MIXPANEL_EXPORT_ROLE;
GRANT MONITOR ON WAREHOUSE <warehouse name> TO ROLE MIXPANEL_EXPORT_ROLE;

Step 2: Create a Storage Integration
Mixpanel stages exported data in a GCS bucket (gcs://mixpanel-export-pipelines-<project-id>) before loading it into your warehouse. This bucket is created and managed by Mixpanel - you do not need to create it yourself. To allow Mixpanel to load data from this bucket into your Snowflake warehouse, create a GCS storage integration and grant it to the role. Replace <project-id> with your Mixpanel project ID.
CREATE STORAGE INTEGRATION MIXPANEL_EXPORT_STORAGE_INTEGRATION
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'GCS'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('gcs://mixpanel-export-pipelines-<project-id>');
GRANT USAGE ON INTEGRATION MIXPANEL_EXPORT_STORAGE_INTEGRATION TO ROLE MIXPANEL_EXPORT_ROLE;

Step 3: Authenticate the User
Create a user with a password and grant the role to the user. If you already have a user, set its password with ALTER USER and grant it the role.
CREATE USER MIXPANEL_EXPORT_USER PASSWORD='<password you provided>' DEFAULT_ROLE=MIXPANEL_EXPORT_ROLE;
ALTER USER MIXPANEL_EXPORT_USER SET PASSWORD='<password you provided>';
GRANT ROLE MIXPANEL_EXPORT_ROLE TO USER MIXPANEL_EXPORT_USER;

For Snowflake export, we first load the data into a single-column raw table of VARIANT type. This data is a transformed version of the raw data stored in Mixpanel. Then, we create a view that exposes all properties as columns. The view name is the cleaned version of the event name, and the raw table name is always the view name suffixed with _raw. For example, for a signup event, you will have a signup_raw table and a signup view. Please see Schema for general information about the schemas in Schematized Export Pipelines.
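As an illustration of the raw-table/view relationship, a generated view might look roughly like the sketch below. Mixpanel creates these views for you; the property names here (distinct_id, time, mp_city) are assumptions for illustration only.

```sql
-- Hypothetical sketch: Mixpanel generates views like this automatically.
-- DATA is the single VARIANT column in the raw table.
CREATE VIEW signup AS
SELECT
  DATA:distinct_id::STRING AS distinct_id,
  DATA:time::NUMBER        AS time,
  DATA:mp_city::STRING     AS mp_city
FROM signup_raw;
```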
Partitioning
The data in the raw tables is clustered on the time column, converted to the project's timezone. To be exact, we use CLUSTER BY (TO_DATE(CONVERT_TIMEZONE('UTC', '<TIMEZONE>', TO_TIMESTAMP(DATA:time::NUMBER)))), where <TIMEZONE> is the Mixpanel project's timezone.
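In DDL form, the clustering looks roughly like the sketch below. Note that Mixpanel creates and manages the raw tables; this is only an illustration, with America/Los_Angeles standing in for your project timezone.

```sql
-- Hypothetical sketch of how a raw table is clustered.
CREATE TABLE signup_raw (DATA VARIANT)
CLUSTER BY (TO_DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles',
                                     TO_TIMESTAMP(DATA:time::NUMBER))));
```

Because the cluster key is the event date in project timezone, queries that filter on that derived date can prune micro-partitions efficiently.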
Queries
Mixpanel recommends you place all events into a single table to make querying easier. To get more information about the table schemas, please see Schema.
You can query these tables like any other Snowflake tables: combine data from different tables, filter and aggregate events, and perform calculations on exported properties.
Snowflake supports a VARIANT type that can store JSON objects and arrays. Mixpanel exposes array and object top-level properties as VARIANT columns in the view.
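For example, such a VARIANT column can be traversed with Snowflake's path syntax and cast with `::`. The column and property names below (cart, items, sku, purchase) are assumptions for illustration:

```sql
-- "cart" is a VARIANT column in the view holding an object with an "items" array.
SELECT cart:items[0]:sku::STRING AS first_sku
FROM mixpanel.purchase
LIMIT 10;
```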
Here is an example of how you can query the raw table when using one table for all the events:
SELECT count(*)
FROM mixpanel.mp_master_event_raw
WHERE data:mp_event_name::string = 'Signup';

Here is an example of how you can query the view when using one table for all the events:
SELECT count(*)
FROM mixpanel.mp_master_event
WHERE mp_event_name = 'Signup';

Here is an example of how you can query the raw table when using multiple tables for the events:
SELECT count(*)
FROM mixpanel.signup_raw
WHERE data:distinct_id::string = '1';

Here is an example of how you can query the view when using multiple tables for the events:
SELECT count(*)
FROM mixpanel.signup
WHERE distinct_id = '1';

Getting the number of events in each day
You will need this if you suspect the export process is not exporting all the events you expect. As the time column in the tables is in UTC, you first need to convert it to your Mixpanel project's timezone, and then count the events for each day. The following query does that for you.
SELECT TO_DATE(CONVERT_TIMEZONE('UTC','<PROJECT_TIMEZONE>', time)) as ttime, count(*)
FROM "<DB_NAME>"."PUBLIC"."MP_MASTER_EVENT"
WHERE ttime>=TO_DATE('2021-12-03') AND ttime<=TO_DATE('2021-12-07')
GROUP BY ttime;

This example returns the number of events in each day, in project timezone, for a monoschema export pipeline and an example date range. Replace <PROJECT_TIMEZONE> and <DB_NAME> with your Mixpanel project timezone and your Snowflake database name. You can adjust the query for multischema by using the appropriate table name.
Querying the identity mapping table
When using the ID mappings table, use the resolved distinct_id, whenever present, in place of the non-resolved distinct_id. If there is no resolved distinct_id, fall back to the distinct_id from the existing people or events table.
Below is an example SQL query that references the ID mapping table to count the number of events in a specific date range for each unique user in San Francisco:
SELECT
CASE
WHEN m.resolved_distinct_id IS NOT NULL THEN m.resolved_distinct_id
WHEN m.resolved_distinct_id IS NULL THEN e.distinct_id
END as resolved_distinct_id,
COUNT(*) AS count
FROM "<DB_NAME>"."PUBLIC"."MP_MASTER_EVENT" e FULL OUTER JOIN "<DB_NAME>"."PUBLIC"."MP_IDENTITY_MAPPINGS_DATA" m
ON e.distinct_id = m.distinct_id
AND e.mp_city = 'San Francisco'
AND TO_DATE(CONVERT_TIMEZONE('UTC','<PROJECT_TIMEZONE>', e.time)) >= TO_DATE('2020-04-01')
AND TO_DATE(CONVERT_TIMEZONE('UTC','<PROJECT_TIMEZONE>', e.time)) <= TO_DATE('2020-04-03')
GROUP BY resolved_distinct_id
LIMIT 100;

Counting the number of times a user has performed a specific behavior is also possible by adding more filters on event properties. You can adjust the query for multischema by using the appropriate table name.
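For instance, a per-user count of a single event type could be sketched as below. This is an illustrative variant, not an official query; the COALESCE fallback implements the resolution rule described above, and 'Signup' is an assumed event name.

```sql
-- Count "Signup" events per resolved user.
SELECT
  COALESCE(m.resolved_distinct_id, e.distinct_id) AS resolved_distinct_id,
  COUNT(*) AS signups
FROM "<DB_NAME>"."PUBLIC"."MP_MASTER_EVENT" e
LEFT JOIN "<DB_NAME>"."PUBLIC"."MP_IDENTITY_MAPPINGS_DATA" m
  ON e.distinct_id = m.distinct_id
WHERE e.mp_event_name = 'Signup'
GROUP BY resolved_distinct_id;
```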