For access to Heap Connect, contact your Customer Success Manager or sales@heap.io.

Heap Connect lets you directly access your Heap data using SQL in your Redshift warehouse. You can run ad-hoc analyses, connect to BI tools such as Tableau, or join the raw Heap data with your own internal data sources.

Best of all, we automatically keep the data up-to-date and optimize its performance for you. Define an event within the Heap interface, and in just a few hours, you’ll be able to query it retroactively in a clean SQL format.

Note: Currently, we do not sync initial user properties to the users table in Redshift.

Setup

The Redshift integration does not support enhanced VPC endpoints

To connect Heap to Redshift, proceed as follows:

1. Create Heap User

To set up Heap to sync to your Redshift cluster, you’ll need to create a user for Heap within the database of choice who at a minimum has CREATE and USAGE privileges. If possible, we do prefer a superuser, as it will help with performance optimization.

We recommend the following cluster settings based on app traffic.

Monthly Pageviews (estimating from sessions = # of monthly sessions x 10)	Recommended minimum Redshift cluster size	Maximum # of tables to reliably load in under 4 hours
300-600M	64 x `dc2.large` nodes	20
150-300M	32 x `dc2.large` nodes	30
< 75M	4 x `dc2.large` nodes	35
75-150M	8 x `dc2.large` nodes	35

We require access to system pg_table_def. Optimally, we’d also like to have access to stv_partitions to check the remaining disk size (although this is not required).

2. Update Table Permissions

Once the Heap user is created, you’ll need to update permissions for users that access your Redshift cluster by updating the default privileges for tables created by our Heap user in future syncs. Please run the following command as any superuser after creating the Heap user: ALTER DEFAULT PRIVILEGES FOR USER <heap_username> GRANT SELECT ON TABLES TO <target_user/group>

Please note that you’ll need to specify any other permissions you’d like to provide your target users or target group if users require other permissions beyond SELECT for the Heap data.

3. Allowlist Heap IP Address

Additionally, you’ll need to make sure that you’ve allowlisted connections from the following IP address: 52.20.48.42

4. Fill in your credentials on the Redshift setup page in Heap

If you have been granted access to Heap Connect and you do not see the setup page, please reach out via the Get support page. You need to know the following information about your Redshift cluster. This information is available on the Redshift configuration page:

Endpoint (host:port format, port is generally 5439)
Username
Password
Database Name
Schema Name (default to “main_production”)

You can learn more about how the data will be structured upon sync by viewing our docs on data syncing.

Note: It may take up to 24 hours for us to prepare your account for syncing. This is noted in the ‘sync pending’ state.

Note: The schema name defaults to snake-cased _. The name can be changed, but do not specify an existing schema in your database. The Heap schema must be a separate, new schema, that will be created upon initial sync, and no tables or views can be manually added to the Heap schema.

You can sync Defined Properties to this warehouse.

SSH Tunneling (Optional)

For the strongest encryption between Heap and your database, you can create a SSH tunnel to a tunnel server to connect Heap to your Redshift warehouse.

Allowlist IP Address

You’ll need to make sure that you’ve allowlisted connections from the following IP address: 52.20.48.42

Prepare Tunnel Host

You will need to prepare your host by creating a heap user and adding the Heap public key (see below) to the heap .ssh/authorized_keys file, as follows:

Create heap group: sudo groupadd heap
Create heap user and its home directory: sudo useradd -m -g heap heap
Switch to heap user: sudo su - heap
Create the .ssh directory: mkdir ~/.ssh
Set permissions: chmod 700 ~/.ssh
Change to the .ssh directory: cd ~/.ssh
Create the authorized_keys file: touch authorized_keys
Set permissions: chmod 600 authorized_keys

Heap Public Key

The following key should be used as the Heap public key:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEe2sdAe1c+6fnoVMkbLCnJY56IPUZMsbTnch5MmmwAOwr/DDMaamQDhXNoZMl5OwF4dTSoMpG5z8qwGfrH5AGINfSaUvGyIRCNHatTYHd6pc81+ztLYkk82/3Z2sgrrfV/k3v+Yt/NMy0Esdo+x+kTr05XhvhbQ02BEm2TXwmLzKcNbQfkO/X16npkrQmAu0kqpo438hQgT/gmQv0KDGIgtUwfur7QmqrgwlaAUY7Y/zHJtFio1nglxztcsjT/iMTxhYzXaGR8yxdqmsJf21oscfQy7rvVJU7ePImvPgXfhG+zZTcdIGh+673BIP7Av6XMc0Mb15lMGrhq884SdDp support@heap.io

Using your favorite text editor, add the ssh key provided by Heap to authorized_keys file. The key must be all on one line.

Provide SSH Tunnel Metadata

Along with the basic configuration details (see below), you should provide to Heap the following information about the SSH tunnel configuration:

IP address or DNS name of the tunnel server as seen from the public internet
SSH port of the tunnel server as seen from the public internet
Username on the tunnel server for the SSH connection (the standard is ‘heap’)

Database Configuration

Once completed, you’ll also need to ensure that you follow the database configuration steps and provide the requested metadata in our Setup section.

What to expect from your first sync

Once you have successfully configured Redshift, you can see your sync status on the Redshift page in Heap.

How fast you can expect the first sync to occur

The first sync can take up to 48 hours, depending on how much data you have. Subsequent syncs happen every 24 hours. Click the Configuration (gear) icon on the Redshift page to select a sync start time.

Reach out via the Get support page if you have any questions or concerns about your first sync.

What tables you will see downstream

The Redshift connection syncs the Heap data to your downstream connection. By default it syncs 3 data tables (pageviews, users, sessions) and some additional metadata tables like user_migrations and sync_info tables.

How to sync more events downstream

Go to Data > Labeled events and click on the event you want to sync. Under Connect, select the toggle for Redshift

2. On the pop-up that appears, select either All Time (since the day Heap was installed) or Date To Now. If you click Date To Now, a small calendar will appear where you can select a starting date.

3. The toggle you selected will now be green and read “Pending initial sync to Redshift” when you hover over it. The sync for this event doesn’t happen immediately; it will happen at your next scheduled sync.

Performance Optimization

Under the hood, we tune your Redshift schema to ensure that it is as performant as possible. In particular, this means:

The sort key on each table is the time field. This ensures that recent data gets accessed more quickly than older data.
The distribution key on each table is the user_id field. This collocates user and event data for a given user_id on the same node slice, which ensures that user level joins between the users table and event tables are as performant as possible.
Automatic compression is enabled. This ensures that columnar data is optimally encoded, which reduces disk I/O and improves query performance.

To learn more about concepts such as column compression and node distribution, take a look at the Amazon Redshift docs.

Redshift Permissions Troubleshooting

If you didn’t apply default privileges as part of your Connection Requirements, some users may lose permissions on some tables when:

An event definition changes (any change except the name)
A table sync is toggled off and on again (in the Event view or the Heap Connect view)
Weekly maintenance tasks are run (VACUUM, etc)

Redshift doesn’t currently provide an easy way to re-apply table-level permissions in these cases, so we recommend applying schema-level permissions as part of your setup.

Set Permissions For All Users

Typically, users will have one of two permission sets: ALL and SELECT. You can read more about Redshift permissions in Redshift’s documentation.

To correctly set ALL permissions for your users, run the following three commands on your cluster in the database containing Heap’s schema as a superuser:

GRANT ALL ON SCHEMA {schema_name} TO {comma-separated list of users};
GRANT ALL on ALL TABLES IN SCHEMA {schema_name} TO {comma-separated list of users};
ALTER DEFAULT PRIVILEGES FOR USER {heap_user} GRANT ALL ON TABLES TO {comma-separated list of users};

To correctly set SELECT permissions for your users, run the following three commands on your cluster in the database containing Heap’s schema:

GRANT USAGE ON SCHEMA {schema_name} TO {comma-separated list of users};
GRANT SELECT on ALL TABLES IN SCHEMA {schema_name} TO {comma-separated list of users};
ALTER DEFAULT PRIVILEGES FOR USER {heap_user} GRANT SELECT ON TABLES TO {comma-separated list of users};

This should ensure that the correct permissions persist when we drop and recreate tables. Please contact us via the Get support page if you have any questions.

Table of Contents

Was this article helpful?

Redshift Integration

In this article you'll learn:

Setup

1. Create Heap User

2. Update Table Permissions

3. Allowlist Heap IP Address

4. Fill in your credentials on the Redshift setup page in Heap

SSH Tunneling (Optional)

Allowlist IP Address

Prepare Tunnel Host

Heap Public Key

Provide SSH Tunnel Metadata

Database Configuration

What to expect from your first sync

How fast you can expect the first sync to occur

What tables you will see downstream

How to sync more events downstream

Performance Optimization

Redshift Permissions Troubleshooting

Set Permissions For All Users