Redshift
Amazon Redshift is a fast, fully managed, scalable, and cost-effective cloud data warehouse that enables you to analyze exabytes of data and run complex analytical queries for business insights. With Redshift, you can run and scale analytics on your data within seconds without having to manage your data warehouse infrastructure. Fivetran supports both Redshift provisioned and Redshift Serverless data warehouses as destinations for lower time to value from your data.
Setup guide
Follow our step-by-step Redshift setup guide to connect your Redshift data warehouse with Fivetran.
Type transformation mapping
The data types in your Redshift data warehouse follow Fivetran's standard data type storage.
We use the following data type conversions:
Fivetran Data Type | Destination Data Type | Notes |
---|---|---|
BOOLEAN | BOOLEAN | |
SHORT | SMALLINT | |
INT | INTEGER | |
LONG | BIGINT | |
BIGDECIMAL | DECIMAL | |
FLOAT | REAL | |
DOUBLE | DOUBLE PRECISION | |
LOCALDATE | DATE | |
LOCALDATETIME | TIMESTAMP | |
INSTANT | TIMESTAMPTZ | For source connectors created before August 30, 2021, we convert INSTANT to TIMESTAMP. |
STRING | VARCHAR or TEXT | VARCHAR if a byte length is present, else TEXT. |
JSON | VARCHAR | |
BINARY | VARBYTE | For source connectors created before September 1, 2022, we convert BINARY to VARCHAR as the Base64 encoding of the raw bytes. |
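As an illustration of these mappings, a table created in Redshift might look roughly like the following. This is a hypothetical example, not output from an actual connector; column names and sizes are made up.

```sql
-- Hypothetical destination table illustrating the type mappings above.
create table my_schema.example (
    id BIGINT,              -- LONG
    is_active BOOLEAN,      -- BOOLEAN
    amount DECIMAL(18, 2),  -- BIGDECIMAL
    score DOUBLE PRECISION, -- DOUBLE
    created_at TIMESTAMPTZ, -- INSTANT
    name VARCHAR(256),      -- STRING with a byte length
    payload VARCHAR(MAX)    -- JSON
);
```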
Optimize Redshift
Add sort and distribution keys
Fivetran automatically infers the primary, foreign, sort, and distribution (dist) keys whenever possible. The data from cloud application connectors should automatically have all four keys. For database connectors, we add the primary and foreign keys (if they are present in the source), but not the dist and sort keys.
You can add Redshift's sort key and dist key to optimize query performance and improve sync speeds. For example, if you set the primary key as a sort key, it will result in the following:
- Faster delete operations of the common records
- Improved sync speeds during upsert operations
- Quicker analysis queries
The following steps apply to all data sources. In the example below, we'll use the EVENT table from the Snowplow connector. Suppose you want to apply sort and dist keys to the snowplow.event table. You have to create a new table that's the same as the original table, but with sort and dist keys:
Pause the connector (Snowplow in our example) from the Fivetran dashboard. On the connector dashboard, set the connector status toggle to PAUSED, ensuring that a sync isn't taking place.
Log in to your database and get the schema of the original table using the \d snowplow.event command. This command renders the schema of the table. For example:

Column | Type |
---|---|
event_id | TEXT |
app_id | TEXT |
br_colordepth | INTEGER |
br_cookies | BOOLEAN |
br_features_director | BOOLEAN |
br_features_flash | BOOLEAN |
br_features_gears | BOOLEAN |
br_features_java | BOOLEAN |
br_features_pdf | BOOLEAN |
br_features_quicktime | BOOLEAN |
br_features_realplayer | BOOLEAN |

Write a create table command to re-create the table. Add the PRIMARY KEY and FOREIGN KEY constraints:

```sql
create table snowplow.event_copy (
    app_id CHARACTER VARYING(256) primary key,
    br_colordepth INTEGER references another_table(id),
    br_cookies BOOLEAN,
    br_features_director BOOLEAN,
    br_features_flash BOOLEAN,
    br_features_gears BOOLEAN,
    br_features_java BOOLEAN,
    br_features_pdf BOOLEAN,
    br_features_quicktime BOOLEAN,
    br_features_realplayer BOOLEAN
);
```
Get the compression encodings using the analyze compression command. For example:

```sql
analyze compression snowplow.event;
```

The command will run for a little while and produce output like the following:

Table | Column | Encoding |
---|---|---|
event | app_id | lzo |
event | br_colordepth | lzo |
event | br_cookies | raw |
event | br_features_director | raw |
event | br_features_flash | raw |
event | br_features_gears | raw |
event | br_features_java | raw |
event | br_features_pdf | raw |
event | br_features_quicktime | raw |
event | br_features_realplayer | raw |

Update the create table command with the compression encodings and the sort and dist keys. Execute the create table command:

```sql
create table snowplow.event_copy (
    event_id CHARACTER VARYING(256) primary key distkey encode lzo, -- we join custom contexts on this, so it should be the dist key
    dvce_sent_tstamp TIMESTAMP WITHOUT TIME ZONE sortkey encode lzo, -- we often select on this, so it should be the sort key
    dvce_created_tstamp TIMESTAMP WITHOUT TIME ZONE encode lzo,
    app_id CHARACTER VARYING(256) encode lzo,
    br_colordepth INTEGER encode lzo references another_table(id),
    br_cookies BOOLEAN,
    br_features_director BOOLEAN,
    br_features_flash BOOLEAN,
    br_features_gears BOOLEAN,
    br_features_java BOOLEAN,
    br_features_pdf BOOLEAN,
    br_features_quicktime BOOLEAN,
    br_features_realplayer BOOLEAN
);
```
Copy the data from the original table to the copy table, drop the original table, and rename the copy:

```sql
begin transaction;
insert into snowplow.event_copy (select * from snowplow.event);
drop table snowplow.event;
alter table snowplow.event_copy rename to event;
commit transaction;
```
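Optionally, before dropping the original table, confirm that the row counts match. A quick sanity check; this query is illustrative and not part of the original procedure:

```sql
-- Row counts should match before snowplow.event is dropped.
select
    (select count(*) from snowplow.event) as original_rows,
    (select count(*) from snowplow.event_copy) as copied_rows;
```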
Verify that the sort and dist keys are in place:

```sql
SET search_path to 'snowplow';

SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'event';
```

column | type | encoding | distkey | sortkey |
---|---|---|---|---|
app_id | TEXT | lzo | f | 0 |
br_colordepth | INTEGER | lzo | f | 0 |
br_cookies | BOOLEAN | none | f | 0 |
br_features_director | BOOLEAN | none | f | 0 |
br_features_flash | BOOLEAN | none | f | 0 |
br_features_gears | BOOLEAN | none | f | 0 |
br_features_java | BOOLEAN | none | f | 0 |
br_features_pdf | BOOLEAN | none | f | 0 |
br_features_quicktime | BOOLEAN | none | f | 0 |
br_features_realplayer | BOOLEAN | none | f | 0 |
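As an alternative to rebuilding the table, newer Redshift releases support changing the distribution and sort keys in place with ALTER TABLE. A minimal sketch, assuming the snowplow.event table from the example above and that your cluster supports these clauses:

```sql
-- Switch to key distribution on event_id and sort on dvce_sent_tstamp.
ALTER TABLE snowplow.event ALTER DISTSTYLE KEY DISTKEY event_id;
ALTER TABLE snowplow.event ALTER SORTKEY (dvce_sent_tstamp);
```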
Shrink large schemas
You can check if you have a large Redshift schema using the following query:
```sql
SELECT
  schema,
  sum(size) AS size_mb,
  sum(tbl_rows) AS rows,
  sum(size) / sum(tbl_rows) AS avg_row_size_mb
FROM svv_table_info
GROUP BY 1
ORDER BY 2 DESC;
```
If you used the VACUUM command but still observe large (size_mb > 10,000) and inefficient (avg_row_size_mb > 0.010) schemas, you should fix these schemas by making a deep copy of every table in the schema.
NOTE: The following process is time-consuming, but it should significantly reduce the size of your tables. Once you make a deep copy, your schema sizes will not exceed the standard levels. As an added benefit, this process ensures that all of your columns have the best compression encoding, making your queries execute faster.
Use Amazon's columnar compression utility to generate SQL that will perform the deep copy. The exact command is:

```sh
./analyze-schema-compression.py --db [your database] --db-user [your user] --db-host [your host] --db-port [your port, usually 5439] --analyze-schema [schema to compress] --analyze-table [table to compress] --comprows 100000 --slot-count 1 --force true --drop-old-data true --output-file deep_copy.sql
```
The command generates a file named deep_copy.sql that contains a series of SQL statements. Run the statements in deep_copy.sql to perform the deep copy and drop the old data. The statements take a long time to execute. On completion, any views that reference these tables are dropped; you have to re-create them.
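For reference, each table's deep copy in the generated file boils down to something like the following. This is a simplified sketch with a hypothetical my_schema.my_table; the actual generated statements also apply recomputed column encodings:

```sql
-- Simplified deep copy of a single table (hypothetical names).
begin transaction;
create table my_schema.my_table_new (like my_schema.my_table);
insert into my_schema.my_table_new (select * from my_schema.my_table);
drop table my_schema.my_table;
alter table my_schema.my_table_new rename to my_table;
commit transaction;
```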
NOTE: The column compression script makes you the owner of all the tables that get compressed.
Assign ownership of the tables back to Fivetran. For each table in the schema, run:

```sql
ALTER TABLE [schema].[table] OWNER TO fivetran;
```

NOTE: If your Fivetran user is named something other than fivetran, replace the default fivetran value.
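If the schema contains many tables, you can have Redshift generate these statements for you instead of writing each one by hand. A sketch, assuming the schema is named snowplow and the Fivetran user is fivetran:

```sql
-- Generates one ALTER TABLE ... OWNER TO statement per table in the schema.
SELECT 'ALTER TABLE ' || schemaname || '.' || tablename || ' OWNER TO fivetran;'
FROM pg_tables
WHERE schemaname = 'snowplow';
```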
Connect as master or limited user
There are two ways to allow Fivetran to access your Redshift data warehouse:
Connect as a Master user: Fivetran will have access to your data at all times. The benefit of this approach is that it's effortless to set up; no additional configuration is required. The drawback is that this user cannot be restricted, which means you cannot apply workload management settings to it. Workload management lets you flexibly manage priorities within workloads so that short, fast-running queries don't get stuck in queues behind long-running queries.
Connect as a Limited User: Fivetran will adhere to the privileges that you set. To do this, you'll need to create a separate, specific Fivetran user in your Redshift data warehouse. This user must have CREATE permissions. The benefit of this approach is that Fivetran only has access to the schemas it delivers to in Redshift and can read and write to those schemas exclusively.
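A minimal sketch of creating such a user, assuming a hypothetical database named mydb; the exact grants for your setup may differ:

```sql
-- Hypothetical limited user for Fivetran; replace the password placeholder.
CREATE USER fivetran PASSWORD '<strong-password>';
GRANT CREATE ON DATABASE mydb TO fivetran;
```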
See our setup instructions for more information.
Migrate destinations
To migrate a Fivetran-configured Redshift database to another database on a different cluster, first decide whether you want to start from scratch or migrate the data that Fivetran has already loaded into the old cluster.
If the data sources are fast to re-sync, and the amount of data in the old database is relatively small, then we suggest starting over and re-syncing everything:
- Pause all your connectors from the Fivetran dashboard. On the connector dashboard, set the connector status toggle to PAUSED.
- Update the destination connection settings to specify the new cluster details. In the destination dashboard, click Edit connection details. Enter the new database name and provide the authentication details.
- Enable your connectors. On the connector dashboard, set the connector status toggle to ENABLED.
- For all your connectors, go to the Details tab and click Resync All Historical Data.
If you have large amounts of data or your connectors are subject to strict API rate limits, a full re-sync will take a long time. In such cases, we suggest that you migrate the data from the old cluster to the new cluster using Redshift backups:
- Pause all your connectors from the Fivetran dashboard. On the connector dashboard, set the connector status toggle to PAUSED.
- Restore your new Redshift cluster from a backup of your old Redshift cluster. This is critical because Fivetran needs to have the same metadata available to continue syncing the data from the last sync point and not have to restart the entire sync again.
- Update the destination connection settings to specify the new cluster details. In the destination dashboard, click Edit connection details. Enter the new database name and provide the authentication details.
- Enable your connectors. On the connector dashboard, set the connector status toggle to ENABLED.
Data load costs
Redshift does not charge you extra when Fivetran loads data into your data warehouse.
Table limitations
Amazon Redshift limits the maximum number of tables that you can create. If you exceed the maximum limit, Fivetran can't create any further tables in the destination. As a result, the integration will fail, causing errors or data loss. See Redshift's documentation for more information about the maximum table limits.
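To gauge how close you are to the limit, you can count the user tables currently in the cluster. A rough check using the pg_tables catalog view; the schemas you exclude may vary:

```sql
-- Counts user-defined tables across all non-system schemas.
SELECT count(*) AS user_table_count
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
```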
System column name management
Redshift does not allow you to create user-defined columns with the following names:
- oid
- tableoid
- xmin
- cmin
- xmax
- cmax
- ctid
To avoid naming conflicts with the Redshift system columns, we prefix these column names with # before writing them to the destination.
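For example, a source column named xmin would arrive in the destination as #xmin. Because the identifier starts with #, you need to quote it in queries; a small sketch using a hypothetical my_schema.my_table:

```sql
-- Prefixed column names must be quoted as delimited identifiers.
SELECT "#xmin"
FROM my_schema.my_table;
```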