The open-source tool is one of the most powerful databases on the planet, with the ability to handle sophisticated analytical workloads and high levels of concurrency. That makes PostgreSQL (also called Postgres) a popular DB for scientific research and AI/ML projects. It’s also a popular production database for data-driven companies in every industry.
But no database is perfect. Postgres’s biggest limiting factor is scale: As a row-based RBDMS, it slows to a crawl as your database grows. That’s where Amazon Redshift comes in.
Redshift is the top choice for cloud-based data warehouses. Its columnar structure makes it ideal for processing large volumes of data for analytics and insights. That’s why many companies choose to replicate their Postgres DBs to Redshift. In this guide, we’ll look at four ways how you can replicate data from PostgreSQL to Redshift. But first, let’s look at why you should replicate your data to Redshift.
Table of Contents
- Why PostgreSQL to Redshift?
- How to Replicate Data from PostgreSQL to Amazon Redshift
- Getting the most out of PostgreSQL and Redshift
Why PostgreSQL to Redshift?
PostgreSQL and Redshift serve two separate functions. For companies with mid to large volumes of data, we recommend using both tools. Here’s why:
Speed to Insight. PostgreSQL is highly customizable for sophisticated analytical projects. However, as a row-based database, it’s slow to deliver analytical insights on large volumes of data. Redshift is a Massively Parallel Processing machine, which allows it to deliver insights in a fraction of the time.
Scalability. Again, as a row-based database, PostgreSQL fails to scale as effectively as the columnar-based Redshift. Imagine you have a massive table with 100 columns and 10 million rows, but you only care about 5 columns. With Postgres, you have no control over the fact that you must fetch all 100 columns for all 10 million rows, because columns are a substructure of each row. This results in slow queries, especially during exploratory or optimization-discovery phases.
Reliability. Postgres is known for its ability to handle concurrency, but as your data load grows, you risk crashing your production database while running analytical workloads. Pairing Redshift with Postgres keeps your data safe and applications running.
How to Replicate Data from PostgreSQL to Amazon Redshift
There are numerous ways to move data from PostgreSQL to Redshift, but not all are created equal. Let’s look at four methods for replicating Postgres to Redshift:
COPY and Import
Change Data Capture with Write-Ahead Log (WAL)
Amazon Data Management System (DMS)
3rd-Party ETL (such as Integrate.io)
COPY and Import
COPY is one of the most powerful commands in Redshift. It’s also a popular tool in PostgreSQL.
The COPY and Import is a manual replication method, which means it’s not continuous. However, COPY is useful for one-time or infrequent replications to Redshift.
The benefit of the COPY method over other replication tools like pg_dump is that COPY only imports changes to the database, where pg_dump imports the entire database each time.
There are four steps to replicating Postgres to Redshift with COPY and Import:
1. COPY data from Postgres to local drive
The first step is to export data from Postgres to your local drive. To do this, use the COPY command in your psql terminal. COPY the data you want to replicate to a CSV file and export it to your local drive.
Psql has two different COPY commands: /copy and COPY.
/copy is a meta-command that allows you to copy to a client-side server. COPY, on the other hand, is used to copy files server-side. You’ll want to use COPY to export the CSV to your local drive.
The command looks like this:
COPY public.customers TO '/home/john/customers.csv' FORMAT CSV HEADER true;
When COPYing your data, be sure to specify CSV at the end of the command.
Here’s a short introduction video on using COPY to export Postgres data as CSV files.
2. Split data into multiple files
Next, split your single CSV file into multiple files using the split program on *nix like systems. This step is to prepare your data to upload into S3 and then into Redshift. Amazon highly recommends splitting your data into multiple files to take advantage of AWS’s parallel processing power. Make sure to split your CSV file into the same number of files as your Redshift cluster has slices.
For example, each ra3.4xlarge compute node has four slices, and each ra3.16xlarge compute node has 16 slices. If you have a cluster with two ra3.4xlarge compute notes, you’ll want to split your CSV file into eight separate files.
Also, make sure that each file is roughly the same size—between 1 MB and 1 GB after compression.
3. Upload files into S3
Next, load the CSV files into S3.
Here’s a copy of the instructions from AWS:
Create a bucket in Amazon S3.
- Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
- Click Create Bucket.
- In the Bucket Name box of the Create a Bucket dialog box, type a bucket name. The bucket name you choose must be unique among all existing bucket names in Amazon S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your organization. Bucket names must comply with certain rules. For more information, go to Bucket restrictions and limitations in the Amazon Simple Storage Service Developer Guide.
- Select a region. Create the bucket in the same region as your cluster. If your cluster is in the Oregon region, click Oregon.
- Click Create. When Amazon S3 successfully creates your bucket, the console displays your empty bucket in the Buckets panel.
Create a folder.
- Click the name of the new bucket.
- Click the Actions button, and click Create Folder in the drop-down list.
- Name the new folder load.
- Note: The bucket that you created is not in a sandbox. In this exercise, you add objects to a real bucket. You're charged a nominal amount for the time that you store the objects in the bucket. For more information about Amazon S3 pricing, go to the Amazon S3 pricing page.
Upload the data files to the new Amazon S3 bucket.
- Click the name of the data folder.
- In the Upload - Select Files wizard, click Add Files. A file selection dialog box opens.
- Select all of the files you downloaded and extracted, and then click Open.
- Click Start Upload.
4. COPY from S3 to Redshift
Finally, issue a COPY command in the Redshift query console to load the data from S3 into Redshift.
Here are the high-level steps to follow, with more information on AWS’s Developer Guide:
- Create the sample tables
- Run the COPY commands
- Vacuum and analyze the database
- Clean up your resources
A quick note on user credentials: The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3 bucket.
Drawbacks using COPY Command
Using COPY to replicate your Postgres data to Redshift is much more efficient than other methods like pg_dump. The problem is that the process is manual and not continuous. You will need to perform the COPY command each time you want to update your data in Redshift.
The next method solves your continuous replication problem.
Change Data Capture with WAL (Write Ahead Log)
In this architecture, the developer uses AWS DMS to capture WAL data. More on DMS in the next section. Source: Alex Furlan
PostgreSQL’s Write-Ahead Log (WAL) is a critical and versatile tool. Its primary function is to ensure data integrity. In the event of a database crash, you can restore your data using the log. Database changes are continuously updated in WAL, so you can restore the most updated version of the database at any time.
You can also use the WAL to replicate data to Redshift through Change Data Capture (CDC) method.
Change Data Capture (CDC) is a technique that captures changes made to data in PostgreSQL and applies it to the destination Redshift table. It’s similar to COPY in that it only imports changed data, not the entire database.
Unlike Incremental SELECT & COPY, however, CDC with WAL allows you to achieve true replication of Postgres to Redshift.
The WAL not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from Postgres are also deleted in Redshift.
Replicating to Redshift using WAL requires you to COPY the WAL log to S3 and then upload it to Redshift.
How to Replicate to Redshift Using CDC with WAL
When using the CDC with WAL method, you’re actually building a custom ETL application. You don’t have to do this from scratch; depending on your Postgres version, you have a couple of options:
For Postgres Version 10+, all you have to do is set “wal_level to logical” and use a suitable library that connects to your Postgres DB and grabs the changes as they happen. Here are three examples of libraries you can use, but googling for “postgres logical replication
- Python library: https://pypi.org/project/pg2kinesis/
- Ruby libraries on Github: https://github.com/topics/logical-replication
- Node js library: https://github.com/kagis/pgwire#logical-replication
For Postgres Versions prior to 10, you will need to install plugins to make this happen, one of which is wal2json. These plugins can be a pain to install — or in the case of some third-party database providers, impossible to install.
After you have set up your Postgres DB to stream out changes, you will need a custom script to load changes into S3 as they arrive. Then you’ll need to run a script that pushes COPY commands to import the data from S3 to Redshift.
Drawbacks of CDC with WAL
While CDC with WAL can give you the continuous replication to Redshift you need, it’s difficult to build and maintain. Things like schema changes on the source end will always be a headache if not handled properly in the custom scripts.
The final two methods are pre-built, all-in-one solutions for your replication challenge.
Amazon Data Migration System (DMS)
The two methods described above are all “free.” Of course, you pay for them with the time and expertise required to set up and manage those methods.
Amazon DMS is a paid service, and with it you get a simple, all-in-one migration tool.
Here are the high-level steps for using Amazon DMS
- Step 1: Create a replication instance using the AWS DMS console
- Step 2: Specify source and target endpoints
- Step 3: Create a task and migrate data
- Step 4: Monitor your task
- Step 5: Deleting AWS DMS components using the console
For more information, visit the AWS docs.
Drawbacks of Amazon DMS
While Amazon DMS is a great tool, its continuous replication tends to lag over time. That’s why many data-heavy companies prefer a dedicated third-party ETL like Integrate.io.
Integrate.io makes replication from PostgreSQL to Redshift simple and painless. As a third-party ETL tool, Integrate.io is a turnkey data migration solution, so you can get back to building the business.
It is so easy to set up that you can do it over lunch:
Enable Write-Ahead Logging (WAL) on PostgreSQL (this is the default, so it should be on). Review Integrate.io’s support docs to do this on various database providers.
Provide Integrate.io with access information on your servers and with Amazon Redshift.
Integrate.io will automatically recreate your Postgres tables in Redshift.
Start the sync, sit back and watch Integrate.io do the work.
Integrate.io will load the data already present in your tables to Redshift for free. And then, whenever any data or schema change is made, Integrate.io will automatically replicate those changes to Redshift.
In addition, Integrate.io automatically maps Postgres data types into formats used by Redshift. This eliminates the need for your team to do this manually.
Getting the most out of PostgreSQL and Redshift
PostgreSQL and Redshift are both powerful tools on their own. Together, they make you nearly unstoppable.
PostgreSQL makes an excellent production database, especially when you regularly run sophisticated data analytic programs and have high levels of concurrency on the platform. However, speed and reliability become issues as your database grows. As a row-based database, it’s not built for gigabyte-level datastores. That’s where Redshift comes in.
With Redshift, you can store and analyze all of your company’s data in one place. Replicate your Postgres database to Redshift on a continuous basis to have the most updated insights on your business.
There are many ways to replicate Postgres to Redshift, but Integrate.io is the simplest and most reliable ETL available.
Our world-class, 24/7 customer support will proactively monitor your pipeline to ensure you’re always up and running.
See for yourself. Start your 14-day free trial of Integrate.io.