MySQL is the most popular open-source and free database in the world because it is powerful, flexible, and extremely reliable. But when it comes to data analytics, many companies turn to Amazon Redshift to complement MySQL. There are several ways to replicate your MySQL data to Redshift. But first, let’s dig a little deeper into why you should replicate your MySQL database to Redshift.
Five things to know about moving data from MySQL to Redshift:
- Tens of thousands of companies use MySQL to power their web-based applications and services.
- However, MySQL is quickly bogged down by even the smallest analytical queries, making data analytics difficult and putting your entire application at risk of crashing.
- Redshift is built to handle petabytes of data and deliver analytics in a fraction of the time.
- The benefits of moving data from MySQL to Amazon Redshift include maintaining application performance, scalability, and faster analytics.
- When you pair MySQL and Redshift, you also eliminate the risk of crashing your production database while running queries.
Table of Contents
- How to Move Data From MySQL to Redshift?
- How to Connect Amazon Redshift With MySQL
- Import & Export
- Incremental SELECT & COPY
- Change Data Capture with Binlog
- MySQL to Redshift Replication in Minutes
How to Move Data From MySQL to Redshift?
You can connect MySQL and Redshift for better performance, scalability, and other benefits. Many companies that use MySQL to power their web applications choose Redshift for data analytics.
There are four ways to move data between MySQL and Redshift: Import & Export, Incremental SELECT & COPY, Change Data Capture (CDC) with Binlog, and Integrate.io ETL.
Why Replicate Data From MySQL to Redshift?
There are several reasons why you should replicate MySQL data to the Redshift data warehouse.
- Maintain application performance. As we’ve already mentioned, running analytical queries on your production MySQL database can severely impact its performance. It could even cause it to crash. Analytical queries are incredibly resource-intensive and require dedicated computing power.
- Analyze ALL of your data. As an OLTP database (Online Transaction Processing), MySQL is designed for transactional data like customer records and financial data. However, you want to draw insights from your entire set of data, including the non-transactional type. You can use Redshift to capture and analyze all of your data in one place.
- Faster analytics. Redshift is a Massively Parallel Processing (MPP) data warehouse, which means it can crunch huge sets of data in a fraction of the time. MySQL, on the other hand, struggles to scale to the computing power needed for large, modern analytical queries. Even a MySQL replica database will struggle to achieve the same speed as Redshift.
- Scalability. MySQL was designed to work on a single-node instance, not modern distributed cloud infrastructure. Therefore, scaling beyond a single node requires time- and resource-intensive techniques like sharding or master-node setup. All of this slows down the database even further.
How to Connect Amazon Redshift With MySQL
Because of MySQL’s inherent weaknesses, many companies replicate data to Redshift for their analytics needs. There are four ways to accomplish this:
- Import & Export
- Incremental SELECT & COPY
- Change Data Capture (CDC) with Binlog
- Integrate.io ETL
Import & Export
The simplest way to replicate to Redshift is to export your entire MySQL data. However, this is also the least efficient method.
There are three steps:
- Import (or load)
Step 1: Export
To start, export data using MySQL’s mysqldump command. A typical mysqldump command looks like this:
$ mysqldump -h yourmysqlhost -u user mydatabase mytable1 mytable2 --result-file dump.sql
The output of this command is your MySQL SQL statement. You cannot run the SQL on Redshift as is — you’ll have to transform the statement into a format suitable for Redshift import.
Step 2: Transform
For the best upload performance, convert your SQL statement into TSV (tab-separated values) format. You can do this by using the Redshift COPY command.
The COPY command converts your SQL statement into TSV format. Then it batch uploads the files into a Redshift table in Amazon S3. For example, a row of data in your MySQL dump would look like this:
mysql> INSERT INTO `users` (`id`, `firstname`, `lastname`, `age`) VALUES (1923, ‘John’, ‘Smith’, 34),(1925,’Tommy’,’King’);
Using COPY, it will be transformed into this:
1923 John Smith
1925 Tommy King
Note that values are separated by a tab character (\t).
You may also have to convert data values to be Redshift compatible. This is because MySQL and Redshift support different column and data types.
For example, the DATE value ‘0000-00-00’ is valid in MySQL, but will throw an error in Redshift. You have to convert the value into an acceptable Redshift format, like ‘0001-01-01.’
Step 3: Import (Load)
After you’ve transformed your MySQL statement, the last step is to import it from S3 to Redshift. To do this, simply run the COPY command:
COPY users FROM 's3://my_s3_bucket/unload-folder/users_' credentials 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';
Downsides of Import & Export
Although Import and Export is the simplest way to replicate to Redshift, it is not ideal for frequent updates.
For example, it takes roughly 30 minutes to export 18GB of data from MySQL over a 100Mbps network. It takes another 30 minutes to import that data into Redshift. This assumes you experience zero connection issues during import or export, which would force you to start the process over.
Incremental SELECT & COPY is a more efficient method of replicating MySQL to Redshift.
Incremental SELECT & COPY
If the Import & Export is too slow for your needs, incremental SELECT & COPY might be your answer.
The SELECT & COPY method only updates the records that have changed since the last update. This takes considerably less time and bandwidth compared to importing and exporting your entire dataset. SELECT & COPY enables you to sync MySQL and Redshift much more frequently.
To use incremental SELECT & COPY, your MySQL table has to meet a couple of conditions:
The table must have an updated_at column, whose timestamp gets updated every time the role is changed. The table must have a unique key or keys.
Like Import & Export, there are three steps to this method:
Incremental SELECT exports only the rows that have changed since the last update. The SELECT query you run on MySQL looks like this:
SELECT * FROM users WHERE updated_at >= ‘2016-08-12 20:00:00’;
Save the result to a file for transformation.
This transformation step is the same as the Export & Import method. Transform the MySQL data into TSV format for Redshift.
At this point, your MySQL TSV file includes both updated rows and newly inserted rows. You cannot simply run a COPY command straight to your destination Redshift table. This would cause the updated rows to be duplicated.
To avoid duplicate rows, use the DELSERT (DELete + inSERT) technique:
- Create a temporary table on Redshift with the same definition as the destination table.
- Run COPY command to upload data to the temporary table.
- Delete rows from the destination table, which also exist in the temporary table. It will look like this:
DELETE FROM users USING users_staging s WHERE users.id = s.id;
Where “id” is the unique key of the table.
Finally, Insert rows from the temporary table to the destination table:
INSERT INTO users (id, firstname, lastname, updated_at) SELECT id, firstname, lastname, updated_at FROM users_staging s;
Downsides of SELECT & COPY
Incremental SELECT & COPY is more efficient than Export & Import, but it has limitations of its own.
The main problem is that rows deleted from your MySQL table stay in Redshift indefinitely. This isn’t an issue if you want to keep historical data on Redshift while purging old data from MySQL. Otherwise, deleted rows in Redshift can cause major headaches during data analysis.
Another drawback of this method is that it doesn’t replicate table schema changes. When a column is added or dropped from the MySQL table, you need to make the corresponding change on the Redshift table manually.
Finally, the query used to pull updated rows from a MySQL table can affect the performance of your MySQL database.
If any of these drawbacks are a dealbreaker, this next method is for you: Change Data Capture with Binlog.
Change Data Capture With Binlog
Change Data Capture (CDC) is a technique that captures changes made to data in MySQL and applies it to the destination Redshift table. It’s similar to incremental SELECT & COPY in that it only imports changed data, not the entire database.
Unlike Incremental SELECT & COPY, however, CDC allows you to achieve true replication of MySQL to Redshift.
To use the CDC method with a MySQL database, you must utilize the Binary Change Log (binlog). Binlog allows you to capture change data as a stream, enabling near real-time replication.
Binlog not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are also deleted in Redshift.
Getting Started with Binlog
When you use CDC with Binlog, you are actually writing an application that reads, transforms, and imports streaming data from MySQL to Redshift.
There is an open-source library called mysql-replication-listener that you can use to do this. This C++ library offers a streaming API to read data from MySQL bBnlog in real time. A high-level API is also available for a couple of languages, including kodama (Ruby) and python-mysql-replication (Python).
First, set the MySQL config parameters to enable Binlog. The following is a list of parameters related to Binlog:
log_bin = /file_path/mysql-bin.log
Parameter binlog_format sets the format to how Binlog events are stored in the Binlog file. There are three supported formats: STATEMENT, MIXED and ROW.
STATEMENT format saves queries in the Binlog files as is (e.g., UPDATE SET firstname=’Tom’ WHERE id=293;). Although it saves Binlog file size, it has issues when used for replication.
For replication to Redshift, use ROW format.
ROW format saves changed values in the Binlog files. It increases the Binlog file size but ensures data consistency between MySQL and Amazon Redshift. log_bin sets the path where Binlog files are stored. expire_logs_days determines how many days Binlog files are kept.
Specify the tables you’d like to replicate in the replicate-wild-do-table parameter. Only those tables specified should go into the Binlog files.
We recommend keeping Binlog files for a couple of days. This ensures you have time to address any issues that arise during replication.
If you use a MySQL replication slave server as the source, it’s important to specify the log-slave-updates to TRUE. Otherwise, data changes made on the replication master will not be logged in the Binlog.
Also, your MySQL account needs to have the following privileges to perform replication-related tasks:
- REPLICATION SLAVE
- REPLICATION CLIENT
- LOCK TABLES
2. Export & Transformation
When you use the Binlog, “export” is really a real-time data stream of your MySQL Binlog files. How the Binlog data is delivered depends on the API you use.
For example, with kodama, Binlog data is delivered as a stream of binlog events.
Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, etc.). Your application will receive Binlog events. It will then generate an output ready for Redshift import (for data changes) or schema change (for table schema changes).
The data change import is similar to the Transformation steps of our other replication methods. Unlike the others, however, Binlog allows you to handle delete events. You need to handle delete events specifically to maintain Redshift Upload Performance.
The final step of moving data from MYSQL to Redshift is to import your Binlog data stream.
The problem is Redshift doesn’t have streaming upload functionality. Use the DELSERT import technique we outlined in the Incremental SELECT & COPY method.
Downsides of Binlog
Binlog is the ideal method of replication from MySQL to Redshift, but it still has downsides:
Building your CDC application requires serious development effort.
In addition to the data streaming flow described above, you will have to build:
Transaction management. Track data streaming performance in case an error forces your application to stop while reading Binlog data. Transaction management ensures you can pick up where you left off.
Data buffering and retry. Similarly, Redshift can become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back online. If this step is done incorrectly, it can cause data loss or duplicate data.
Table schema change support. A table schema change Binlog event (ALTER/ADD/DROP TABLE) comes as a native MySQL SQL statement that does not run on Redshift as is. To support table schema changes, you’ll have to convert MySQL statements to the corresponding Amazon Redshift statements.
If you don’t want to spend developer resources on your own CDC Binlog application, there’s a fourth method: Integrate.io
Integrate.io: MySQL to Redshift Replication in Minutes
Integrate.io’s ETL pipeline utilizes the MySQL Binlog to replicate data to Redshift in near real-time.
Integrate.io manages the entire replication process, so you don’t have to. You can even synchronize multiple MySQL databases (and other types of databases as well) to Redshift at the same time.
Setting up Integrate.io is so simple that you can do it over your lunch break:
- Enable binary logging on MySQL.
- Provide Integrate.io with access information on your servers and with Amazon Redshift.
- Recreate your MySQL tables on Amazon Redshift.
- When you’re ready, Integrate.io will replicate your MySQL data to Redshift.
- Whenever a change is made in the Binlog (data or schema change), Integrate.io will automatically replicate those changes to Redshift.
In addition, Integrate.io automatically maps MySQL data types into formats used by Redshift. This eliminates the need for your team to do this manually.
Integrate.io isn’t the only ETL-as-a-Service on the market, but it is the simplest and most reliable ETL. Our world-class, 24/7 customer support will proactively monitor your pipeline to ensure you’re always up and running.
How Integrate.io Can Help Move Data From MySQL to Redshift
You rely on MySQL to power your business, but its limitations with data analytics are well-known. Redshift provides a simple, powerful solution to your BI needs. Together, MySQL and Redshift can push your business to the next level.
As you’ve seen, there are numerous ways to replicate data from MySQL to Redshift. Methods range from simple to complex, and painfully slow to nearly real-time. The method you choose depends on several factors, such as:
- Replication frequency
- Size of your MySQL dataset
- Available developer resources
The fastest, truest replication method is Change Data Capture (CDC), which utilizes MySQL’s Binlog. The downside is that it requires developer hours to build and maintain the application.
That’s where an off-the-shelf tool like Integrate.io comes in.
Use Integrate.io for fast, reliable replication from MySQL to the Redshift database without the hassle and headache. Its ETL pipeline utilizes the MySQL Binlog and replicates data to Amazon’s data warehouse quickly. Or you can synchronize several MySQL databases to Redshift simultaneously. Integrate.io also provides over 200 data connectors that effortlessly sync datasets between data sources and destinations, such as various relational databases, transactional databases, SaaS tools, data warehouses, and data lakes.
Other benefits of Integrate.io include:
- Create no-code ETL and Reverse ETL pipelines in minutes with a drag-and-drop interface.
- Get the fastest ETL data replication in the industry and unify your data every 60 seconds, helping you create a single source of truth for reporting.
- Optimize API management with secure, self-hosted REST API code automation that powers your data products.
- Improve data observability with custom alerts and monitoring.
Take advantage of Integrate.io's 14-day free trial and get more value when moving data from MYSQL to Redshift. Or, if you want to solve unique data migration and data transfer problems, schedule an intro call with an expert. Integrate.io will identify your pain points and provide you with solutions one-on-one.