It's clear that relational databases are here to stay, no matter how popular the NoSQL paradigm becomes. They're extremely common, integrate well with other information systems, and do a good job of querying structured data—plus, people already know how to use them.
Some have argued that adding Apache Hadoop to the organization means death to the good old relational database management system, but that's far from the truth. Not only can Hadoop and relational databases work in parallel, but also together with one another.
In this blog post, we'll dive into some of the basics about Apache Hadoop, explore why and how it can work into some of the "traditional" big data stacks within an organization, and look at the different "pitfalls" that can come up.
Table of Contents
- Why Integrate Apache Hadoop with Your RDBMS
- Integrating Hadoop with Relational Databases using Apache Sqoop
- Apache Sqoop: What You Need to Know
- Is Sqoop the Right Tool For You?
- Integrate.io and Apache Hadoop
Why Integrate Apache Hadoop with Your RDBMS?
First, let's take a step back and examine one of the big-picture questions: why integrating Apache Hadoop with a relational database might be necessary. Below, we'll discuss multiple reasons that might play into your rationale.
Hadoop is great for batch processing big data in different formats and from various sources, including relational databases (to do so, however, you'll need a way to import structured data into Hadoop). In particular, most data warehouses are based on relational databases, which is relevant when you want to export your aggregated data from Hadoop into a data warehouse.
Running analyses on big data can get highly resource-intensive. Instead of performing these operations on the data as it sits in a relational database, it's much more efficient to do it in the Hadoop Distributed File System (HDFS) in a Hadoop cluster. By integrating Hadoop with your relational databases, you'll improve the scalability and performance of your big data workflows and environment.
Another use case is using Hadoop’s HDFS as cheap storage for archived data. You could pull this data from a relational database from time to time and then restore it back to the database when necessary. According to the BI consulting firm StatSlice: "Hadoop’s cost per terabyte is much less than high-end data warehouse database servers like Teradata and Oracle’s Exadata."
For information on Integrate.io's native Hadoop HDFS connector, visit our Integration page.
Since it's a distributed file system, Hadoop lacks some of the functionality that we've come to expect from relational databases, such as indexing, query optimization, and random data access. But Hadoop compensates for this fact by offering other benefits—such as efficiently processing massive quantities of unstructured data. Combining Hadoop and relational databases gives you the best of both worlds for your big data workflow and ecosystem.
Integrating Hadoop with Relational Databases using Apache Sqoop
Apache Sqoop is a highly useful tool for integrating Apache Hadoop with an RDBMS. Sqoop connects to databases like MySQL, PostgreSQL, or Oracle via JDBC, the standard API for connecting to databases with Java. It also works with non-JDBC tools that allow bulk import or export of data, but with certain limitations (e.g. no support for certain file types).
Sqoop works by running a query on the relational database and exporting the resulting rows into files in one of these formats: text, binary, Avro, or SequenceFile. The files are then saved on Hadoop’s HDFS. Sqoop also works in the opposite direction, letting you import formatted files back into a relational database. To take advantage of parallelism and improve fault tolerance, Sqoop uses the MapReduce paradigm to import and export data.
Apache Sqoop: What You Need to Know
Before you commit yourself and your organization to using Sqoop, there are a few potential issues you will need to know about.
Issue #1: Sqoop 1 and Sqoop 2
Sqoop provides two different versions of the tool: Sqoop 1 and Sqoop 2, which are incompatible with each other. Whereas Sqoop 1 runs from the command line, Sqoop 2 has a web application and a REST API. However, Sqoop 2 has since been deprecated, so you shouldn't use it in production.
Issue #2: JDBC Drivers
Sqoop works best with relational databases that support JDBC. This means that you need to install the right drivers on every machine in the Hadoop cluster. Using JDBC connections can also come at the cost of sacrificing performance when compared with other alternatives.
Issue #3: Data Types
Different databases support different data types. These data types convert to Java types when exporting with Sqoop, which is a conversion that may not always run smoothly.
For example, delimited text files could have problems saving the float data type in string format, UTF encoding formats might not convert correctly, and NULL values might cause problems. Also, Sqoop might not recognize custom data types supported by your database. In this case, you need to control type mapping via the command line or write custom Java code to map data correctly.
Issue #4: Multiple Processes
Sqoop exports data from the database in parallel using several processes. This is great since it speeds conversion, but it also means the database may be under strain while Sqoop is working its magic. Fortunately, you can configure the number of map tasks to control the number of processes used.
Also, you should set a table column to split ranges of rows across Sqoop’s processes. Otherwise, the data may not get exported efficiently, since several processes may try to handle the same rows. As long as the relevant column is a primary key, this issue won’t happen.
Issue #5: Processing on Hadoop
Once the data moves out of the relational database into files on HDFS, how does Hadoop process it? Sqoop integrates with HCatalog, a table and storage management service for Hadoop that allows for querying the files imported via Sqoop with Hive or Pig. To use Sqoop, you'll need to familiarize yourself with the HCatalog tool—which will come in handy, since you can use it to create a consistent interface between Hive, Pig, and MapReduce.
Issue #6: Getting Help
Like many free and open-source tools, Sqoop doesn't come with built-in support. Despite the healthy and sizable Apache community, it can be difficult to get the help you need, especially when you need it in a hurry. Finding a Hadoop vendor who can offer a support plan, or hiring the right in-house technical talent, could be a very wise investment.
Is Sqoop the Right Tool For You?
So, we come to the final question: is Sqoop the right tool for you and your organization?
The bottom line: Integrating relational databases with Hadoop may be necessary to get data flowing in your organization. Although it needs some fine-tuning and has a few limitations, Sqoop is usually the right tool for your database jobs.
Integrate.io and Apache Hadoop
If used correctly, Apache Hadoop can become a powerful part of your data integration process as part of a comprehensive ETL (extract, transform, load) solution. That's where Integrate.io comes in. With the insights gained from Apache Hadoop (and others), Integrate.io's cloud-based ETL solution delivers the simple, powerful visualized data pipelines your organization needs. With Integrate.io, you can transform, normalize, and clean data, all while adhering to compliance best practices.
Ready to see how Integrate.io can help with Hadoop—and more? Contact us to set up a demo and a free trial of the Integrate.io platform.