As an organization grows, so do its data and the complexity of that data. The ability to leverage that information is key to remaining competitive. Catherine Vlaeminck, Vice President of Worldwide Marketing at Infinidat, puts it best: “Storage consolidation reduces costs, improves agility, and lessens the complexity of dealing with too many vendors. To put it simply, it is liberating.” Amazon Redshift is a data warehouse solution for consolidating massive amounts of information from across the organization. With the information stored in one location, leaders can use it for querying and analysis, which is critical for informed decision-making. In this article, we'll answer questions such as "How does AWS Redshift work?", "Why is it ideal for large data sets?", and "What are its pricing options?"

Table of Contents

  1. What is Redshift and How Does AWS Redshift Work?
  2. How Does AWS Redshift Work For Large Data Sets?
  3. How Does AWS Redshift Work? Key Architecture Components
  4. How Does AWS Redshift Work: Getting Data Into The Warehouse
  5. How Does AWS Redshift Work: Pricing
  6. How Can Help

What is Redshift and How Does AWS Redshift Work?

According to Gartner, information as an asset is still in the “early adoption” phase, which makes it a competitive differentiator for leading organizations focused on digital transformation. In turn, data and analytics become strategic priorities, and the need to store and manage that information efficiently is more important than ever.

Amazon Redshift is a cloud-native data warehousing platform from Amazon Web Services (AWS). Redshift shines in its ability to handle huge volumes of structured and semi-structured data, scaling into the exabyte range. It is also capable of performing high-performance batch analysis of large datasets. Using Redshift allows companies to “leverage the economics of cloud elasticity, benefiting from the pace of innovation in sync with public cloud providers, and more,” says David Smith, Distinguished VP Analyst, Gartner.

How Does AWS Redshift Work For Large Data Sets?

Outstanding performance is the primary factor that makes this platform so ideal for large datasets. This robust warehouse implements several features that help it outperform other options.

Massively Parallel Processing

Complex queries are typically plagued by slow response times. The platform uses Massively Parallel Processing (MPP) to enable fast execution of complex queries: multiple compute nodes share the query processing, with each node executing the query against its own portion of the data. The results are then aggregated as the final step.

Data Compression

AWS Redshift compresses the information in the system to reduce storage requirements and disk I/O. When a query is initiated, the compressed information is read into memory and decompressed during query execution. Reading less data from disk speeds up queries and leaves more memory available for other processing tasks.
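As a sketch, compression encodings can be declared per column at table creation, or Redshift can recommend them from a sample of the data (the table and column names below are hypothetical; by default, the COPY command also applies automatic compression on first load):

```sql
-- Hypothetical table with explicit column compression encodings
CREATE TABLE sales (
    sale_id  BIGINT        ENCODE az64,
    region   VARCHAR(32)   ENCODE lzo,
    amount   DECIMAL(10,2) ENCODE az64
);

-- Ask Redshift to recommend encodings based on a sample of the data
ANALYZE COMPRESSION sales;
```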

Result Caching

Result caching stores the results of certain types of queries in memory for faster access. When a query is submitted, Redshift checks whether a valid cached result already exists for that query. If so, those results are returned and the database does not re-run the query.
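Result caching is on by default. A brief sketch of how you might observe or disable it in a session (the system view query is for illustration):

```sql
-- Disable the result cache for this session, e.g. to benchmark raw query speed
SET enable_result_cache_for_session TO off;

-- With caching on, SVL_QLOG's source_query column identifies queries
-- whose results were served from the cache of an earlier query
SELECT query, source_query, substring
FROM svl_qlog
ORDER BY query DESC
LIMIT 10;
```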

Workload Management

Another common challenge with traditional databases is managing the workload so that short, quick-running queries don’t get stuck waiting for longer-running queries to complete. Redshift solves this problem by creating query queues. The warehouse uses defined configuration parameters for each queue, which determine the priority of the items within it.
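Queues are defined in the cluster's workload management (WLM) configuration; at query time, a session can be routed to a matching queue by setting a query group. A minimal sketch, assuming a WLM configuration that defines an 'etl' query group (the group name and table are hypothetical):

```sql
-- Route subsequent queries in this session to the WLM queue
-- whose configuration lists 'etl' as a query group
SET query_group TO 'etl';

-- This statement runs in the matching queue
INSERT INTO sales_staging SELECT * FROM sales_landing;

-- Return to default queue routing
RESET query_group;
```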

How Does AWS Redshift Work? Key Architecture Components

Redshift's superior performance is due in large part to its architecture that takes advantage of multiple worker nodes and a high-bandwidth network connection between nodes. 

Nodes

Nodes are the worker components of the warehouse. Each node has its own dedicated CPU, memory, and attached disk storage. There are two types of nodes in the platform: a leader node and compute (worker) nodes. The leader node coordinates work among the compute nodes.

Node Slices

A compute node is partitioned into slices, each of which receives a portion of the node’s memory and disk space. Each slice handles a portion of the workload assigned to the node, and the slices work in parallel on the work to which they have been assigned.
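How rows are spread across slices is controlled by a table's distribution style. A brief sketch with a hypothetical table: using the join key as the distribution key places matching rows on the same slice, so joins on that key avoid redistributing data across the network.

```sql
-- Rows with the same customer_id land on the same slice,
-- so joins on customer_id can run slice-locally
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
```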

Clusters

A cluster consists of one or more nodes. If there is more than one node, one is designated the leader to distribute the workload.

Client Applications

Client applications communicate with the warehouse using industry-standard JDBC and ODBC drivers; because Redshift is based on PostgreSQL, many existing PostgreSQL client tools work as well.

Internal Network

Compute nodes run on an isolated network and cannot be directly accessed by clients. This isolation lets the platform speed communication between nodes using a high-bandwidth internal network and custom communication protocols.

Databases

The database is stored on the compute nodes. When a query is executed, the SQL client communicates with the leader node, which in turn distributes the work to the compute nodes.
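You can inspect the plan the leader node compiles before distributing work, as a quick sketch (tables are hypothetical):

```sql
-- EXPLAIN shows the query plan the leader node builds;
-- the compute nodes then execute the plan's steps in parallel
EXPLAIN
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
```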

How Does AWS Redshift Work: Getting Data Into The Warehouse

Getting information into and out of this robust database is relatively simple thanks to its support for many connectors and integrations. 

Moving Information Between Amazon Redshift and Amazon S3

Companies already using Amazon Simple Storage Service (Amazon S3) can easily move information into Redshift. Redshift uses parallel processing to load information from files stored in S3 buckets, and users can also export data from the warehouse back to Amazon S3.
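A minimal sketch of both directions, assuming a hypothetical bucket, table, and IAM role:

```sql
-- Load CSV files from S3 in parallel
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Export query results back to S3
UNLOAD ('SELECT * FROM sales')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```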

Amazon Redshift with Amazon DynamoDB

DynamoDB is an Amazon NoSQL database service that integrates easily with the warehouse. Users can use the COPY command to load information from DynamoDB into the warehouse.
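A brief sketch, assuming a hypothetical DynamoDB table and IAM role; READRATIO caps the share of the table's provisioned read capacity the load may consume:

```sql
-- Load a DynamoDB table into the warehouse,
-- using at most 50% of its provisioned read throughput
COPY movies
FROM 'dynamodb://Movies'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;
```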

Importing Information from Remote Hosts over SSH

The COPY command can also load data from other sources, such as Amazon EMR clusters, Amazon EC2 instances, or other remote hosts. In this case, COPY opens SSH connections to the remote hosts and loads the information in parallel.
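For SSH loads, a manifest file in S3 lists the remote hosts and the command each should run; Redshift connects to the hosts and loads their output in parallel. A sketch with hypothetical paths:

```sql
-- The manifest at this S3 path names the remote hosts
-- and the command whose output should be ingested
COPY logs
FROM 's3://my-bucket/ssh_manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
SSH;
```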

Extract, Transform, Load (ETL)

Users can automate data transfer and transformation into and out of the platform using another Amazon service, AWS Data Pipeline, which allows the scheduling of recurring jobs to handle complex transformations and loads. Companies can also use external ETL tools to build robust data pipelines in Amazon Redshift.

How Does AWS Redshift Work: Pricing

The warehouse offers exceptional pricing flexibility that empowers companies to better predict and control spending. However, the pricing tiers can be a bit difficult to understand. Below is a brief explanation of each of the options:

On-Demand Pricing

This is the simplest pricing tier: the company pays only for the capacity it uses, by the hour, with no commitments or upfront costs. The rate is based on the number and type of nodes in the cluster, and partial hours are billed in one-second increments. Any time the warehouse is paused, companies pay only for backup storage.
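As a back-of-envelope sketch (the hourly rate here is purely illustrative; actual rates vary by node type and region):

```sql
-- 4 nodes at an illustrative $0.25 per node-hour,
-- running a full 730-hour month: 4 * 0.25 * 730 = $730
SELECT 4 * 0.25 * 730 AS monthly_cost_usd;
```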

Concurrency Scaling Pricing 

Redshift automatically scales to ensure the best possible performance of the cluster, with no upfront costs. However, companies are charged a per-second on-demand rate for concurrency scaling clusters used beyond the free credits allocated to the company’s account.

Amazon Redshift Managed Storage Pricing

Managed storage is billed at a fixed rate per GB per month, which varies by region. Usage fees are calculated hourly based on the total amount of data stored. These charges do not include the backup storage consumed by automated and manual snapshots.

Reserved Instance Pricing

Companies with steady-state production workloads can benefit from reserved instances. With reserved instances, pricing is fixed and predictable, with no variable on-demand costs to deal with. This option requires a one- or three-year commitment.

How Can Help

If you are looking to integrate your data into Amazon Redshift, has a standard integration with the warehouse that helps you get up and running quickly. This integration requires no code, and anyone, regardless of technical knowledge, can build a robust pipeline in a matter of minutes. Are you ready to see what benefits the API can bring to your company? Contact our team today for a 7-day trial and see how we can help you reach your data integration goals.