Netflix 

The data infrastructure at Netflix is among the most complex in the industry: the platform serves over 550 billion events per day, amounting to roughly 1.3 petabytes of data. Netflix’s architecture is broken down into smaller systems, such as systems for data ingestion, analytics, and predictive modeling. At the core of the stack sits Apache Kafka, which handles real-time (sub-minute) processing of events and data. Data needed long-term flows from Kafka into AWS S3 and EMR for persistent storage, as well as into Redshift, Hive, Snowflake, RDS, and other services that back different sub-systems.

Metacat was built so the data platform can interoperate across these data sets as one “single” data warehouse: its job is to connect the various data sources (RDS, Redshift, Hive, Snowflake, Druid) to the various compute engines (Spark, Hive, Presto, Pig). Other Kafka outputs feed a secondary Kafka sub-system, predictive modeling with Apache Spark, and Elasticsearch. Operational metrics don’t flow through the data pipeline at all; they travel through a separate telemetry system named Atlas.
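To make the Kafka-to-S3 leg of such a pipeline concrete, here is a minimal sketch using PySpark Structured Streaming. This is not Netflix’s actual code, which is not public; the broker addresses, topic name, bucket paths, and app name are hypothetical placeholders, and the trigger interval simply illustrates the sub-minute-to-batch handoff described above.

```python
# Minimal sketch of a Kafka -> S3 archival job (hypothetical names throughout).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-to-s3-archiver")  # placeholder app name
    .getOrCreate()
)

# Subscribe to a Kafka topic carrying the raw event stream.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "raw-events")                  # placeholder topic
    .load()
)

# Persist the raw events long-term as Parquet on S3. The checkpoint location
# records the stream's progress so the job can resume after a failure.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/events/")            # placeholder bucket
    .option("checkpointLocation", "s3a://example-bucket/chk/")  # placeholder path
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```

Writing raw events to S3 as immutable files, then layering warehouses like Hive or Snowflake on top, is the standard pattern the paragraph describes: the stream stays cheap and replayable while each downstream sub-system picks the storage engine that fits its workload.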

[Figure: Netflix Data Infrastructure]
[Figure: Metacat]
