Folks have technically been doing some form of ETL since the dawn of writing. If you were a land-owner in the outskirts of Cairo, the Pharaoh had one of their tax collectors come and assess your land and assets, write them down, and calculate how much taxes you would owe.
That tax collector would then store that information in a safe location and continue about their business. When the Pharoah deemed it was tax-time again, that tax collector would EXTRACT their notes on your property, TRANSFORM it using a calculation of your old assets and new ones (and the tax rate this season), and LOAD your new taxes owed onto the papyrus. The ETL process evolved as technology evolved.
In modern times, many ETL processes can be done with computers. In fact, you can even do them in a single database! Here we’ll show you how to extract, transform, and load data using PostgreSQL.
For more information on Integrate.io's native PostgreSQL connector, visit our Integration page.
Table of Contents
You can learn about Integrate.io's Postgres integrations here.
The "Extract" step of an ETL pipeline is the process of getting data from a data source. Since PostgreSQL is a database, it can be the data source from which we extract data. One way to extract data from a Postgres database is through a query:
SELECT * FROM some_table WHERE some_value > 10;
This will output the data to the console where you run the query. This could be useful if you need to view a specific segment of your data.
You can also run a SQL statement like the one above inside another piece of code. For example, if you were writing code in Python you could execute the following:
my_data = cur.execute("SELECT * FROM some_table WHERE some_value > 10").fetchall(
The above line of Python will store the results of the query in the variable "my_data". Now, you can examine and use the data in your Python.
The last form of extraction we'll discuss is a data dump. This is where you store the specified data in an output file. It is useful for backing up a database (you can even specify your output file to be in SQL syntax, and you'll have your entire database stored as CREATE/INSERT commands inside a SQL file).
To dump a database, run the following in your terminal:
pg_dump -U some_user -W -f backup_file.sql -t some_table your_database
Let's break this down real quick:
- pg_dump: run the pg_dump command
- -U some_user: log into Postgres using the "some_user" user
- -W: ask for a password to log into the database
- -f backup_file.sql: store the output in "backup_file.sql"
- -t some_table: only grab data from the table "some_table"
- your_database: run all of the above on the "your_database" database
Now that we've covered a few ways to extract data from a Postgres database, we can explore transforming data in PostgreSQL
There are many ways to transform data, depending on what kind of data it is and what you need to do with it. We can do transformations on the data outside of the database in application code, but it can be good practice to do some transformations within your database as well. For example, if you want to grab all of your user's last names from a table and make them all uppercase, you can do that within your SELECT statement:
Other useful string functions include LOWER (make all characters lowercase), TRIM (remove white space at beginning and end of string), and CONCAT (combine two strings).
If you would like to do a numerical transformation, PostgreSQL has many built-in tools for that. For example, if you wanted to find the average age of all of your users, you can use the AVG function:
SELECT AVG(age) FROM users;
Or perhaps you want to know your total revenue in 2020:
Other helpful numerical functions include MIN, MAX, and SQRT.
You can also perform transformations with dates. For example, if you wish to know how old your users are, you can use the AGE function (which is essentially date_1 - date_2). Here's an example:
SELECT AGE(birthdate) FROM users;
Now that we've examined extracting and transforming data in PostgreSQL, let's discuss getting your data into the database (the LOAD step).
You can load your data into a database via the SQL INSERT command:
INSERT INTO your_table COLUMNS (column_a, column_b) VALUES (value_a, value_b);
To insert multiple values, just add a comma and another set of values:
INSERT INTO your_table COLUMNS (column_a, column_b) VALUES (value_a_1, value_b_1), (value_a_2, value_b_2), etc;
You can run these INSERT commands from the SQL console, or from some application code like Python. In addition to inserting data via SQL, you can do bulk loading of data using the pg_restore feature. We run this function from the command line (outside of a Postgres console).
It is the corresponding command to the pg_restore command we discussed earlier. In other words, unless you format your file exactly like a pg_restore'd file, you may run into some issues loading it. Here's how you load a file using pg_restore:
pg_restore -U some_user --data-only -f backup_file.tar
We'll break down this command:
- pg_restore: this is the function we're running
- -U some_user: the "-U" is short for "--username". In another word, we're saying "sign me in as the 'some_user' user"
- --data-only: this command assumes that the schema/table we're writing to already exists, and we want to insert data into that table. If you do not include this in your command, it will try to create the schema and table structure required to load the data you're about to load.
- -f backup_file.tar: here we specify which file contains the data we wish to load.
Putting It All Together
Now that we're familiar with extracting, transforming, and loading data with PostgreSQL, we can tie this all together and move data from one Postgres schema to another:
SELECT id, INITCAP(first_name), CONCAT(UPPER(LEFT(last_name, 1)), '.'), AGE(birthdate),
SUM(sales) INTO clean_data.my_sales_team FROM raw_data.sales GROUP BY first_name, last_name, birthdate;
- INITCAP(first_name): Capitalize the first character and make the remaining characters lowercase
- CONCAT(LEFT(last_name, 1), '.'): Take the left-most character from the last_name column (i.e. take the first letter of their last name), make it uppercase, and add a period after it. For example, "Baggins" would turn into "B."
- AGE(birthdate): Convert their date of birth into their age in year
- SUM(sales): Add all of their sales
- INTO clean_data.my_sales_team: We specify that the output of this query should be inserted directly into the "my_sales_team" table in the "clean_data" schema
- FROM raw_data.sales: This specifies that we wish to grab data from the raw_data schema's sales table
- GROUP BY first_nmae, last_name, birthdate: since we're combining values across rows for the SUM calculation, we need to group by the remaining columns.
How Integrate.io Can Help
Understanding and executing a successful ETL strategy can be complex. There are ways to build ETL pipelines using tools that require very little coding. If those tools don’t cut it, and you wish to make complex calculations using your favorite programming language, you may need to incorporate a workflow management tool like Airflow or Luigi. For a lot of scenarios, that can be overkill, and you simply want to do all of your work within your database. This is where PostgreSQL truly shines. Since it is a database, it is way more efficient at processing data than applications that use a database.
At Integrate.io, we're data people. If you would like to learn more about our services and offerings, schedule a call now.