has been working with companies of different sizes - a fintech startup with a 20 strong team and an e-commerce company with 500 employees. Some have 1-2 engineers & others have elaborate data teams with data engineers, data scientists, and data analysts.

This post is my attempt to guide some early teams with tight engineering resources. If you have growing data, you need to view it in one place, analyze it effortlessly and make decisions based on a single source of truth. Here is a super-mini crash course on what goes into it:

What is a data stack?

1. Data Sources

a. Your Transactional Database

Every business has a database where all your transactions, visits, and customer information is stored. It is commonly known as a transactional database. Some popular transactional DBs include MySQL, PostgreSQL, MongoDB among many others.

If this database is growing, then so is your product usage/site visits. Often, it becomes hard to query or analyze this database directly. Why? Because this database is meant for transaction storage and not for analytics. If you try to run analytical queries on top of this DB, either they will take hours to run, or they might bring the whole database down.

b. Your SaaS applications

These include your CRM, Google Analytics, Adwords, Facebook Ads, Customer Support system, marketing automation, and more.

Most of us are analyzing this data in silos - on their respective dashboards. But, you could gain more insights by putting all of this in one place along with your transactional (database) data and run queries on them together.

You need to know if an ad campaign is working not just by seeing the point in time conversion, but also what the lifetime value of that customer is i.e., How many orders did that customer place since conversion and what has been the profitability to date?

2. Data Warehouse

Remember the problem with running analytics directly on top of your transactional database? A data warehouse solves that problem. As the word “warehouse” suggests, this is the best way to store your data specifically for analytics. It is an efficiently designed database for analytical queries.

Some of the most popular data warehouses are Amazon Redshift, Snowflake, Google BigQuery, and there is also a free & open-source one called ClickHouse.

3. Transforms

Some data needs to be found and edited to compose the correct queries. Since all the data is ingested in its raw format, some things may need to be connected to filter for the right answers. There are tools like dbt that help you “transform” your data once it is in the data warehouse. The need and logic to transform depending on a case-by-case basis.

4. Visualize and Analyze

Once you have the underlying engine with real-time, accurate data ready to go (on your warehouse), you need the dashboard view, and the ability to query and filter things you want to see.

This can be done on many Business Intelligence (BI) tools. Some popular ones include Mode, Looker, PowerBI & Tableau. You can pull reports and run queries. Understand what your data is telling you.

All these 4 steps,

Source >> Data Warehouse >> Transform >> Visualize = your modern data stack.

Why do I need a data engineer?

For precisely the above. You need a data engineer to stitch up your data stack, and more importantly, maintain and manage it. Large companies build custom internal tools for all the above, so they have larger data engineering teams.

However, there are scalable tools for every step now available as plug-n-play, ready for off-the-shelf deployment. If you are just getting started, you need to set up these tools, and they take you a long way for a fraction of the cost of a data team.

What does have to do with all of this?

If you don’t have a data engineer on your team, we provide a white-glove service to stitch-up your data stack, for the cost of just 1 week’s engineer salary.

What will all these tools cost me regularly?

That completely depends on the size of your data + amount of computes/queries/changes every month. The components are:

  1. Source connectors to Data warehouse - the pipes (starts at ~$99 per month)
  2. Data warehouse - Redshift, Snowflake, BQ (starts at ~$199 per month)
  3. Visualize - several tools here (starts at free)

All the above tools have public pricing pages. The hidden part of the equation is your data size. So we need that as input to design a budget for you. I know the above is not very specific or predictable till we understand the state of your current data.

Nonetheless, an accurate view of all your data should have an immense ROI compared to the cost of your monthly data stack. We can also pick completely open-source solutions if you do not have any budget for any of these tools.