Table of Contents
In data engineering teams the list is long and depends on your personal role. But my general pick is “Poor SQL statements by end data scientists and data analysts”.
Probably not an obvious answer, so let me explain.
Let’s look at this from a perspective of what three major areas of work are for data engineering teams if you’re working with a data warehouse:
- building ETL pipelines → getting data into your warehouse
- building transformations → joining / transforming different data sets
- exposing data for downstream usage → reports, analytics, ML/AI
Data engineers also need to catalog and organize metadata, and define the processes to write and read data from the warehouse. In a way, they are the librarians of the data warehouse.
And then the goal is to abstract and automate as much as possible. By automating, data engineers can dedicate their scarce time to building vs. maintaining / fixing.
You’re also making a promise to the business by adding SLAs to the data you provide. “The reports will be done by no later than 6am PST” or “our analytics environment is only 15 minutes behind our production environment”.
And voilà, you’ve done all of the above, you put it into production, a little tuning, and everything is working fine. You can move on to something else. Well, no.
Drivers of change
Things are not static. If you’re working for a business that’s growing, then you’ll have to deal with three challenges:
- Data volume is growing, by a factor ~10x over 5 years, with more and new types of data sources emerging along the way
- The number of models is growing. As you ingest more data into your warehouse, you can combine that data in infinite new ways. You’ll hear the term “DAG” (direct acyclic graph).
- The number of users & tools is growing. As the business grows, the number of people who need/want access to data is growing. And they will want to access that data with the tool of their choice.
Here’s an example of what a DAG can look like, this one is from a venture-funded eCommerce company (taken from dbt).
Quite complex. And that’s just one DAG. So imagine more of the same.
The Challenge of Data Engineering
Now you’re the data engineer in charge of this stack. Your company will hire more data engineers to keep things running. For example, Netflix has one data engineer per data source, and their entire job is to keep the plates for that data source spinning.
But not every company has the budget of a Netflix. There’s a limit for headcount. However, there’s seemingly no limit for the number of data scientists and analysts your company is hiring. More eyeballs on data to be “data-driven”.
And so the ratio between “data builders” (the data engineers) and the “data consumers” (data analysts, scientist, machine learning apps, etc.) shoots up.
I’ve seen a ratio of (data builder) to (data consumer) anywhere between 1:20 to 1:40. One data engineer has to support 20–40 downstream users.
And that’s where the problems start. Going back to the initial three areas of work, here’s what will happen:
- ETL pipelines run for a long time and produce errors and issues. You may only find out post-runtime though, and now you have to figure out what broke. It’s a huge distraction.
- The models in place may not provide the answers the business wants. Analysts want to move fast, so they bypass you and start adding new models, or even query the raw data directly in your warehouse. That’s causing bloat and broken models if the underlying tables change.
- Your end-users may be using tools that generate the SQL for them. Or they write their own SQL statements. Both approaches may lead to poor SQL syntax that strains the entire warehouse, with slow queries for everybody.
And then users file support tickets with the data engineers (“my queries are slow”, or “my queries don’t complete or finish”). And you get inundated by support requests.
We’re of course dramatizing, but directionally, these are the three worst parts of the job. Let’s call it “keeping the plates spinning”.
The worst part of data engineering
The worst one in my book is the last point – dealing with poor SQL.
That’s because the pipelines and models are something you can control. Conventions, tooling, monitoring, alerting, access rights, etc. – there’s a way to put guardrails around things.
But it’s impossible to control the end-users and their SQL. For example, I’ve seen “SELECT *” queries without a WHERE clause, joining two tables with two billion rows each. The output is so large that it fills up and takes down the warehouse. “Who wrote that query??”.
Less dramatic results include writing queries that take e.g. 10 minutes to execute when a small change could result in a 1 minute execution time. That may not sound like a big deal (“I’ll go and get a coffee meanwhile”), but it’s a huge loss of productivity. For data science, fast iterations and testing a model is everything.
Yes, you can set rules to terminate queries, but all that does is drive up the number of support tickets your analysts’ file because queries don’t finish.
And for the data engineer, it’s not obvious who writes these queries either. The tools analysts use mask the users behind them. A dashboard tool like Tableau, Looker or Mode Analytics appears as one user in your warehouse.
But behind them, they may have 100–200 people writing queries. So you use “Looker” as the user, but you don’t know if it was “Jack”, “Anne” or “Joe” who wrote the query. So there’s a lot of digging involved to find out what happened and who wrote which query.
So there you go, above is the long version. The short version of the answer is “poor SQL statements by end users”.
It’s a problem for three reasons:
- You don’t have control over the SQL syntax that analysts write. Chances are you will only find out once a query has run and the damage is done.
- The tools that analysts use to write queries mask the users behind them. With hundreds of users, finding the one who wrote the query is like finding the proverbial needle in a haystack.
- You can’t just turn analysts off or kill their queries – it will result in an increase in support tickets and friction between data engineering and data consumers.
With the ratio of data producers to data consumers growing, the problem only gets bigger. The more end-users you have to support, the more complaints and tickets you have to deal with, and it’s a massive frustration and time suck.
The answer to the problem is, of course, to empower analysts to write better SQL, and help data engineers collaborate with the analysts on that.