Overview

With the growing use of ChatGPT and the discussion surrounding its applications, I decided to test what it says about important skill sets for data engineers. I asked it about both soft and hard skills, and here is what it came up with. I have added my own commentary to many of the 13 skills it identified.

The skills identified range from general technical expertise and problem-solving, which fall into the generalist category of any data-related position, to more focused skills related to data pipelines, database management, and DevOps. The point is that data engineering requires the ability to manage and execute projects, and it requires several soft skills to ensure collaboration across teams and an understanding of the business needs that drive data pipelines. Many of these skills overlap with what is required of data scientists and with machine learning needs. The reality is that many of these skills are required across data engineering jobs and data analytics roles throughout the organization.

The gaps between software engineers, data scientists, data engineers, machine learning engineers, data analysts, etc., are blurring. As organizations continue to use generative AI, data skills and data science roles will become more important within organizations.

Breaking down ChatGPT's 13 top skills

Here are the top skills identified. To me, what data engineers actually use on a daily basis, and which skills are most valuable, matters more than what ChatGPT thinks.

Some of the most important skills for a data engineer to have include:

  1. Strong problem-solving skills: The ability to troubleshoot and solve complex technical problems related to data integration, data quality, and data performance is important for a data engineer. Any technical role requires the ability to solve challenges, as projects run into unknowns and people need to be able to react to changes and shifts in scope.

  2. Experience with data pipelines and ETL: Experience with creating, maintaining, and optimizing data pipelines, including Extract, Transform, Load (ETL) processes, is crucial for a data engineer. Data engineers also need to understand the use cases and applications for ELT, CDC, real-time data streaming, and API generation, and how to combine diverse data integration processes across the data ecosystem.

  3. Cloud computing: Knowledge of cloud computing platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure is increasingly important for data engineers as more organizations are moving their data and data pipelines to the cloud. Most organizations have a hybrid environment and will continue to adopt cloud platforms. Therefore, it's important to understand the key differentiators among platforms to make sure that storage and analytics choices are the most applicable for the business need.

  4. Data modeling and data warehousing: Experience with data modeling, data warehousing, and data governance is essential for data engineers to ensure that data is properly structured and stored for easy access and analysis. In my opinion, each of these is important enough to be its own category. Data governance requires a framework and buy-in from stakeholders across the organization. The reality is that most organizations will need data engineers involved, but they also need broader participation and management support to ensure success.

  5. Understanding of data security and compliance: Understanding of data security and compliance requirements is important for data engineers to ensure that data is protected and that the organization is compliant with regulations such as GDPR, HIPAA, and SOX. 

  6. Strong communication and collaboration skills: Data engineers need to be able to communicate effectively with other members of the data team, as well as with stakeholders from other departments, to understand their data needs and provide solutions. Security, compliance, governance, data management, etc. require collaboration across departments to ensure that the solutions implemented address the right challenges.

  7. Continuous learning: Data engineers must keep up with new technologies and techniques as the field is rapidly changing. Hopefully, an organization supports this through education, conferences, certifications, etc. An organization should be committed to constant improvement to ensure that people and projects are successful. 

  8. Project management: A data engineer should have experience in project management as they will be responsible for designing and executing data integration projects, ensuring that they are delivered on time and within budget. Many organizations no longer employ formal project managers within IT, so data engineers may need to manage their own projects and keep them on schedule.

The following 5 skills are pretty self-evident. 

  1. Database management and SQL: Proficiency in managing and querying relational and NoSQL databases. 

  2. Big Data technologies: Familiarity with big data technologies such as Hadoop, Spark, and MapReduce.  

  3. Programming languages: Proficiency in at least one programming language such as Python, Java, Scala, or R.

  4. Data storage and retrieval systems: Knowledge of data storage systems such as HDFS, S3, and data retrieval systems such as Apache Cassandra, Apache Kafka, and Apache Solr.

  5. DevOps: Knowledge of software development practices and tools for deployment, automation, and testing.
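To make a few of these skills concrete (ETL, SQL, and a programming language), here is a minimal sketch of an Extract, Transform, Load flow in Python. It is only an illustration under my own assumptions, not something ChatGPT produced: the sample CSV, the orders table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for an export from a source system.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline():
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_CSV)))
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    query = (
        "SELECT customer, COUNT(*) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    for customer, order_count in conn.execute(query):
        print(customer, order_count)
```

A real pipeline would add scheduling, logging, error handling, and tests, which is where the DevOps and problem-solving skills above come in.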

Takeaways

All of these skill sets are important and transferable within the data space. With the increasing use of artificial intelligence, chatbots, search engines, etc., organizations will rely more on tools like those from OpenAI, but these tools don’t necessarily define what success looks like in these roles.