How data engineering fails

What we would do to prevent it and what we would do if that actually happened

Last Updated: January 10, 2024

Benn Stancil, in his Datanova talk, How data engineering fails, delves into the potential pitfalls that data engineering might face. Drawing inspiration from the movie World War Z, Stancil emphasizes the importance of considering scenarios where widely accepted ideas or tools in the data industry might fail. He applies this thinking to popular tools like Snowflake, Fivetran, and dbt, exploring how they might not be the inevitable successes they are perceived to be.

6 challenges in data engineering

1. Data engineering is boring

Stancil begins by discussing the possibility that data engineering fails because it becomes a monotonous and unappealing job.

He highlights the parallels with the past declaration of data science as the “sexiest job of the twenty-first century” and the subsequent realization that a significant portion of the work involves mundane tasks like data cleaning. 

In a 2012 study, HBR characterized the 5 different jobs a data scientist has to do: (1) discovery, (2) wrangling, (3) profiling, (4) modeling, and (5) reporting. Unfortunately, the first three are where data scientists spend most of their time, and this isn’t what data scientists want to do. 

Similarly, a 2014 New York Times article identified a key issue with data science, which it termed “janitor work.”

Janitor work describes the less glamorous side of data science: tasks like data wrangling, data collection, and data cleaning, which occupy a significant portion of data scientists’ time (50% to 80%).

The irony is that data engineering, which encompasses a lot of the less glamorous aspects, is now hailed as the new cool job.

2. Data engineers all get fired because they cost too much 

The second potential failure scenario involves data engineers getting fired because they cost too much. Stancil points out that data engineers often earn more than software engineers in San Francisco, and that the additional expense of buying costly tools can make the role hard to justify economically.

He supports this by showcasing examples from data conferences where companies reveal the expenses associated with their data stacks.

The first example is a Miro slide deck outlining the tools in the company’s current data stack.

Stancil narrows the list to the tools you would actually have to pay for and that are strictly data tools, setting aside things like GitHub, which might be shared with an engineering team. These tools probably cost roughly $50,000 to $100,000 a year. That’s on top of the salary of a data engineer.

The next slide comes from Conde Nast. 

If you’re a much bigger enterprise, you have more tools. For these tools you’d probably be paying around $250,000 to $1 million per year. And that’s not cheap.
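To make the arithmetic concrete, the sketch below adds the tool-cost ranges quoted above to payroll. The salary figure and headcount are placeholder assumptions, not numbers from the talk, so substitute your own.

```python
# Rough annual cost of a data team: salaries plus paid data tools.
# The tool ranges are the estimates quoted above. ASSUMED_SALARY and the
# headcount are hypothetical placeholders, not figures from the talk.

ASSUMED_SALARY = 200_000  # hypothetical yearly cost of one data engineer (USD)

TOOL_COST_RANGES = {
    "first example stack": (50_000, 100_000),
    "larger enterprise stack": (250_000, 1_000_000),
}


def annual_cost_range(tool_range, engineers, salary=ASSUMED_SALARY):
    """Return (low, high) total yearly cost for the given tools and headcount."""
    low_tools, high_tools = tool_range
    payroll = engineers * salary
    return payroll + low_tools, payroll + high_tools


if __name__ == "__main__":
    for name, tool_range in TOOL_COST_RANGES.items():
        low, high = annual_cost_range(tool_range, engineers=1)
        print(f"{name}: ${low:,} to ${high:,} per year with one engineer")
```

Changing `engineers` or `salary` shows how quickly payroll, rather than tooling, dominates the total once a team has more than a few people.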

In 2023, the year of efficiency, organizations may start thinking twice about spending so much on both tools and employees.

3. Data engineering isn’t valuable enough

Moving on, Stancil questions the actual value provided by data engineering. He explores the notion that data engineering might fail because, despite the investment, the work done doesn’t prove to be as useful as expected. Self-serve analytics, which was once considered a game-changer, is criticized as possibly not delivering the expected results.

Additionally, in 2013, Forrester said that real-time architectures would become more prominent.

Except we’ve been a year away from delivering on the promise of real-time analytics for the last decade.

You can trace a similar timeline for other capabilities that data engineering is supposed to deliver.

Predictive analytics

Predictive analytics has the same story: it was going to be operational in 2014. We’ve been saying for eight or nine years that this is going to happen, and we’re still not quite there.

Machine learning and AI

For machine learning and AI, the timestamps tell the same story: we’ve developed technology that we say is going to be really valuable, but it’s just not quite here yet.

Headlines urge legacy companies to become more data-driven and to do this very quickly. And yet, again, we never quite seem to get there.

If data engineers are asked what they do, they’d say: “We spend a lot of money and create a little bit of value.” 

Stancil argues that if we want data engineering not to fail, we need a better answer than that.

4. Data engineers get replaced by tools

The fourth point considers the potential replacement of data engineers by tools.

Stancil reflects on the evolution of data stacks and how contemporary tools can substantially reduce the need for a large number of data engineers. The trend of increasing automation and the availability of off-the-shelf solutions may lead to a reduced demand for traditional data engineering roles.

As an example, Stancil points to his first tech role, at Yammer. The team there built a data stack that looked similar to the modern data stack, with about 10 data engineers supporting the system for a company of four to five hundred people.

Now rethink that data stack as it would look today. The company would buy Fivetran to manage its pipelines and replace both of the databases with a tool like Snowflake or, given where we are today, something like Starburst.

It would then use a transformation tool such as dbt to run the transformations. All of this would take about one or two data engineers.

5. Decentralization and the rise of data mesh

Stancil then discusses the decentralization trend in data management, as highlighted by concepts like data mesh and data contracts. 

If this trend continues, data engineering might not disappear, but data engineers may own fewer business problems. Instead, they would be expected to assume a more supportive function for the teams that directly manage data processes.

In this scenario data engineering doesn’t fail outright, but it transforms into a role resembling infrastructure maintenance, akin to a Database Administrator (DBA) or an IT team.

Despite being crucial, this evolution implies a decrease in autonomy and strategic involvement for data engineers, aligning with the broader industry trend of decentralization. If this trend persists, more responsibilities may be transferred to business partners rather than being retained within the data engineering team.

6. Data engineers are automated by AI

Lastly, Stancil explores the possibility of data engineering being automated away by AI. He introduces the idea that if AI-driven systems become the primary consumers of data instead of humans, data formats may shift towards being machine-readable rather than human-readable. This shift could potentially render certain data engineering tasks obsolete.
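As a rough illustration of that distinction, the sketch below renders the same records two ways: as an aligned text table meant for a person, and as JSON Lines meant for a program. The records and field names are invented for illustration and do not come from the talk.

```python
import json

# Hypothetical records; field names and values are invented for illustration.
ORDERS = [
    {"order_id": 1, "region": "EMEA", "revenue_usd": 1250.0},
    {"order_id": 2, "region": "APAC", "revenue_usd": 980.5},
]


def human_readable(rows):
    """Format rows as an aligned text table meant for a person to scan."""
    header = f"{'Order':<6}{'Region':<8}{'Revenue':>12}"
    lines = [header, "-" * len(header)]
    for row in rows:
        revenue = "$" + format(row["revenue_usd"], ",.2f")
        lines.append(f"{row['order_id']:<6}{row['region']:<8}{revenue:>12}")
    return "\n".join(lines)


def machine_readable(rows):
    """Serialize rows as JSON Lines: one self-describing object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


if __name__ == "__main__":
    print(human_readable(ORDERS))
    print()
    print(machine_readable(ORDERS))
```

The table drops the field names and bakes in presentation choices such as currency symbols and padding, while the JSON Lines output keeps typed values under explicit keys that a program can parse back without guessing.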

The need for creativity in data engineering

Amidst these potential shifts, we need creativity in addressing the evolving landscape. Drawing parallels to movie studios stuck in remakes, the data engineering community is urged to break free from repetitive problem-solving and focus on understanding and solving real-world problems.

Beyond technology: a human-centric approach

The value of data engineering lies not in technological advancements alone but in understanding the problems faced by the people for whom solutions are crafted. Emphasizing a shift from technology-centric approaches to problem-centric ones, the call is to listen less to data professionals and more to the concerns and needs of the broader community.

The future of data engineering and AI

The future of data engineering in the age of AI demands a shift in perspective. Rather than fixating on technological solutions, the focus should be on empathetic problem-solving, aligning with the challenges faced by various departments. By listening more to the broader community and fostering creativity, data engineers can navigate the evolving landscape successfully.

Here are a few blogs Stancil says you should read:
