Current State of ETL/ELT
Extract-transform-load, more commonly known by its street name “ETL”, has been around since the early days of computing. Bringing together data from disparate sources has served many purposes including optimizing data storage costs, enabling analytical queries on large datasets, or serving integration purposes, to name a few.
To implement ETL/ELT, IT organizations turn to the wizardry of their data engineers and their tool of choice over the last decade has often been Apache Spark, custom Java, Scala, or Python code, and a lot of computational horsepower to implement bespoke data pipelines. Though potentially a quick and immediate win, increasing an enterprise’s custom codebase invariably introduces additional challenges such as maintenance, lock-in to specific programming languages or vendors, and ultimately increasing the time-to-value for business initiatives. According to a major industry study, on average, 44 percent of [a data engineer’s] time is wasted building and rebuilding data pipelines, which connect data lakes and warehouses with databases and applications … Companies are paying large sums to reach these bad outcomes. Data and analytics leaders report a median of 12 data engineers earning an average of $98,400 a year each. Spending 44 percent of their time away from working on advanced models and analytics adds up to $520,000 a year.
To make matters worse, many of these maintenance tasks often end up being rather trivial. For example, data engineers might spend their time triaging tickets to cascade simple data model changes (such as fields that have been added or removed in respective source and target systems) just to keep the data pipelines operational and synchronized. When considering the average enterprise runs a sprawling number of applications – typically on the order of 364+ – the realm of potential impacts on the data model becomes even more dizzying. Logic changes could require even more effort, and we haven’t even accounted for regression testing and validation efforts associated with a release yet.
What if there was a happy medium that enabled enterprises to declaratively state their ETL logic in business terms while giving data engineers the same level of control afforded by custom code?
SQL + ETL = a match made in heaven?
One of the most popular and widely adopted means of interacting with data has been and continues to be Structured Query Language (SQL). Stack Overflow’s 2021 Developer survey found that SQL remains in the top 3 most popular programming languages among professional developers (50.73% of respondents used SQL in the previous year and will continue to do so in the next). When assessing annual developer salaries, SQL is much more “middle of the pack” ($56,228) – nearly half of what enterprises spend on a data engineer. A SQL-centric data pipeline not only cuts development time but also saves on costly engineering talent.
When considering the declarative nature of SQL in the context of ETL, data engineers can focus more on implementing logic, less on writing boilerplate code, and can leave the “heavy lifting” to the underlying engine. SQL tends to be far more maintainable than custom code, especially since there are more people in the industry, both technical and business-oriented, who can understand and write SQL statements. As a result of a shorter ramp-up time and improved developer experience, data-driven organizations can alleviate the maintenance burden of bespoke data pipelines.
While the future may be uncertain, two things are for sure – custom code isn’t going away, and neither is SQL – the lingua franca of data.
Aside from drag-and-drop ETL/ELT tools, the open-source Apache Spark has gained widespread adoption over the past decade for this use case and includes a SQL interface, known as Spark SQL. A common data engineering workflow to write Spark-based data pipelines may involve a few steps. First, data engineers might use an MPP query engine, such as Starburst or Trino, to prototype their ETL/ELT logic. After successfully validating their ETL/ELT SQL statements, the engineer must then undergo the arduous and error-prone task of wrapping their simple SQL syntax within a Spark job using Python, Scala, or Java. These Spark SQL queries may then need to be further optimized to ensure high performance. Altogether, these steps delay the business’s time to insight and increase time to value.
Transform Your Developer Experience: Starburst for ETL
In the past, the argument could have been made that the rewrite effort was worth the effort due to Spark’s mid-query recovery model that enabled fault tolerance. Previously, Trino/Starburst consciously made the design tradeoff of choosing speed at all costs over fault tolerance, which led to a coarse-grained recovery model in which a single machine failure resulted in a query failure. In many cases, Starburst makes ETL/ELT optional given our engine’s query federation capabilities and MPP architecture.
Today, we’re pleased to announce that the full lifecycle of these ETL/ELT workloads – from design to implementation – can be run on Starburst with fault tolerance as a result of Project Tardigrade. The implementation uses full query retries and granular task level retries, both of which ensure reliable results for long-running batch/ETL workloads from a macro and micro lens respectively. Now, developers can increase the complexity of ETL/ELT SQL statements by orders of magnitude without worrying about failures after a long execution period. Similar to interactive SQL workloads run on Starburst, batch/ETL workloads automatically benefit from Starburst’s powerful, enterprise-grade query optimizations without requiring manual, hand-tuning of SQL queries. Complex transformations can be plugged into Starburst as well in the form of User-Defined Functions (UDFs). Starburst has taken care of these optimizations and more, so you don’t have to. Data engineers will no longer have to spend their precious time on duplicative, error-prone tasks and are now free to focus on value-add efforts instead.
CREATE TABLE lake.burst_bank_v2.customer_accounts AS
SELECT c.first_name, c.last_name, c.estimated_income,
a.products, a.cc_number, a.mortgage_id
FROM mongodb.burst_bank_large.customer c
JOIN mysql.burst_bank_large.account a ON a.custkey = c.custkey
Above: Simple ETL SQL statement using Starburst connectors to extract and transform data from operational non-relational and relational stores then load to a data lake.
Optionality is the name of the game with Starburst. By default, fault tolerance will not be enabled and can be configured in the cluster settings. When considering the deployment topology of Starburst in a customer’s VPC or data center, a best practice is to separate ad-hoc interactive workloads and batch/ETL workloads into their own dedicated clusters. The separation of concerns enables adherence to strict SLAs for interactive workloads and business intelligence use cases that will not be compromised by long-running ETL/ELT executions.
Need for Speed
So what about the performance impact of introducing query fault tolerance for long-running ELT/ETL workloads? After running comprehensive TPC-H and TPC-DS ETL benchmarks, we found that query fault tolerance makes what was previously impossible, now possible. Seven of the queries (q1, q3, q5, q9, q13, q18,q21) failed to complete in the original mode but ran successfully with the Tardigrade enhancements. Similarly, the benchmark demonstrates that the performance tradeoff is negligible given the relative parity of results between the Tardigrade and original modes of execution.
Above: Wall Clock times of TPC-H ETL (10 TB) benchmark. “Tardigrade” indicates query fault tolerance settings enabled.
Let’s double click on query 9 – an example query that failed with an out-of-memory (OOM) exception in the original streaming mode. The available memory was purposefully limited to 61.6 GB per node. In contrast, the same query can successfully run on identical hardware when query fault-tolerance was enabled and completed in 13m 50s in the resource-constrained environment. Instead of spending countless hours of a data engineer’s valuable time tuning queries and hardware, a few simple cluster configuration settings can enable your team to focus on creating differentiating value for the business.
Above: TPCH-ETL 10TB Query 9 syntax
In addition to being optimized for the desired batch/ETL workload, the Starburst ETL cluster can leverage the power of Starburst Stargate* to localize compute close to the center of data gravity to minimize cloud egress charges and stay true to the single point of access paradigm by allowing data engineers to invoke the remote Starburst ETL cluster from their “home base” — the interactive Starburst cluster that they already know and love. Infrastructure teams can rationalize their technology stacks too and focus on running use case optimized Starburst clusters instead of supporting a smorgasbord of big data frameworks. Better yet, companies can offload their infrastructure management entirely to Starburst with our fully-managed enterprise-grade SaaS platform, Starburst Galaxy. Data engineers and data analysts now have a one-stop shop for their respective data pipeline requirements and interactive analytical needs.
Are you building or patching a data pipeline, ETL/ELT, or Spark job soon? Consider taking Starburst for a spin and you’ll be in good company. Ready to take back control of your data stack? Reach out to your Starburst expert today to access the Starburst Enterprise limited preview. You can also enable batch clusters with a click of a button in Starburst Galaxy today. Get started right now!