5 ways to simplify ETL using SQL

October 31, 2024

Evan Smith

Technical Content Manager

Starburst Data

Using SQL for ETL offers many advantages. To understand why, it helps to look at the ETL process more broadly. Extract, Transform, Load (ETL) is the dominant data processing model for data pipelines. Typically, data engineers construct ETL pipelines using code to manage the movement and transformation of data. You can write this code using various languages and frameworks, including SQL, Python, and Apache Spark. Today, performing ETL using SQL is an increasingly popular practice. In each case, the ETL process has three stages:

First, ETL extracts data from multiple source systems.
Next, it transforms the data structure.
Finally, it loads the transformed data into a target analytics system.

Ultimately, ETL enables data scientists to perform data analysis by powering aggregate queries, data analysis tools, or AI models.

This article will help you understand the ETL process from the inside out. Specifically, it will show you how to simplify ETL using SQL instead of Spark or Python. Throughout, I will argue that SQL is the most effective way to manage a data pipeline and the ETL process within it.

What is ETL?

First, it is essential to understand that ETL is a process, not an end state. This process transforms complex data from its raw form through intermediate stages to a finished state. From here, data analytics tools, including Business Intelligence (BI) dashboards and queries, can analyze it. Increasingly, ETL can also be used to feed AI models, and this use case is likely to grow over time.

Overall, ETL consists of three key stages:

Extract data using ETL

ETL begins with data extraction from source systems. Data enters an ETL pipeline from various sources, including data lakehouses, data lakes, data warehouses, databases, APIs, cloud platforms, and devices that generate sensor data. Importantly, source data often varies in structure.

Transform data using ETL

Next, the data pipeline has to transform the data. Without this step, disparate data structures cannot be analyzed properly. This multi-stage process typically includes:

Data validation to ensure data accuracy and adherence to defined rules.
Data cleansing to remove errors and standardize the dataset.
Data deduplication to eliminate duplicate records and keep data consistent.
Data enrichment to enhance datasets by adding valuable external information.
Data transformation to convert data into a format suitable for analysis.
Data integration to combine data from various sources into a unified data structure suitable for comprehensive analysis.
Data filtering to discard irrelevant or unnecessary data before loading it into the target system.

By the end of the transformation process, raw data will be converted from various data types into a single, usable format that fits the structure required for consumption.
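Several of the transformation steps above can be sketched in a single SQL statement. The example below is a minimal illustration, not a production pipeline: it runs the SQL through Python's built-in sqlite3 module so it is self-contained, and the `raw_orders` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows, standing in for data extracted from a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, email TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "  Ana@Example.com ", "19.99"),
        (1, "  Ana@Example.com ", "19.99"),  # exact duplicate
        (2, "bo@example.com", "-5.00"),      # fails validation: negative amount
        (3, "cy@example.com", "42.50"),
    ],
)

transform_sql = """
SELECT DISTINCT                           -- deduplication
       order_id,
       lower(trim(email))   AS email,     -- cleansing
       CAST(amount AS REAL) AS amount     -- conversion to an analysis-friendly type
FROM raw_orders
WHERE CAST(amount AS REAL) > 0            -- validation and filtering
ORDER BY order_id
"""
clean_rows = conn.execute(transform_sql).fetchall()
print(clean_rows)
```

The duplicate and the invalid row are dropped, and the remaining rows come back trimmed, lower-cased, and typed.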

Load data using ETL

Finally, the data pipeline loads the transformed data into a target system. From here, it can be queried and used for data analytics dashboards or AI/ML use cases. Importantly, the target system itself varies. It could be a data lakehouse, data lake, or data warehouse.
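As a rough sketch of the load stage, the example below copies transformed rows from a staging table into a target table with one `INSERT ... SELECT` statement, then runs an aggregate query against the target. It uses Python's built-in sqlite3 module so it runs anywhere; the `staging_sales` and `sales` tables are hypothetical, and a real target would be a lakehouse, lake, or warehouse table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table holding rows that were already transformed.
conn.execute("CREATE TABLE staging_sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Hypothetical target table that dashboards would query.
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")

# Load step: move the transformed rows into the target in one statement.
with conn:  # commits on success, rolls back if the insert fails
    conn.execute(
        "INSERT INTO sales (region, amount) SELECT region, amount FROM staging_sales"
    )

# Once loaded, the data can power an aggregate query.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```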

Why is ETL important?

ETL is the core of any data analytics architecture. Data engineers use ETL pipelines, together with connectors, to access data in source systems. In this sense, ETL is a data integration activity that helps manage data quality across large volumes of data.

How ETL helps businesses

A strong ETL pipeline has a direct impact on business outcomes. Without it, analysts cannot analyze data from various source systems and data types together. Without analysis, data has no value. It cannot drive business insights or decision-making.

How to manage ETL

Traditionally, ETL pipelines have been centered around data warehouses. The advent of data lakehouses has disrupted this paradigm, offering a more flexible, scalable, and cost-efficient alternative, particularly when using relational databases.

Let’s review some of the differences between an ETL pipeline in a data warehouse versus a data lakehouse.

Using ETL in a data warehouse

Data warehouses have historically relied on ETL pipelines, whether accessing data on-premises or in the cloud. These systems manage the flow of structured data, bringing together different types of data from various sources. In data warehousing, ETL pipelines are usually optimized to achieve two key goals:

Centralization
Structure

All data warehouses operate using these two principles, and all data is extracted, transformed, and loaded using this data management methodology.

Traditional data warehouse systems typically run on-premises. Newer cloud data warehouses and query services, such as Snowflake, Amazon Athena, or Google BigQuery, take a similar approach but also utilize cloud technology.

Schema on write

Additionally, both on-premises and cloud data warehouses use a schema-on-write process. Using this approach, transformations are performed before the data is loaded into the warehouse.

This can lead to high upfront costs and time delays as data needs to undergo a rigorous ETL process before the data can enter the warehouse. This approach often causes older on-premises warehouses to struggle with unstructured or semi-structured data.
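To make schema-on-write concrete, here is a small sketch in which the table's schema and constraints are declared before any data arrives, so rows that do not fit are rejected at write time. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the `readings` table, its temperature bounds, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front; constraints are enforced when rows are written.
conn.execute(
    """
    CREATE TABLE readings (
        sensor_id   INTEGER NOT NULL,
        temperature REAL    NOT NULL CHECK (temperature BETWEEN -90 AND 60)
    )
    """
)

incoming = [(1, 21.5), (2, None), (3, 999.0)]  # hypothetical source rows
rejected = []
for row in incoming:
    try:
        conn.execute("INSERT INTO readings VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        # The row violates the schema, so it never reaches the table.
        rejected.append(row)

accepted = conn.execute("SELECT sensor_id, temperature FROM readings").fetchall()
print(accepted, rejected)
```

Only the first row lands in the table. The other two must be fixed upstream before they can be loaded, which is where the upfront cost of this approach comes from.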

Using ETL in a data lakehouse or data lake

In contrast, data lakehouses approach ETL in a different manner. Data lakes and data lakehouses also use ETL pipelines to extract, transform, and load data from different sources. These systems typically utilize structured, semi-structured, and unstructured data.

Instead of using a schema-on-write approach, data lakehouses employ a schema-on-read approach. This allows data of any structure to be added in its raw form. This approach is sometimes referred to as ELT (Extract, Load, Transform) because the data is loaded first and transformed only when it is read. In practice, many data lakehouses store complex datasets comprising structured, semi-structured, and unstructured data.

Using schema-on-read, transformation can occur at the query stage, allowing for more flexibility. This facilitates a more decentralized approach, using metadata to manage data until transformation is needed.
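One way to picture schema-on-read is a raw table that accepts everything as text, with types applied by a view only when the data is queried. The sketch below uses Python's built-in sqlite3 module; the `raw_readings` table, the `readings` view, and the simple numeric check are hypothetical, and a lakehouse engine would apply the same idea to files in object storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw landing table: every field is kept as text, exactly as it arrived.
conn.execute("CREATE TABLE raw_readings (sensor_id TEXT, temperature TEXT)")
conn.executemany(
    "INSERT INTO raw_readings VALUES (?, ?)",
    [("1", "21.5"), ("2", "n/a"), ("3", "19.0")],
)

# The schema is applied only at read time, here through a view.
conn.execute(
    """
    CREATE VIEW readings AS
    SELECT CAST(r.sensor_id AS INTEGER) AS sensor_id,
           CAST(r.temperature AS REAL)  AS temperature
    FROM raw_readings AS r
    WHERE r.temperature GLOB '[0-9]*'   -- skip values that do not start with a digit
    """
)

typed = conn.execute(
    "SELECT sensor_id, temperature FROM readings ORDER BY sensor_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_readings").fetchone()[0]
print(typed, raw_count)
```

Nothing is rejected on the way in: all three raw rows are stored, and the malformed value is filtered out only when the view is read.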

Instead of moving data to a centralized location, you can query it in place. This approach directly reduces the complexity and cost of traditional ETL processes.

Languages commonly used for ETL

Several languages and frameworks are typically used to manage an ETL pipeline, including SQL, Apache Spark, and Python. Languages like Scala, R, and Java are also used. Out of these options, Spark and Python are very popular, followed by SQL.

4 reasons why using SQL for ETL makes sense

Although Spark and Python are commonly used to manage ETL pipelines, using SQL often makes the most sense for several reasons.

1. Using SQL for ETL is easy

SQL is an easy language to learn, with a declarative, intuitive syntax that reads close to plain English. For this reason, managing a data pipeline using SQL queries takes some of the complexity out of the process.

2. SQL is everywhere

SQL is the standard query language used across big data. It is used by data engineers, data scientists, and others. For this reason, using it to manage a data pipeline makes the process more accessible to a wider audience.

3. SQL is widely compatible

SQL is already used in other parts of the data world, particularly for queries. Using SQL in a data pipeline means that a single language is used across the entire process, so query skills and code carry over to pipeline work.

4. SQL is performant

Starburst Galaxy makes using SQL to run a data pipeline an easy and performant option. Instead of operating an ETL data pipeline using Spark or Python, you can streamline the process using SQL to access multiple data sources at once. Although Spark and Python are powerful, they require more manual coding, and SQL is typically much easier to maintain and optimize.

Starburst Galaxy also supports real-time data processing through real-time data ingestion. This approach greatly simplifies the task of adding data to the ETL pipeline and streamlines the extraction process, improving efficiency.

5 SQL best practices for your ETL process

Ready to construct an ETL data pipeline using Starburst Galaxy and SQL? Here are 5 best practices to help get you started.

1) Employ modularity

First, SQL works best when you break each workload into smaller pieces. As in many other programming languages, modularity helps you gain efficiencies and keep each unit of code focused on a single task. Using this approach, it’s easier to complete complex ETL workloads by combining multiple smaller SQL units, such as views or common table expressions (CTEs), into a single workload. Think of this as SQL’s nod to modular programming.
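One common way to get this modularity in plain SQL is the `WITH` clause, which names each intermediate step as a common table expression (CTE). The sketch below is an illustration only: it runs through Python's built-in sqlite3 module, and the `orders` table and the 50-unit threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ana", "paid", 40.0),
        ("ana", "paid", 60.0),
        ("bo", "cancelled", 500.0),
        ("bo", "paid", 20.0),
    ],
)

modular_sql = """
WITH paid_orders AS (          -- step 1: filtering only
    SELECT customer, amount
    FROM orders
    WHERE status = 'paid'
),
customer_totals AS (           -- step 2: aggregation only
    SELECT customer, SUM(amount) AS total
    FROM paid_orders
    GROUP BY customer
)
SELECT customer, total         -- step 3: final selection
FROM customer_totals
WHERE total >= 50
ORDER BY customer
"""
big_spenders = conn.execute(modular_sql).fetchall()
print(big_spenders)
```

Each named step does one job, so a step can be tested or changed without touching the others.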

2) Use windowing functions

Next, consider using a windowing function. A SQL windowing function performs a calculation across a given series of rows. These rows are known as the window. Windowing functions are powerful because they allow you to perform a series of operations across a number of rows while maintaining access to individual rows at the same time. They are particularly useful for calculating things like rolling averages and running totals.
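A running total and a rolling average can be sketched with the standard `OVER` clause. The example below runs through Python's built-in sqlite3 module, which assumes the bundled SQLite is version 3.25 or later (the first release with window functions); the `daily_sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2024-10-01", 10.0),
        ("2024-10-02", 20.0),
        ("2024-10-03", 30.0),
        ("2024-10-04", 40.0),
    ],
)

window_sql = """
SELECT day,
       amount,
       -- running total: everything up to and including the current row
       SUM(amount) OVER (ORDER BY day) AS running_total,
       -- rolling average: the current row plus the two rows before it
       AVG(amount) OVER (
           ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM daily_sales
ORDER BY day
"""
rows = conn.execute(window_sql).fetchall()
for row in rows:
    print(row)
```

Every input row is still present in the output, each one annotated with values computed across its window.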

3) Federate your data sources

Additionally, Starburst Galaxy allows you to connect multiple data sources and access them as if they were a single data source using SQL. This process is known as data federation (or query federation), and it is a very powerful tool. It is particularly useful when offloading workloads from expensive data warehouses to less expensive data lakehouses. Starburst Enterprise can also connect to cloud and on-premises data sources.
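The idea of one SQL statement spanning several sources can be imitated locally. The sketch below is only an analogy and does not use Starburst: it attaches a second SQLite database to a connection through Python's built-in sqlite3 module and joins across the two. The `crm` database and both tables are hypothetical. In a Trino-based engine, the equivalent query would typically address each source by a fully qualified `catalog.schema.table` name.

```python
import sqlite3

# "main" stands in for one data source...
conn = sqlite3.connect(":memory:")
# ...and a second, separate database stands in for another source.
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)", [(1, 30.0), (2, 70.0), (1, 5.0)]
)
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "ana"), (2, "bo")]
)

# A single query joins tables that live in two different databases.
federated_sql = """
SELECT c.name, SUM(o.amount) AS total
FROM main.orders AS o
JOIN crm.customers AS c ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY c.name
"""
totals_by_customer = conn.execute(federated_sql).fetchall()
print(totals_by_customer)
```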

4) Optimize queries for partitioning and indexing

Next, SQL can also be used to partition and index your data. Both partitioning and indexing reduce the amount of data that a query needs to scan, which reduces the work required and makes it easier to perform ETL efficiently. This is particularly useful with very large datasets.
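The effect of an index on a filtered query can be seen in a small sketch. The example below uses Python's built-in sqlite3 module, with a hypothetical `events` table and index name. SQLite has no table partitioning, so only indexing is shown; in a partitioned lakehouse table, a filter on the partition column plays a similar role by letting the engine skip whole partitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2024-10-01", 5.0), ("2024-10-02", 7.0), ("2024-10-02", 3.0)],
)

# Index the column that queries filter on most often.
conn.execute("CREATE INDEX idx_events_day ON events (event_day)")

query = "SELECT SUM(amount) FROM events WHERE event_day = '2024-10-02'"

# EXPLAIN QUERY PLAN reports how SQLite intends to read the table;
# the last column of each plan row is a human-readable description.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
total = conn.execute(query).fetchone()[0]
print(plan)
print(total)
```

The plan text names `idx_events_day`, which indicates the query looks up matching rows through the index instead of scanning the whole table.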

5) Employ SQL scheduling automation

Finally, SQL-based data pipelines can be scheduled and automated. This approach works best with Starburst Galaxy. By using scheduling and orchestration features, you can automate ETL processes to run at regular intervals or trigger based on events, ensuring that your data is always fresh and ready for analysis. It’s also possible to use third-party tools, such as dbt, in conjunction with Starburst Galaxy.

Choosing the right way to manage ETL using SQL

Starburst Galaxy is not just a query engine; it is an ETL tool as well. There are a few reasons for this. First, it allows you to use SQL, the ubiquitous language of data science, to extract, transform, and load data from multiple sources. Second, it excels when used with data lakes and data lakehouses, especially those using a Starburst Iceberg Icehouse architecture, but it can also be used with data warehouses. Most of all, it offers simplicity and ease of use. Data pipelines are complex enough without adding additional complexity. Overall, Starburst Galaxy helps reduce complexity while preserving power.

Advantages of using Starburst Galaxy to manage ETL workflows using SQL

Overall, Starburst Galaxy offers the following key advantages:

Scalability: Starburst Galaxy’s cloud-native architecture automatically scales to meet growing data volumes without sacrificing performance.
Cost Effectiveness: Starburst decouples storage and compute, so you only pay for the compute resources you use. This pricing model significantly reduces costs compared to traditional data warehouses.
Open Architecture: Built on open-source technologies like Trino and Apache Iceberg, Starburst ensures flexibility and avoids vendor lock-in, allowing you to future-proof your data strategy.

Looking to boil this all down? Here are the 5 ways to simplify ETL using SQL:

1. Employ modularity
2. Use windowing functions
3. Federate your data sources
4. Optimize queries for partitioning and indexing
5. Employ SQL scheduling automation

Want to learn more about ETL and Starburst Galaxy? Starburst Academy has a free course on ETL and data pipelines. Sign up today!
