
Batch processing, Iceberg, and Delta Lake for Starburst Galaxy


Evan Smith
Technical Content Manager
Starburst Data
ETL is the engine behind the data pipeline process, moving data from a raw state to a consumable one. Let’s unpack how this typically operates in a modern data lake or data lakehouse. Later, we’ll take a tour to see how Starburst Galaxy fits in this picture and how it can be used to construct the Land, Structure, and Consume layers typical of a modern data lake.
Data pipelines feed modern data lakes, also known as data lakehouses, in a particular way. To understand how these technologies work, it’s important to understand what makes these data lakes unique compared to other storage systems. Additionally, it’s important to understand how data lakehouse table formats like Apache Iceberg, Delta Lake, and Hudi play a key role in this ecosystem.
Traditionally, data lakes operate using cloud object storage, allowing them to hold structured data, semi-structured data, and unstructured data together in a single repository. Importantly, most data lakes also use the Hive table format, an older technology still widely used in many data lakes today. Hive allows data lakes to be queried using SQL but does not allow for any of the advanced features of modern data lakes, also known as data lakehouses.
Traditional data lakes can be constructed using either a two-part or three-part data pipeline architecture.
The two-part architecture has the following layers:
Land layer: First, data lands in its raw state.
Consume layer: Finally, data is transformed into a finished, consumable state.
The three-part architecture adds an intermediate layer:
Land layer: First, data lands in its raw state.
Structure layer: Second, data is transformed into an intermediate state.
Consume layer: Finally, data is transformed into a finished, consumable state in a third layer.
Another important development is the open data lakehouse. This approach uses modern table formats, allowing for features not possible using Hive.
These include ACID transactions, schema evolution, partition evolution, and time travel.
Modern data lakes are typically constructed using the three-part data pipeline architecture.
Both traditional and modern data lakes make use of a variety of file formats. These control how the data is stored in the lake.
Traditional data lake file formats include:
Modern data lake file formats include:
Let’s unpack the concept of ETL pipelines and understand how they operate at the heart of data pipelines. Data begins in source systems, and you can think of these systems as the start of the data pipeline. These systems vary but generally conform to the following pattern.
Source data originates from multiple sources, including:
Like all data lakes, data in the land layer arrives in multiple structures.
Data in the land layer is always stored in the same format in which it is ingested. This preserves a record of the data in its original form, before any transformations or cleansing takes place. If there is ever a need to return to this original state, this copy is a valuable resource.
You can think of this as similar to a photographer shooting photos in the RAW photo format. The RAW format includes all available information from the sensor, even information that may not actually be used. Nonetheless, the RAW format is a good choice for photography because post-production editing might make use of that information at a later date.
In a similar way, the land layer preserves the maximum amount of information, even if it is not needed. This ensures that all available data is retained for future use.
Data is typically copied to the land layer using one of two methods: batch ingestion or streaming ingestion.
Batch ingestion is a classic approach that involves copying a group of files at a regularly scheduled interval or after a trigger condition has been reached.
For example, a batch-ingested workflow may add new data to the land layer every four hours, every hour, or every fifteen minutes.
The alternative to batch ingestion is streaming ingestion. Streaming captures new data in real time. To do this, data is divided into tiny micro-batches that are processed continuously as they arrive.
Streaming is typically handled by technologies that specialize in ingestion, including:
Apache Kafka
Kafka is a messaging platform used as an intermediary between two systems, allowing the source system to transfer data to the Land layer reliably. Starburst Galaxy and Starburst Enterprise both integrate with Kafka.
Apache Flink
Flink is a streaming platform designed to facilitate data ingestion. Unlike Kafka, it is not considered a messaging platform. It can also be used for batch processing, making it a versatile tool. Starburst Galaxy and Starburst Enterprise both integrate with Flink.
Apache Spark
Spark is a general-purpose framework that allows for custom data ingestion tasks. Importantly, it is not a streaming or messaging platform; it is closer to a general-purpose scripting framework. In that sense, it is powerful and customizable, but also technical and complex. Many of the world's data pipeline ingestion workflows run on Spark.
| | Advantages | Disadvantages |
| --- | --- | --- |
| Batch ingestion | Simple to schedule and operate; efficient for large volumes of data | Data is only as fresh as the most recent batch, so there is latency between ingestion and availability |
| Streaming ingestion | Data is available in near real time | More complex to build and operate; requires dedicated streaming infrastructure |
When data is transformed from the Land layer into a new location, that new location is called the Structure layer.
In a data lakehouse, the Structure layer is the source of truth for all operations. Importantly, most of the major transformations occur between the Land layer and Structure layer.
The exact nature of these transformations depends on the datasets and structures involved, but they typically include the following steps.
Data from multiple locations usually arrives in different structures. Normalization harmonizes these differences into a common structure. To do this, divergent fields must be mapped to a common representation.
For example, the date format from one organization might conform to the mm-dd-yyyy structure, while the date format from another organization might conform to the dd-mm-yyyy structure. Or a Customer ID from one system might include a different number of digits than the Customer ID from another.
If you want to query data from both datasets together, you need to normalize the data into a single structure with defined rules.
In the modern data lake, this typically happens after the data is queried using schema-on-read. Starburst Galaxy and Starburst Enterprise both allow you to use SQL to enact these transformations, including INSERT INTO statements that apply the normalization. The exact degree of normalization depends on the difference in structure between the datasets involved.
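To make this concrete, here is a minimal sketch of what such a normalization step might look like in SQL. The catalog, table, and column names (land.orders_source_a, structure.customer_orders, and so on) are hypothetical placeholders chosen for illustration, not part of any real schema.

```sql
-- Minimal, hypothetical sketch: normalize two land-layer tables that use
-- different date formats and customer ID widths into one structure-layer table.
INSERT INTO structure.customer_orders
SELECT
    lpad(customer_id, 10, '0')         AS customer_id, -- pad IDs to a common width
    date_parse(order_date, '%m-%d-%Y') AS order_date,  -- source A uses mm-dd-yyyy
    order_total
FROM land.orders_source_a
UNION ALL
SELECT
    lpad(customer_id, 10, '0')         AS customer_id,
    date_parse(order_date, '%d-%m-%Y') AS order_date,  -- source B uses dd-mm-yyyy
    order_total
FROM land.orders_source_b;
```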
Data validation is also used as a tool in the Structure layer. This step involves checking that all data conforms to the agreed normalization transformations and sets down rules for cleansing any data that does not conform.
Data validation also uses SQL, and both Starburst Galaxy and Starburst Enterprise make it easy to perform this step in a data pipeline. To do this, conditional IF expressions are written to check for certain conditions using boolean logic: if the conditions are true, the data is valid; if they are false, it is not.
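As a rough illustration, the sketch below flags invalid rows in the same hypothetical structure.customer_orders table used earlier; the specific rules are invented for the example.

```sql
-- Hypothetical validation rules: IDs must be present and ten characters long,
-- and order totals must not be negative.
SELECT
    customer_id,
    order_date,
    order_total,
    if(
        customer_id IS NOT NULL
            AND length(customer_id) = 10
            AND order_total >= 0,
        'valid',
        'invalid'
    ) AS validation_status
FROM structure.customer_orders;
```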
The Structure layer also involves data enrichment. This occurs when a dataset's schema needs to be widened with additional columns to accommodate additional data. This might happen if a schema evolves beyond its initial architecture or if a new column is being added for other reasons.
For example, a customer field might include an additional column for credit scores only when such a check is necessary. This addition would involve enrichment. Notably, enriched data is often taken from outside systems beyond the Land layer.
Like the other transformations in the Structure layer, enrichment can be achieved using SQL in Starburst Galaxy or Starburst Enterprise. To do this, the business logic controlling the enrichment must be translated into SQL code, often involving conditional logic and IF statements.
Sometimes this involves the use of specialized functions known as user-defined functions. Functions of this type typically use the ADD function to expose the location of the enriched data.
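One way to express this kind of enrichment in SQL is to widen the table's schema and then populate the new column from the outside dataset. The sketch below is hypothetical: it assumes an Iceberg (or Delta Lake) table that supports MERGE, and an invented land.credit_scores dataset.

```sql
-- Hypothetical enrichment: add a credit_score column, then fill it
-- from an external dataset keyed on customer_id.
ALTER TABLE structure.customer_orders
    ADD COLUMN credit_score integer;

MERGE INTO structure.customer_orders AS t
USING land.credit_scores AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET credit_score = s.credit_score;
```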
Finally, technical transformation harmonizes data from different schemas and structures into one data type. When complete, the new dataset is saved into a new format. This is the moment when the Structure layer is created.
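In Starburst SQL, this final step is often expressed as a CREATE TABLE AS SELECT. The sketch below is illustrative only; it assumes the session is connected to an Iceberg catalog and writes the cleaned, normalized data out as Parquet.

```sql
-- Hypothetical example: materialize the cleaned, normalized data
-- as an Iceberg table stored in Parquet.
CREATE TABLE structure.customer_orders_clean
WITH (format = 'PARQUET')
AS
SELECT *
FROM structure.customer_orders
WHERE customer_id IS NOT NULL
  AND order_total >= 0;
```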
First, the Consume layer contains data that has finished its journey through the data pipeline. At this stage, it is ready to be used by data consumers.
Importantly, data is consumed in a number of different ways. These include:
Queries
Traditionally, the most basic way in which data is consumed is through ad hoc queries. These might be:
Business Intelligence (BI) tools
Additionally, data from the consume layer is also used to feed BI tools, including:
Data products
Beyond this, data products provide another pathway to consume data. These curated datasets allow you to tag, save, and share data across teams. This exciting development empowers data consumers to take charge of their own datasets in new ways. Starburst Galaxy and Starburst Enterprise both use data products to enhance data visibility, data discovery, data sharing, and data governance.
The first thing to understand is that users consume data in a number of different ways. Each of these approaches is important, and the data lakehouse addresses each of them differently. Let’s look through the different ways that data lakehouses use data to see how data is consumed in each case.
First, you have data aggregation. Aggregations involve the grouping of data according to conditional logic. Typically, these analytical queries use the SQL GROUP BY command to return values that match the conditions set out.
Importantly, sometimes multiple aggregations are necessary. In these instances, a Rollup can be used to execute multiple GROUP BY commands together.
A related operation, CUBE, combines multiple rollups by aggregating across every combination of the grouping columns, generating a multi-dimensional model.
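As an illustration, the sketch below aggregates a hypothetical sales table by region and product category; ROLLUP adds per-region subtotals and a grand total, and swapping ROLLUP for CUBE would add subtotals for every combination of the two columns.

```sql
-- Hypothetical example: total sales by region and category,
-- with subtotals and a grand total added by ROLLUP.
SELECT
    region,
    category,
    sum(order_total) AS total_sales
FROM consume.sales
GROUP BY ROLLUP (region, category);
```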
Another approach involves the use of window functions. These can be used to calculate values such as rolling averages that compare each row to the rows around it.
For example, windowing might be used to return stock prices or average sales in real time. This is typically used by data consumers to create reports, find trends, and build actionable business intelligence.
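A minimal sketch of a rolling average is shown below; the consume.daily_sales table and the seven-day window are invented for illustration.

```sql
-- Hypothetical example: a seven-day rolling average of daily sales.
SELECT
    order_date,
    daily_sales,
    avg(daily_sales) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7_day_avg
FROM consume.daily_sales;
```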
First, let’s unpack the data lakehouse by exploring a reference architecture in more detail. The video below introduces the topic and outlines the key role played by each component in the data pipeline process.
Next, it’s important to unpack the complex vocabulary surrounding data lakehouses. The video below unpacks this terminology in more detail.
Finally, it’s important to understand that data pipelines can be constructed on data lakehouses using a number of different technologies and tools. The video below outlines which tools you are likely to encounter in this space and where Starburst Galaxy fits.