
Introducing automated data lake optimization in Starburst Galaxy

Automate the maintenance of your data lake to guarantee optimal query performance and storage utilization

Published: November 28, 2023

Modern table formats like Apache Iceberg have made data warehouse-like performance within a data lake a reality. This is in large part because these table formats enable efficient DML operations such as INSERT, DELETE, UPDATE, and MERGE. However, unlike a database or data warehouse, a data lake does not manage the underlying data files for you.

Instead, as these DML operations occur, the number of files in the lake can grow so large that it hurts query performance and storage utilization. There are manual ways to handle this, but they are often reactive, and making them proactive requires custom engineering. Either way, they demand time, resources, and expertise from the data team.

That’s why we are excited to introduce the concept of automated data optimization in Starburst Galaxy. Data optimization makes it seamless to schedule routine data maintenance operations on your lake tables.

Automated data optimization

Automated data optimization can be divided into four main operations:

  1. Data compaction
  2. Profiling and statistics
  3. Vacuuming
  4. Data retention

 

Let’s take an in-depth look at these operations and their use cases. 

Data Compaction

An impressive benefit of the Iceberg table format is its support for near real-time writes to lake tables. In rapid ingestion use cases like streaming ingest, data often arrives in many small files, which speeds up writes but slows down queries that later read the data.

Data compaction helps solve this “small file” problem by rewriting many small files into fewer, larger files of a more optimal size, resulting in faster query performance.
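To make the idea concrete, here is a minimal Python sketch of how a compaction pass might decide which small files to rewrite together. It is an illustration only, not how Starburst Galaxy implements compaction: the `DataFile` class, the `plan_compaction` function, and the 128 MB target are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical target size; the post does not state what Galaxy uses.
TARGET_FILE_BYTES = 128 * 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes=TARGET_FILE_BYTES):
    """Group small files into batches whose combined size fits in target_bytes.

    Files already at or above the target are left alone. Each returned batch
    would be rewritten as a single larger file.
    """
    # First-fit decreasing: place the largest small files first.
    small = sorted(
        (f for f in files if f.size_bytes < target_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    batches = []  # each entry is [remaining_capacity, [files]]
    for f in small:
        for batch in batches:
            if batch[0] >= f.size_bytes:
                batch[0] -= f.size_bytes
                batch[1].append(f)
                break
        else:
            batches.append([target_bytes - f.size_bytes, [f]])
    # Rewriting a batch of one file gains nothing, so skip those.
    return [group for _, group in batches if len(group) > 1]
```

With twenty 10 MB files and one 200 MB file, this plan produces two rewrite batches and leaves the large file untouched. For comparison, the manual route in open-source Trino's Iceberg connector is a statement of the form `ALTER TABLE ... EXECUTE optimize`, which someone has to remember to run or schedule.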

Profiling and Statistics

For those newer to Starburst Galaxy’s underlying architecture, it is built by the co-creators of Trino and uses many of the same concepts under the hood, including a query optimizer. The query optimizer helps ensure your queries run as efficiently as possible on a given table in your lake.

One of the key inputs to the query optimizer is metrics on the lake table. The fresher the table metrics are, the more efficiently queries will be executed against that table. 

The profiling and statistics maintenance operation automates the metric refresh process by analyzing the lake table and returning relevant metrics to the query optimizer on a scheduled basis. 
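As a rough illustration of what "metrics on the lake table" can look like, the Python sketch below profiles a single column. The post does not list which metrics Galaxy collects, so the specific statistics here (row count, null fraction, distinct count, min, max) and the `profile_column` function are assumptions, chosen because they are typical inputs to a cost-based optimizer.

```python
def profile_column(values):
    """Compute simple per-column statistics of the kind an optimizer consumes."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    return {
        "row_count": total,
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


stats = profile_column([3, None, 1, 3])
```

Statistics like these go stale as data is inserted, updated, and deleted, which is why refreshing them on a schedule matters. In open-source Trino, the manual counterpart is the `ANALYZE` statement.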

In addition to this, users will be able to view a subset of the profiling information within the Starburst Galaxy UI to better understand the composition of their data lake. 

Vacuuming

Failed jobs on a data lake often leave behind orphaned files. Because the query that wrote them failed, these files are not tied to any snapshot and can easily be overlooked by the delete operations used for data retention. As the volume of orphaned files grows, they clutter the data lake and add storage costs.

The new vacuum maintenance operation in Starburst Galaxy finds and deletes these orphaned files – helping users maintain an efficient data lake. 
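Conceptually, finding orphaned files is a set difference: everything in storage minus everything a snapshot still references. The Python sketch below shows that idea under stated assumptions. The `find_orphan_files` function, its inputs, and the 7-day grace period are hypothetical and are not taken from Galaxy; the grace period is there so that files belonging to writes still in flight are not mistaken for orphans.

```python
from datetime import datetime, timedelta, timezone


def find_orphan_files(stored_files, referenced_paths, now, min_age=timedelta(days=7)):
    """Return paths that no snapshot references and that are older than min_age.

    stored_files: dict mapping path -> last-modified datetime
    referenced_paths: set of paths reachable from any snapshot
    min_age: hypothetical grace period protecting in-flight writes
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in stored_files.items()
        if path not in referenced_paths and modified < cutoff
    )


now = datetime(2023, 11, 28, tzinfo=timezone.utc)
stored = {
    "a.parquet": now - timedelta(days=30),  # referenced by a snapshot
    "b.parquet": now - timedelta(days=30),  # left behind by a failed job
    "c.parquet": now - timedelta(hours=1),  # possibly still being written
}
orphans = find_orphan_files(stored, {"a.parquet"}, now)
```

Here only `b.parquet` qualifies for deletion. In open-source Trino's Iceberg connector, the manual counterpart is `ALTER TABLE ... EXECUTE remove_orphan_files`.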

Data Retention

A valuable benefit of the Iceberg table format is the ability to travel back in time to a previous version of a table via snapshots. Snapshots support both time travel and version control, helping many data teams maintain compliance and troubleshoot data issues quickly.

However, in rapid ingestion use cases like streaming ingestion, the sheer number of changes to a lake table produces a huge number of snapshots. This results in additional storage costs and slower query performance.

The data retention maintenance task helps users delete snapshots that are no longer needed. This feature helps users specify a retention threshold for snapshots based on a historical point in time – e.g. 30 days. 
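The retention rule described above can be sketched in a few lines of Python. This is a simplified model, not Galaxy's implementation: the `Snapshot` class and `snapshots_to_expire` function are hypothetical, and the choice to always keep the most recent snapshot (so the table's current state is never expired) is an assumption of this example. The 30-day default mirrors the example threshold in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots, now, retention=timedelta(days=30)):
    """Return snapshots older than the retention threshold, never the current one."""
    if not snapshots:
        return []
    # Assumption: the latest commit is the table's current state and must stay.
    current = max(snapshots, key=lambda s: s.committed_at)
    cutoff = now - retention
    return [
        s
        for s in snapshots
        if s.committed_at < cutoff and s.snapshot_id != current.snapshot_id
    ]


now = datetime(2023, 11, 28, tzinfo=timezone.utc)
history = [
    Snapshot(1, now - timedelta(days=45)),
    Snapshot(2, now - timedelta(days=31)),
    Snapshot(3, now - timedelta(days=10)),
]
expired = snapshots_to_expire(history, now)
```

With a 30-day threshold, snapshots 1 and 2 are expired and snapshot 3 remains available for time travel. In open-source Trino's Iceberg connector, the manual counterpart is `ALTER TABLE ... EXECUTE expire_snapshots`.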

How to get started

These features are slated for private preview in early December. If you’re interested in being a part of the private preview, apply here.

Start for Free with Starburst Galaxy

Up to $500 in usage credits included
