Offload Your Cloud Data Warehouse Workloads

Land your data in any cloud object store, process it through the traditional layers and serve up those objects to a variety of end users with a variety of use cases.

Strategy | April 19, 2023

Tom Nats

Director of Customer Solutions

Starburst

Snowflake is a terrific cloud data warehouse (CDW). They were the first to provide an easy-to-use, autoscaling, high-performing analytical platform in the cloud. Companies flocked from their on-prem warehouse appliances to the cloud, and they haven’t looked back since.

As companies have grown more comfortable with their applications and analytics being served out of the cloud, challenges have emerged around cost and the limited flexibility that a cloud data warehouse provides. When all of your data is in a single cloud warehouse, you are at the mercy of that vendor: you cannot easily take advantage of new technologies or control your costs.

Starburst Galaxy is an analytical platform that can:

Extract and process data on your data lake
Provide a fast, highly concurrent query engine on your data lake
Federate data from many different sources in real time using one of the many available connectors

All of this runs in an open data lake architecture. Your data isn’t held hostage: it lives in your account, and you can pick and choose the engine that suits your needs.

Landing, processing, and serving up this data from a single storage location provides numerous benefits such as:

Total ownership of your data
Less “surface area”: the more you copy data, the less secure it is
Freedom to choose the engine for your use cases and needs
No vendor or storage lock-in

In this blog post, I will discuss the two ways Starburst Galaxy can augment or offload your CDW.

First, we’ll cover how Starburst Galaxy augments your current CDW.

Augmenting your cloud data warehouse (federation)

Companies often stage their data in cloud object storage before copying it to a cloud data warehouse. Not all of this data typically makes it into the CDW. Additionally, there are usually other data sources that contain data that users would like to join with the data in their CDW.

Starburst Galaxy includes a wide range of connectors to relational and non-relational data sources such as PostgreSQL, MongoDB, and Elasticsearch. There are also connectors to CDW systems such as Snowflake, Redshift, BigQuery, and Synapse.

Data is joined in real-time between these systems using our SQL cost-based optimizer. This allows standard SQL to be used across different systems.

Example:

-- snowflake, s3, and sqlserver stand in for fully qualified catalog.schema.table names
SELECT
    snowflake.customer_region,
    sqlserver.product_type,
    sum(s3.total_sales) AS total_sales
FROM
    snowflake, s3, sqlserver
WHERE
    snowflake.customer_id = s3.customer_id
    AND sqlserver.product_id = s3.product_id
GROUP BY
    snowflake.customer_region,
    sqlserver.product_type;

This allows data from object storage, relational, and non-relational sources to be joined with data in the CDW. Performing real-time analysis across data stores without first copying the data into the CDW saves time and money and allows for quicker insights.

Offloading workloads from your expensive cloud data warehouse

Two of the most common things we hear from companies that went “all-in” on a CDW for their organization are:

The cost has risen to out-of-control levels, and we must do something about it
We feel like our data is locked in, and we are unable to take advantage of the many existing and new technologies

With Starburst Galaxy, you can land your data in any cloud object store, process it through the traditional layers, and serve up those objects to a variety of end users with a variety of use cases. The best part is that the data is stored in open formats such as Parquet and ORC, and it’s located in YOUR cloud account, giving you the flexibility to use any engine you want to provide analytics on your data.

One of the most common questions we continue to hear from companies that love the idea of an open data lake is “How do I create tables on my files in my cloud storage?” This is a valid concern, and it is mostly a leftover from the Hadoop days, when Hive tables were usually large, monolithic structures and people were taught that “joins are bad.”

Let’s take the industry standard TPC-H benchmark, for example. The table diagram is just a standard traditional database ERD with tables. You would create these tables, insert, update, merge, and even delete data in them like a normal database.
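As a rough sketch of what that looks like in Trino SQL, the statements below create a TPC-H style orders table on object storage and then insert, update, merge, and delete rows in it. The catalog name (lake), schema name (tpch), staging table (orders_staging), and the sample row are all hypothetical. The sketch also assumes the catalog uses a table format that supports row-level changes, such as Iceberg; available table properties vary by connector.

```sql
-- Hypothetical names: catalog "lake", schema "tpch", staging table "orders_staging".
-- Assumes a table format that supports row-level changes (for example, Iceberg).
CREATE TABLE lake.tpch.orders (
    o_orderkey     BIGINT,
    o_custkey      BIGINT,
    o_orderstatus  VARCHAR,
    o_totalprice   DOUBLE,
    o_orderdate    DATE,
    o_shippriority INTEGER
)
WITH (format = 'PARQUET');

-- Insert a sample row (values are made up for illustration).
INSERT INTO lake.tpch.orders
VALUES (1, 100, 'O', 1500.00, DATE '1996-01-02', 0);

-- Update a row in place.
UPDATE lake.tpch.orders
SET o_orderstatus = 'F'
WHERE o_orderkey = 1;

-- Upsert changes from a staging table.
MERGE INTO lake.tpch.orders AS t
USING lake.tpch.orders_staging AS s
    ON t.o_orderkey = s.o_orderkey
WHEN MATCHED THEN
    UPDATE SET o_orderstatus = s.o_orderstatus,
               o_totalprice  = s.o_totalprice
WHEN NOT MATCHED THEN
    INSERT (o_orderkey, o_custkey, o_orderstatus,
            o_totalprice, o_orderdate, o_shippriority)
    VALUES (s.o_orderkey, s.o_custkey, s.o_orderstatus,
            s.o_totalprice, s.o_orderdate, s.o_shippriority);

-- Delete rows that are no longer needed.
DELETE FROM lake.tpch.orders
WHERE o_orderdate < DATE '1993-01-01';
```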

Additionally, joining tables is fully recommended again, just like a regular database:

SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer, orders, lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
GROUP BY
    l_orderkey, o_orderdate, o_shippriority
ORDER BY
    revenue DESC, o_orderdate
LIMIT 10;

The best part is that Starburst Galaxy is built upon the open source Trino engine (formerly PrestoSQL), which grew out of the Presto project developed at Facebook. It is designed to handle thousands of concurrent users across any BI tool and is being used at some of the top companies in the world.

Here is a handy feature matrix showing that using Starburst Galaxy to build your open data lake on your cloud storage not only provides the same benefits as a CDW but is also completely open, so you can plug in other engines if and when needed (see this blog for more information on a multi-engine data lake):

Features Matrix	Open Data Lake	CDW
Creating SQL tables	✓	✓
Updating data	✓	✓
Deleting data	✓	✓
High-performance queries and joins	✓	✓
Data sharing	✓	✓
Multi-engine	✓	✕

Common data warehouse offload use cases

What to offload and when depends on your organization’s data holdings. Log analysis, machine learning staging, and regulatory archiving are common use cases for offloading:

Log analysis. Log analysis tracks security and identifies inefficiencies by sorting through existing records to find patterns within system behavior. While these historical analyses are critical, they don’t usually require lightning-fast responses, so logs are a great candidate for offloading to cheaper, more compressed storage.
Machine learning staging. Machine learning training reads huge volumes of data from training datasets. Open columnar file formats like Parquet compress well, which makes them a good fit for training data offloaded to a data lake queried through Starburst Galaxy.
Regulatory archiving. Various regulations require certain data, especially financial and healthcare data, to be archived for a fixed period. This data is rarely queried, which makes it a perfect candidate for warehouse offload.

As you can see, any data you are storing that is rarely being overwritten is a strong candidate for warehouse offload, whether it’s active like an ML dataset or sitting cold like a regulatory archive.
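One possible way to offload such a table is a CREATE TABLE AS SELECT that reads from the warehouse and writes Parquet files to object storage. The sketch below assumes two hypothetical catalogs, snowflake (pointing at the CDW) and lake (backed by object storage); the schema, table, column names, and cutoff date are illustrative only.

```sql
-- Hypothetical names: "snowflake" is a catalog pointing at the CDW,
-- "lake" is a catalog backed by cloud object storage.
-- Schema, table, column names, and the cutoff date are illustrative.
CREATE TABLE lake.archive.app_logs
WITH (format = 'PARQUET')
AS
SELECT *
FROM snowflake.public.app_logs
WHERE event_date < DATE '2023-01-01';

-- Check the row count of the copy before removing anything from the warehouse.
SELECT count(*) AS offloaded_rows
FROM lake.archive.app_logs;
```

Once the copy is verified, the same rows can be removed from the warehouse on whatever schedule your retention policy allows.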

Now is the time to turn your data swamp into an open, well-structured, high-performing data lake that can serve ALL of your analytical use cases out of a single storage platform in your account.

If you have any questions, please feel free to reach out to us. We have also launched Starburst Academy with many free courses, including Data Foundations, our self-paced, hands-on learning course that covers data lakes extensively.

FAQs about data warehouse offloading

What are the primary benefits of offloading data warehouse workloads?

Offloading workloads from a traditional warehouse to a data lakehouse architecture significantly reduces infrastructure costs by leveraging affordable cloud object storage. Beyond cost savings, this strategy eliminates vendor lock-in by keeping data in open formats like Parquet or ORC. It also grants organizations full ownership of their data and enables them to use various high-performance query engines tailored to specific analytical use cases.

How does offloading data differ from augmenting a data warehouse?

Augmenting a warehouse typically involves leaving data where it resides across various sources and using a federation layer to join it with warehouse records in real time for enriched insights.

Offloading, on the other hand, moves data storage and heavy processing tasks out of expensive, proprietary systems and into a cost-effective open data lake, while still letting you query that data using standard SQL.

Can I use standard SQL on data offloaded to cloud object storage?

Yes. You can execute standard SQL queries directly on files stored in cloud object storage, treating them just like tables in a traditional database. This capability supports complex operations, including joins, updates, merges, and deletions. In doing so, data teams can leverage their existing SQL skills without managing the monolithic structures associated with legacy Hadoop environments.

Will offloading workloads compromise query performance?

Offloading workloads to a data lakehouse architecture does not inherently sacrifice performance, as modern query engines are designed to handle thousands of concurrent users at high speed. By utilizing cost-based optimizers and separating compute from storage, organizations can achieve high-performance analytics on vast datasets without the concurrency limits or latency often encountered in constrained proprietary warehouse environments.

How does data warehouse offloading help with data retention strategies?

Data warehouse offloading enables organizations to implement efficient tiered storage strategies by categorizing data into hot, warm, and cold tiers based on access frequency. While critical “hot” data may remain in high-performance systems, massive volumes of historical “warm” or “cold” data can be moved to economical object storage, ensuring it remains accessible for compliance and retrospective analysis without incurring premium storage fees.