Snowflake is a terrific cloud data warehouse (CDW). They were the first to provide an easy-to-use, autoscaling, high-performant analytical platform in the cloud. Companies flocked from their on-prem warehouse appliances to the cloud and they haven’t looked back since.
As companies have grown more comfortable with their applications and their analytics being served out of the cloud, there have been challenges around cost and the flexibility that cloud data warehouse provides. When all of your data is in a single cloud warehouse, you are at the mercy of that vendor and are unable to take advantage of new technologies and control your costs.
Starburst Galaxy provides an analytical platform that can provide the following:
- Extract and process data on your data lake
- Provide the fastest, highest concurrent query engine on your data lake
- Federate data from many different sources in real-time using one of the many available connectors
All this is in an open data lake architecture. Your data isn’t held hostage and lives in your account and you can pick and choose the engine that suits your needs.
Landing, processing and serving up this data from a single storage location provides numerous benefits such as:
- Total ownership of your data
- Less “surface area” meaning the more you copy data, the less secure it is
- Choose the engine for your use cases and needs
- Avoiding vendor and storage lock-in
In this blog post, I will discuss the two ways Starburst Galaxy can augment or offload your CDW.
First, we’ll cover how Starburst Galaxy augments your current CDW.
Augmenting your cloud data warehouse (federation)
Companies often stage their data in cloud object storage before copying it to a cloud data warehouse. Not all of this data typically makes it into the CDW. Additionally, there are usually other data sources that contain data that users would like to join with the data in their CDW.
Starburst Galaxy contains a wide range of connectors to relational and non-relational data sources such as PostgreSQL, Mongo, and Elasticsearch. There are also connectors to cloud cdw systems such as Snowflake, Redshift, BigQuery and Synapse.
Data is joined in real-time between these systems using our SQL cost-based optimizer. This allows standard SQL to be used across different systems.
snowflake, s3, sqlserver
snowflake.customer_id = s3.customer_id
AND sqlserver.product_id = s3.product_id
This allows data to be joined from object storage, relational, and non-relational sources to the CDW. Performing real-time analysis of data across data stores without needing to copy this data into the CDW saves time, money and allows for quicker insights.
Offloading workloads from your expensive cloud data warehouse
Two of the most common things we hear from companies that went “all-in” on a CDW for their organization are:
- The cost has risen to out-of-control levels and we must do something about it
- We feel like our data is locked in and we are unable to take advantage of the many existing and new technologies
With Starburst Galaxy, you can land your data in any cloud object store, process it through the traditional layers and serve up those objects to a variety of end users with a variety of use cases. The best part is the data is stored in open formats such as parquet and orc and it’s located in YOUR cloud account providing you with the ultimate flexibility to use any engine you want to provide analytics on your data.
One of the biggest misconceptions we continue to hear from companies that love the idea of an open data lake is “how do I create tables on my files in my cloud storage?”. This is a valid argument and is mostly a leftover from Hadoop days where Hive tables were usually large, monolithic structures and people were taught “joins are bad”.
Let’s take the industry standard TPC-H benchmark for example. The table diagram is just a standard traditional database ERD with tables. You would create these tables, insert, update, merge and even delete data in them like a normal database.
Additionally, joining tables is fully recommended again just like a regular database:
SELECT FIRST 10
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
customer, orders, lineitem
c_mktsegment = 'BUILDING'
c_custkey = o_custkey
l_orderkey = o_orderkey
o_orderdate < MDY(3, 15, 1995)
l_shipdate > MDY(3, 15, 1995)
l_orderkey, o_orderdate, o_shippriority
revenue DESC, o_orderdate
The best part is Starburst Galaxy is built upon the open source Trino engine developed at Facebook to handle 1000s of concurrent users across any BI tool and is being used at some of the top companies in the world.
Here is a handy feature matrix showing how using Starburst Galaxy to build your open data lake on your cloud storage provides not only the same benefits as a CDW, it’s completely open so you can plug in other engines if and when needed: (see this blog for more information on a multi-engine data lake)
|Open Data Lake
|Creating SQL tables
|High performance queries and joins
Now is the time to turn your data swamp into an open, well structured, high performing, open data lake that can serve ALL of your analytical use cases out of a single storage platform in your account. Sounds too good to be true doesn’t it?
If you have any questions, please feel free to reach out to us. We have also launched Starburst Academy with many free courses including our Data Foundations, our self-paced, hands-on learning course which covers data lakes extensively.