Snowflake, Databricks, Tabular, Iceberg, what does it all mean?

Strategy | June 11, 2024

Matt Fuller

Vice President, AI/ML Products

Starburst

Tobias Ternstrom

Chief Product Officer

Starburst

State of data catalogs: The battle for your metadata

What happened last week?

Snowflake Summit ran from Tuesday (June 4, 2024) through Thursday. This year, the conference was overshadowed by two significant announcements:

Databricks announces its acquisition of Tabular, the company founded by the original creators of Apache Iceberg
Snowflake announces Polaris, an open-source implementation of the Iceberg REST catalog

In light of these, other announcements from the Snowflake Summit feel like nothing more than a footnote¹.

The Tabular acquisition by Databricks is significant for two reasons:

It further confirms that Iceberg has won the lakehouse format wars. Databricks paid a handsome price to acquire Tabular, a move that rightly raises the question: What does this mean for the future of Delta Lake?
Snowflake had to do something. In this case, they preemptively announced open-source Polaris in a move that one can only presume will force Databricks’ hand to open-source Unity Catalog sooner or later (perhaps during this week’s Data + AI Summit)².

Let’s rewind: an Iceberg refresher

Iceberg is an open table format created at Netflix in 2017; it became a top-level Apache project in 2020 (Apache Iceberg). Netflix developed Iceberg as the storage format for the query engine Trino (known at the time as Presto) to address the performance and usability challenges of Apache Hive tables in its large and demanding data lake environments. Trino + Iceberg is known as an Icehouse architecture and has enabled companies like Netflix to move away from costly data warehouses. Because Iceberg is an open format, anyone can develop software to read and write Iceberg tables; for example, Iceberg is also supported by other data processing engines such as Apache Spark and Apache Flink.

What does this mean for Snowflake and Databricks customers and for the broader data community?

While we watch these companies battle it out publicly, let’s discuss what it means for data teams, most of whom have at least some budget against Snowflake or Databricks. For the first time in the 40+ years of data warehousing history, the entire industry has woken up to the reality that direct access to the underlying files that store their data should be open. Storing data in open formats, specifically Apache Iceberg, in an object storage lake, has enabled this. This type of architectural change happens rarely, something like once every 10-20 years in enterprise software.

We fundamentally believe these announcements are good for Snowflake and Databricks customers and help to reduce vendor lock-in. Snowflake customers have long complained that, while they appreciate the Snowflake product, it can become very expensive to use and very difficult to move away from.

With the attention on Iceberg and Snowflake Polaris, there has never been a better time for open standards in the data warehousing space. Starburst was founded seven years ago on the premise that data should always belong to the customer, and no one vendor should be able to lock in customers, limiting them from accessing and analyzing the data with whatever tools they want. Snowflake’s Polaris announcement has now finally recognized this basic customer need. With Databricks and Tabular, we can also expect Databricks to increase support for the Iceberg table format next to their own Delta Lake. This means that customers who start adopting Iceberg can leverage any compute engine on top of their data, which would otherwise be locked away in a proprietary format. Practically speaking, this means a huge potential to improve price-performance by choosing the best engine for the job.

Figure 1 – Polaris Catalog demo from Snowflake showcasing SQL engines that can be used to query Iceberg tables in Polaris.

While these developments are exciting for data in the cloud, they may leave organizations with significant on-premises footprints feeling left out (a recent CRN report indicates that 40% of global data center capacity is on-premises today³). At Starburst, we recognize this gap and believe the freedom and openness afforded by modern table formats should be available to any deployment model, whether on-premises, hybrid, or in the public cloud. Our partnership with Dell makes this vision a reality. The Dell Data Analytics Engine, powered by Starburst, allows enterprises with data on-premises to deploy a modern lakehouse architecture without a cloud migration. In addition, you can expect to see Iceberg management features for on-premises deployments in future releases.

For Starburst customers, these announcements are simply additional integration options. Starburst and Trino have supported the Iceberg REST catalog since it was introduced two years ago. In addition to supporting the Apache Iceberg REST catalog, we added an integration for Tabular’s REST catalog in early 2023. Integration with Databricks’ Unity Catalog is currently in Private Preview, with general availability planned in the coming months. More recently, we joined Confluent as a launch partner for their TableFlow REST catalog. Snowflake Polaris is another REST-based Iceberg catalog that we can integrate with, just as we have with the others. The approach with our Gravity catalog is simply to integrate with a variety of catalogs to give our customers the most freedom of choice.
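As a rough illustration of what pointing Trino at a REST-based Iceberg catalog involves, the sketch below renders a catalog properties file. The property names are those of Trino's Iceberg connector; the helper function, the URI, and the warehouse name are hypothetical placeholders, and a real deployment would usually need authentication settings as well.

```python
from __future__ import annotations


def iceberg_rest_catalog_properties(uri: str, warehouse: str | None = None) -> str:
    """Render the text of a Trino catalog properties file for an Iceberg REST catalog.

    `uri` and `warehouse` are supplied by the caller; nothing here contacts a server.
    """
    props = {
        "connector.name": "iceberg",       # use Trino's Iceberg connector
        "iceberg.catalog.type": "rest",    # talk to a REST catalog rather than Hive or Glue
        "iceberg.rest-catalog.uri": uri,   # base URL of the REST catalog service
    }
    if warehouse:
        # Optional: the warehouse identifier or location the catalog should use
        props["iceberg.rest-catalog.warehouse"] = warehouse
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical endpoint and warehouse name, for illustration only
example = iceberg_rest_catalog_properties(
    "https://catalog.example.com/api/catalog", warehouse="analytics"
)
print(example)
```

Because every REST-based Iceberg catalog is addressed the same way, switching catalogs in this sketch amounts to changing the URI and credentials rather than the connector.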

Practical tips for getting started with Iceberg

Iceberg is the strategic table format for Starburst customers, and we have been a part of the Iceberg community before Iceberg became an Apache project. We have learned a lot about building on Iceberg from working with our customers and the Iceberg community, and we are happy to share some advice to help Snowflake customers create their Iceberg strategy.

1. Use Iceberg for new tables

If you’re a Snowflake customer, most of your data is likely stored in native Snowflake tables. It is important to note that you can combine Snowflake and Iceberg tables in a single query, which means you can adopt Iceberg gradually. You should strongly consider storing your Iceberg tables on the object storage of your choice in your own cloud account. In addition, Snowflake end users won’t directly see a difference between Snowflake and Iceberg tables. To reduce risk and start learning about Iceberg, a good path is to use the Iceberg table format for new tables added to your solution. This way, you gain Iceberg experience and learn how Snowflake behaves when using Iceberg tables alone or in combination with Snowflake tables.

2. Use Iceberg to move your transformations to a more cost-effective engine

Transformations that prepare raw data landing in your solution for end-user consumption can consume a significant amount of Snowflake compute. Because Iceberg tables can be read and written by engines other than Snowflake, you can use an alternative, cost-effective, and preferably open SQL engine like Trino to perform your transformations, while your end users still query the data through Snowflake (and don’t notice a difference). By doing this, you can save money and start heading towards an open data lakehouse architecture.
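To make the idea concrete, here is a minimal sketch of a transformation expressed as a Trino CREATE TABLE AS statement that writes its result as an Iceberg table. The `format` and `partitioning` table properties belong to Trino's Iceberg connector; the helper function and every catalog, schema, table, and column name are hypothetical and only stand in for your own.

```python
from __future__ import annotations


def build_iceberg_ctas(target: str, source: str, columns: list[str],
                       partitioning: list[str]) -> str:
    """Build a Trino CTAS statement whose output is stored as an Iceberg table."""
    select_list = ",\n  ".join(columns)
    # Trino expects partition transforms as an array of quoted strings
    partition_spec = ", ".join(f"'{p}'" for p in partitioning)
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', partitioning = ARRAY[{partition_spec}])\n"
        f"AS SELECT\n  {select_list}\n"
        f"FROM {source}"
    )


# Hypothetical names: a raw landing table cleaned into a curated table
statement = build_iceberg_ctas(
    target="iceberg.silver.orders",
    source="iceberg.bronze.orders_raw",
    columns=["order_id", "customer_id", "CAST(order_ts AS timestamp(6)) AS order_ts"],
    partitioning=["day(order_ts)"],
)
print(statement)
```

The resulting statement would be run on Trino, for example through its CLI or a client library. Since the output is a plain Iceberg table registered in a shared catalog, other engines that can reach that catalog, Snowflake included, could then read it.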

3. Mindfully migrate data to Iceberg over time

To continue heading towards an open lakehouse architecture and reduce your operating costs and potential future switching costs, you can gradually move your tables from Snowflake storage to Iceberg tables. Regarding which tables to move, it is a best practice to rank your tables by how critical they are to your business and then move related tables together, starting with the least critical tables. For instance, we recommend starting by landing raw data directly in Iceberg using Starburst or an ingestion tool like Fivetran and doing some basic cleanup and normalization (Bronze and Silver tables in a medallion architecture). By doing this, you will learn lessons and surface potential issues on your less critical tables first. Over time, your confidence will increase, and you can start migrating more critical tables to Iceberg, e.g. Gold tables.
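The ranking step can be sketched in a few lines. The example below assumes you have assigned each table a criticality score and a group of related tables; both the scoring scheme and the table names are hypothetical. It orders groups so that the least critical ones migrate first and related tables stay in the same wave.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    group: str        # related tables share a group and should move together
    criticality: int  # 1 = least critical; higher = more critical to the business


def plan_migration_waves(tables: list[Table]) -> list[list[str]]:
    """Return one wave per group, ordered from least to most critical."""
    groups: dict[str, list[Table]] = {}
    for table in tables:
        groups.setdefault(table.group, []).append(table)
    # A group is as critical as its most critical table; ties break on group name
    ordered = sorted(
        groups.items(),
        key=lambda item: (max(t.criticality for t in item[1]), item[0]),
    )
    return [sorted(t.name for t in members) for _, members in ordered]


# Hypothetical inventory: raw landing data first, Gold tables last
inventory = [
    Table("revenue_summary", "gold_finance", 3),
    Table("raw_events", "landing", 1),
    Table("orders_clean", "silver_orders", 2),
    Table("raw_clicks", "landing", 1),
    Table("customers_clean", "silver_orders", 2),
]
for number, wave in enumerate(plan_migration_waves(inventory), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Scoring a group by its most critical member is a deliberately conservative choice: a group that contains even one business-critical table waits until the later waves.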

4. Consider an Icehouse architecture

An Icehouse architecture is based on Trino for high-performance SQL querying at scale (read and write) and Iceberg for storage; it is complementary to your Snowflake + Iceberg solution and can help you significantly lower your operating costs. In fact, as soon as you have your first Iceberg table, you can pair it with Trino and a SaaS Icehouse implementation like Starburst Galaxy to start querying your data in Iceberg. A good first use of an Icehouse architecture is for transformations (as mentioned above), followed by moving your primary SQL workloads, such as applications and dashboards, to query your Icehouse directly.

Last week was packed with exciting announcements for anyone following data and AI, and we are excited to see much of our work validated by these announcements. Now, let the actual work to unlock insights from all data genuinely begin.

¹Snowflake Notebooks: a development interface for Python, SQL, and Markdown; Snowflake Cortex AI: a way to create AI-powered applications; Snowflake Trail: end-to-end observability within Snowflake

²Confirmed June 12, 2024: https://www.databricks.com/company/newsroom/press-releases/databricks-open-sources-unity-catalog-creating-industrys-only-open

³https://www.crn.com/news/cloud/2024/amazon-ceo-85-percent-of-it-spend-remains-on-premises-gen-ai-will-fuel-aws-cloud-sales