TL;DR – Starburst Enterprise is now compatible with Databricks Delta Lake.
The big data ecosystem has many components, but the one that stands out is the data lake. Over the last few years, the data lake has shifted from mostly on-premises Hadoop clusters to low-cost object stores in the cloud (AWS S3, Azure Blob Storage), with separate compute clusters interacting with that storage (the separation of compute and storage).
A challenge that has plagued both Hadoop and object stores is that they are immutable: objects (often referred to as files) cannot be modified. This becomes a problem when some of the data inside an object needs to change – for example, an object containing customer information that must be updated to reflect a recent address change. There was no common practice for handling these operations, and companies struggled to create and maintain custom code for these scenarios. In addition, GDPR requires companies to implement erasure across all of their data stores, and adhering to that requirement is especially difficult in a data lake.
Databricks saw the need not only to provide ACID-style transactions (update, delete, and merge operations) on these immutable object stores, but also to add performance optimization and object management – all while maintaining read consistency (operations on the data do not affect simultaneous reads of the same data, a feature databases have had for years) across multiple clusters.
For Databricks customers, the solution is Delta Lake.
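As a sketch of what this looks like in practice, Delta Lake lets the scenarios above be expressed as ordinary SQL DML (the `customers` table and its columns here are hypothetical):

```sql
-- Update a customer's address in place; Delta rewrites the affected
-- data files and records the change in its transaction log
UPDATE customers
SET address = '42 New Street'
WHERE customer_id = 1001;

-- GDPR-style erasure: delete all records for a customer
DELETE FROM customers
WHERE customer_id = 1001;
```

Under the hood the underlying objects are still immutable; Delta Lake writes new files and uses its transaction log to track which files are current.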
Delta Lake provides not only ACID transactions but also optimization features, such as compacting the small files often created by streaming or frequent file ingest into larger ones, and Z-Ordering, which provides significant performance gains on some queries.
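On Databricks, these optimizations are exposed as SQL commands. A minimal sketch (the `events` table and `event_date` column are hypothetical):

```sql
-- Compact small files into larger ones, and co-locate related rows
-- by Z-Ordering on a frequently filtered column
OPTIMIZE events
ZORDER BY (event_date);
```

Z-Ordering clusters data so that queries filtering on the chosen column(s) can skip many of the underlying data files entirely.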
As more and more customers start to use Delta Lake within their data lakes, a high-concurrency, interactive SQL engine such as Starburst Enterprise is at the top of our customers' wish list.
Currently, Starburst supports the reading of Delta tables in two ways:
- Vacuum tables – If a table is “vacuumed” to retain 0 days, this places the Delta table in a “current” state, which allows Starburst Enterprise to cleanly read the table. This is by far the most performant method of querying Delta Lake tables.
- Manifest files – Databricks can generate a “manifest” file (currently in private preview) that lists the current data files belonging to a Delta table. Starburst Enterprise can read this file with no configuration changes; this is supported in the latest version of Starburst Enterprise.
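As a sketch, the two preparation methods above map roughly to the following Databricks commands (the table name and path are hypothetical; note that vacuuming with a zero retention period permanently removes older file versions and disables time travel for the table):

```sql
-- Method 1: vacuum the table so only "current" data files remain
-- (requires disabling the default retention-duration safety check)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM delta_db.customers RETAIN 0 HOURS;

-- Method 2: generate a manifest listing the table's current data files
GENERATE symlink_format_manifest FOR TABLE delta.`/delta/customers`;
```

With either method, Starburst Enterprise can then read the table's Parquet data files directly.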
Native Starburst Enterprise Delta Lake Reader
Starburst is currently working on a native Delta Lake reader. This will remove the need for a manifest file, which can be cumbersome to create before Starburst Enterprise can read the data in a Delta Lake. A native reader will also be more performant and will work seamlessly with Delta tables.
Stay tuned for further updates on this development!
The diagram below illustrates a common use case for Starburst Enterprise reading Delta tables. Data sources are ingested into a Delta Lake and can be immediately read from Starburst Enterprise. This enables our customers to benefit from all of the features of Delta (performance optimizations, Z-Ordering, and ACID transactional guarantees) as well as the performance and high concurrency of the Starburst Enterprise SQL engine.
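From the Starburst Enterprise side, such a table is then queried with ordinary SQL; a sketch, assuming a Hive-style catalog named `hive` and the hypothetical schema and table from earlier:

```sql
-- Query the Delta table's data from Starburst Enterprise
SELECT customer_id, address
FROM hive.delta_db.customers
WHERE customer_id = 1001;
```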
The data lake has changed dramatically in the last few years. As more and more companies blur the line between a data lake and a data warehouse, Delta Lake makes this a reality by bringing database functionality to the data lake.
Starburst is excited about the industry's quick adoption of Delta Lake. We believe it will finally enable companies creating or migrating their data lakes in the cloud to deliver the value they were promised years ago during the Hadoop excitement.