Starburst and Databricks Collaborate on the Trino Delta Lake Connector


This blog was co-authored by Claudius Li, Product Manager at Starburst, and Joe Lodin, Information Engineer at Starburst.

Starburst recently donated the Delta Lake connector to Trino. We released the initial Delta Lake connector for Starburst Enterprise users in April 2020. The connector started out with read capabilities, but we’ve consistently expanded functionality to add write capabilities, data management capabilities, and significant performance enhancements.

[Diagram: Starburst and the Trino Delta Lake connector]

Over the last couple of weeks, we ported the Delta Lake connector and its documentation to Trino, and the connector shipped in Trino 373. In episode 34 of the Trino Community Broadcast, we showcased the new connector. You can watch the demo video to see the connector in action, and you can also follow along with our instructions. Now, with Trino and Delta Lake, anyone can create and query a lakehouse using 100% open-source software.
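To give a sense of how little setup is involved, a minimal Trino catalog file for the connector might look like the following sketch. The metastore URI is a placeholder for your environment, and your deployment may need additional storage credentials:

```properties
# etc/catalog/delta.properties — example only; the metastore URI is a placeholder
connector.name=delta-lake
hive.metastore.uri=thrift://example-metastore:9083
```

With a catalog like this in place, Delta Lake tables become queryable from Trino alongside any other configured data source.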

All of this is just the beginning. We have many more plans, and Starburst and Databricks are teaming up to make the connector even better.

Galactic lakehouse and a stronger community

With full support for Delta Lake in the Great Lakes connectivity of Starburst Galaxy, customers can set up a fully functional lakehouse in minutes with just a few clicks. Starburst Galaxy users will be able to connect to an existing lakehouse, or to any cloud-native object storage to instantly create a new one.

From open source support to enterprise-grade support

With the Delta Lake connector in open source Trino, and open source Delta Lake itself, anybody can get started running their own lakehouse. Starburst and Databricks are working with both open source communities to gather feedback, fix issues, improve performance, and add new features.

“The decision to contribute the Delta Lake connector to Trino is an important milestone for Starburst. It reinforces our commitment to the large open source community around Trino. Together with other open-source communities like Delta Lake, we can deliver tremendous features to our users. We look forward to learning about their usage and problems, and making Trino even better for everyone,” said Matt Fuller, VP of Product at Starburst.

Databricks is embracing and supporting the Delta Lake open source community in a similar fashion, as Michael Armbrust, Distinguished Engineer at Databricks, mentions: “The Starburst team shares our commitment to the open-source community, and this is an amazing starting point for future collaborations between Trino and Delta Lake! We’re excited to see the Delta Lake connector flourish with the Trino community.”

All of this work flows back to both communities, as Starburst and Databricks contribute fully supported enterprise features to the core open source projects.

Starburst Enterprise and Databricks customers automatically get the upgrade to the new connector in an upcoming release.

The road ahead

Starburst is continuing to iterate on the Delta Lake connector and related features. We work constantly to make all of our connectors faster, more reliable, and more flexible. This commitment applies to any code in the core Trino engine, in the Trino connectors, and to any additional connectors available to Starburst Enterprise and Starburst Galaxy users.

Databricks is building a standalone Delta Lake reader library. The library gives engines like Trino a single, well-tested integration point with Delta Lake, so they automatically adapt to any interface, protocol, or semantic changes in the format, improving both performance and reliability. Starburst and Databricks are working closely to make sure the reader library meets the needs of Trino and other Delta Lake users.

The secret to reliable software is testing. To collaborate on ensuring the reliability of these complex systems, Databricks has graciously offered to donate a test environment to the Trino project. Starburst will use this to make sure that Trino and Databricks can correctly read data written by the other engine as part of our continuous integration setup.

What's next?

We are really excited to bring all the benefits of Trino and Delta Lake to our communities and create a healthy ecosystem of collaboration with all of our users, customers, and contributors alike.

Starburst and Databricks are eager to hear about enhancements you think we should add to our connectors. Chat with us on the Trino Slack, and consider sending a pull request with your improvements.

Since Starburst and Databricks are collaborating on this Delta Lake connector, we know there will be a lot of questions about how these pieces fit together. People want to know whether they are being locked in, or whether they are truly getting the optionality we always talk about.

Here is a quick FAQ to clear the air and explain what this means for your data architecture.

Trino and Delta Lake FAQ

What exactly is the Trino Delta Lake connector?

The Delta Lake connector is the bridge that allows Trino to read from and write to tables stored in the Delta Lake format. Delta Lake is an open-source storage layer that adds ACID transactions and scalable metadata management to your data lake. By donating this connector to the Trino community, we are enabling anyone to build a high-performance lakehouse using entirely open-source software.
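Once a Delta Lake catalog is configured, querying a Delta table from Trino looks like querying any other table. The catalog, schema, and table names below are illustrative:

```sql
-- Read a Delta Lake table through Trino;
-- 'delta.sales.orders' is a hypothetical catalog.schema.table
SELECT order_id, total_amount
FROM delta.sales.orders
WHERE order_date >= DATE '2022-01-01';
```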

Do I need to be a Databricks customer to use this?

No. That is the beauty of open standards. While Databricks founded the Delta Lake project, it is a fully open-source format. This connector allows Trino users to query Delta Lake tables wherever they reside, whether on Amazon S3, Azure Storage, or Google Cloud Storage. You get the benefits of the Delta format without being locked into a single compute engine.

How does the Delta Lake connector differ from the Hive connector?

The Hive connector has been the workhorse of the data lake for a decade, but it has limitations in terms of data consistency and performance at scale. Delta Lake solves these problems by using a transaction log to manage files. This means Trino can handle concurrent reads and writes more effectively. While Hive is still a critical part of many architectures, Delta Lake represents the next generation of data organization on the lake.
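One practical consequence of the transaction log is that every change to a table is versioned. In recent Trino releases, the connector exposes this history through a hidden `$history` metadata table; the table name below is illustrative:

```sql
-- Inspect the Delta transaction log for a hypothetical table 'delta.sales.orders'
SELECT version, timestamp, operation
FROM delta.sales."orders$history"
ORDER BY version DESC;
```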

Can Trino write to Delta Lake tables or just read them?

Trino has full support for both reading and writing. This includes the ability to create new tables, insert data, and perform updates and deletes. We are working closely with the Databricks team to ensure that Trino remains compatible with the Delta Lake specification so that data written by Trino can be read by other engines and vice versa.
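As a sketch of that full lifecycle, the connector lets you express creation, ingestion, and mutation in standard SQL; table and column names here are illustrative, and the table location is managed by the configured catalog:

```sql
-- Create a new Delta Lake table
CREATE TABLE delta.sales.orders (
    order_id BIGINT,
    total_amount DOUBLE,
    order_date DATE
);

-- Each statement below commits as a transaction on the Delta log
INSERT INTO delta.sales.orders VALUES (1, 99.50, DATE '2022-03-01');
UPDATE delta.sales.orders SET total_amount = 89.50 WHERE order_id = 1;
DELETE FROM delta.sales.orders WHERE order_id = 1;
```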

What is the roadmap for the connector?

The road ahead is focused on performance and reliability. Databricks is developing a standalone Delta Lake reader library that will help engines like Trino stay in sync with the Delta format as it evolves. We are also setting up shared testing environments to ensure that Starburst, Trino, and Databricks all work together seamlessly. You can expect to see even deeper integration and faster query times as this collaboration matures.
