Delta Lake was initially developed by Databricks and by 2019 evolved to become an open source project. Since then, they’ve created a few key features.
Delta Lake allows classical Data Manipulation Language (DML) statements (INSERT, UPDATE, DELETE, and MERGE) and manages the commit of those operations as an ACID-compliant transaction. It does this by maintaining a transaction log of changes (also known as the DeltaLog) which itself is the enabler for performing time-travel querying since every change creates a new version of the table.
Not only is it already running in the cloud, but you can get started for free!!
You can then leverage its Great Lakes connectivity to use one of the popular public cloud object stores. For a little help with that, check out my querying aviation data in the cloud (leveraging starburst galaxy) post that sets up an Amazon S3 catalog.
Here I am ready to go now! How about you?
As you might be able to see above (if you have stronger glasses than me!), I ran a couple of statements to create a schema to play in and then to root myself to that schema.
-- my catalog is named 'mycloud', so update accordingly CREATE SCHEMA mycloud.dlblog1; USE mycloud.dlblog1;
Let’s create a new table. Notice the type property in the WITH clause — Great Lakes connectivity allow multiple table formats to be created and used. I’m just telling it to leverage Delta Lake.
CREATE TABLE dune_characters ( id integer, name varchar(55), notes varchar(255) ) WITH (type = 'delta');
Now, add a few records.
INSERT INTO dune_characters (id, name, notes) VALUES (101, 'Leto', 'Ruler of House Atreides'), (102, 'Jessica', 'Concubine of the Duke');
Verify these records are present.
Ooof! I forgot the star of the book/movie!! Let’s get him inserted, too.
INSERT INTO dune_characters (id, name, notes) VALUES (103, 'Paul', 'Son of Leto');
Instead of just querying the table again to verify it has 3 rows now, here is a peek into the S3 bucket where the table’s data is stored. You can see 2 datasets there (1 from the first INSERT and the 2nd from the next one).
And that _delta_log folder? Yep, it is the mythical DeltaLog we touched upon earlier. Let’s take a peek into it!
Basically, we have 3 different JSON files. One for each version. Yes, the
CREATE TABLE created the
000 version. Here are a few relevant snippets from each of these files.
|Table Version||Relevant Snippets|
Sparked (pun definitely intended) your interest? Go back up and watch that hour-long video for more. I promise it can be fun if your the science fair kind of person that enjoys seeing the gory details.
For the rest of us, this is plenty for now!
Blog post originally published, here.