
Materialized views have become available inStarburst Galaxy for catalogs using Great Lakes connectivity. For folks who are NOT already using Starburst Galaxy — come sign up — it’s FREE — especially if you want to exercise the content in this blog post. To test this all out you will need to connect to one of the cloud object stores, too. Feel to look at my querying data in the cloud post if you need a jump start on that.
Please check out training opportunities from Starburst Academy where an enhanced version of this information will soon be available. And yes, enough with all of that, let’s DO SOMETHING!
Define Table and Query
Create a couple of schemas in a catalog aligned to your cloud object store.
CREATE SCHEMA mycloud.mv_storage;
CREATE SCHEMA mycloud.mvtest;
USE mycloud.mvtest;
Create an Apache Iceberg table and populate it.
CREATE TABLE dune_characters (
id integer,
name varchar(55),
notes varchar(255)
)
WITH (type = 'iceberg');
INSERT INTO dune_characters
(id, name, notes)
VALUES
(101, 'Leto', 'Ruler of House Atreides'),
(102, 'Jessica', 'Concubine of the Duke');
Run a very simple query to base a materialized view on.
SELECT name, notes
FROM dune_characters;
Create/Load the Materialized View
Create a materialized view with the previous defining query.
CREATE MATERIALIZED VIEW mat_view
WITH (storage_schema = 'mv_storage')
AS SELECT name, notes
FROM dune_characters;
The WITH
clause is optional, but the default behavior for the Iceberg connector writes the underlying “storage table” to the same schema. In practice, it makes sense to locate the storage table somewhere else (if only to limit consumer confusion).
If you queried mat_view
now the results would be as expected, but the query would actually access the table in the defining query. The following command will (re)build the underlying storage table.
REFRESH MATERIALIZED VIEW mat_view;
You can now see the dune_characters
table and the mat_view
materialized view in the mvtest
schema as well as the storage table located in the mv_storage
schema.
Running a query on the materialized view returns the expected results.
SELECT * FROM mat_view;
To fully verify that the storage table was accessed instead of the original table, click on Query details from the screen shot above and then the Advanced tab under Query ID heading on the page that renders.
As you can see above, the storage table that starts with st_
from the mv_storage
schema is being accessed instead of the dune_characters
table from the defining query.
Addressing Staleness
Unlike Starburst Enterprise’s Hive connector, there is no automatic or incremental refresh options for the Iceberg connector’s materialized views. This suggests that the data engineering pipeline that makes modifications to the underlying table should consider this and run the REFRESH
command when most appropriate.
A common concern with materialized views on the data lake is the fact that they can actually return stale data. The Iceberg-backed materialized views that Starburst Galaxy utilizes are susceptible to the concern UNLESS all tables in the defining query are are using the Iceberg table format themselves.
If you are unaware that Iceberg keeps track of versions of the table that can be used for time-travel queries and rollbacks, please check out the Introduction to Apache Iceberg in Trino blog series.
Times like this call for the old RTFM adage.
Refreshing a materialized view also stores the snapshot-ids of all Iceberg tables that are part of the materialized view’s query in the materialized view metadata. When the materialized view is queried, the snapshot-ids are used to check if the data in the storage table is up to date. If the data is outdated, the materialized view behaves like a normal view, and the data is queried directly from the base tables. Detecting outdated data is possible only when the materialized view uses Iceberg tables only, or when it uses mix of Iceberg and non-Iceberg tables but some Iceberg tables are outdated. When the materialized view is based on non-Iceberg tables, querying it can return outdated data, since the connector has no information whether the underlying non-Iceberg tables have changed.
https://docs.starburst.io/latest/connector/iceberg.html#materialized-views
Let’s test it out by adding another row to the table in the defining query and then querying mat_view
again.
INSERT INTO dune_characters
(id, name, notes)
VALUES
(103, 'Paul', 'Son of Leto');
SELECT * FROM mat_view;
As you can see from the output, even the new row surfaced. Remember, this is only happening since all of the tables in the defining query are created with the Iceberg table format. You can verify the actual table was accessed from the web UI.
A cool feature for sure, but one you need to understand as it wasn’t the behavior I expected initially. Since the underlying Trino Iceberg connector is open-source (and they’re ALWAYS accepting contributions!), I’ve added an item on my potential development page to create an allow_staleness
property for scenarios where you might not want this behavior.
Trust, but Verify…
For those of us who like to trust what we hear, but (maybe too) often want to “validate” what we hear from others by just trying it out… you could rerun everything above, but this time create the dune_characters
table using WITH (type = 'hive')
instead. 😉
What about Starburst Enterprise?
Great news!! This all works on Starburst Enterprise, too, as called out in the Starburst Iceberg connector!
This tutorial was originally published by Lester Martin.