Automated maintenance for Apache Iceberg tables in Starburst Galaxy
This post is part of the Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in Trino
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
One of the great things about Trino and Iceberg is the ability to perform database-style operations right inside your object storage. These tables need some routine maintenance to keep performance optimal and to remove old snapshots.
In this blog, we’ll show you how easy it is to create a table-driven maintenance process for Iceberg tables. We'll create an Iceberg table to store the list of tables to maintain, and use a Python script to optimize them and clean up old snapshots. The script can be executed using any scheduling/orchestration tool of your choice.
First, we’ll build an Iceberg table to hold our table information, with flags for each table that control how its maintenance is handled:
CREATE TABLE IF NOT EXISTS iceberg_maintenance_schedule (
    table_name VARCHAR NOT NULL,
    should_analyze INTEGER,
    last_analyzed_on TIMESTAMP(6),
    days_to_analyze INTEGER,
    columns_to_analyze ARRAY(VARCHAR),
    should_optimize INTEGER,
    last_optimized_on TIMESTAMP(6),
    days_to_optimize INTEGER,
    should_expire_snapshots INTEGER,
    retention_days_snapshots INTEGER,
    should_remove_orphan_files INTEGER,
    retention_days_orphan_files INTEGER
)
WITH (
    type = 'ICEBERG'
);
*** Note: the script below will create this table if it doesn’t exist ***
Next, we’ll populate the table with the initial list of tables that we want to manage:
insert into iceberg_maintenance_schedule values ('customer_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values ('orders_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values ('lineitem_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
insert into iceberg_maintenance_schedule values ('iceberg_maintenance_schedule',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
Now our table is populated with three Iceberg data tables plus the maintenance table itself (more on that later).
As new Iceberg tables are created, a row is inserted into this table for each one along with its maintenance options. This makes it very easy to add tables to, or remove them from, the maintenance process.
Note: This blog uses a deliberately simple example in which every table has the same settings. Each table could have different timings for each operation, and you can build on this example by giving each table its own schedule.
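As a rough illustration of how per-table timings could be evaluated, here is a minimal Python sketch. It assumes an operation is due when it has never run or when at least days_to_optimize (or days_to_analyze) days have passed since the last run. The function name is_due is hypothetical, and the linked script may decide this differently.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due(last_run: Optional[datetime], interval_days: Optional[int], now: datetime) -> bool:
    """Return True when an operation should run again for a table.

    Assumption: a NULL timestamp or a NULL interval means "run it now".
    """
    if last_run is None or interval_days is None:
        return True
    return now - last_run >= timedelta(days=interval_days)


now = datetime(2024, 1, 15)
print(is_due(None, 7, now))                   # never run before -> True
print(is_due(datetime(2024, 1, 10), 7, now))  # ran 5 days ago   -> False
print(is_due(datetime(2024, 1, 1), 7, now))   # ran 14 days ago  -> True
```

With the rows inserted above, last_optimized_on and last_analyzed_on are NULL, so under this assumption every table would be processed on the first run.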
Next, we use the Trino Python client to write a script that reads the rows in the iceberg_maintenance_schedule table and runs optimize, expire_snapshots and analyze on each listed table, based on the values in the table above.
The Python script included in this blog post will handle the following:
- Reading the iceberg_maintenance_schedule table and processing one row at a time
- For each row execute the following based on the parameters from the table:
- Remove older snapshots
- Update the metadata columns (last_optimized_on, etc.)
Python script on GitHub: https://github.com/mdesmet/trino-iceberg-maintenance
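To make the per-row processing more concrete, the sketch below turns one row of the maintenance table into the Trino statements that would be run for it. This is an illustration rather than the code from the linked repository: the function name build_statements and the order of operations are assumptions, the row is represented as a plain dictionary, and table names are assumed to be trusted because they are interpolated directly into the SQL.

```python
from typing import Any, Dict, List


def build_statements(row: Dict[str, Any]) -> List[str]:
    """Build the maintenance SQL for one row of the schedule table."""
    table = row["table_name"]
    statements: List[str] = []

    if row.get("should_optimize") == 1:
        # Compact small files into larger ones.
        statements.append(f"ALTER TABLE {table} EXECUTE optimize")

    if row.get("should_expire_snapshots") == 1:
        days = row["retention_days_snapshots"]
        statements.append(
            f"ALTER TABLE {table} EXECUTE expire_snapshots(retention_threshold => '{days}d')"
        )

    if row.get("should_remove_orphan_files") == 1:
        days = row["retention_days_orphan_files"]
        statements.append(
            f"ALTER TABLE {table} EXECUTE remove_orphan_files(retention_threshold => '{days}d')"
        )

    if row.get("should_analyze") == 1:
        columns = row.get("columns_to_analyze")
        if columns:
            # Restrict statistics collection to the listed columns.
            quoted = ", ".join(f"'{c}'" for c in columns)
            statements.append(f"ANALYZE {table} WITH (columns = ARRAY[{quoted}])")
        else:
            statements.append(f"ANALYZE {table}")

    return statements


row = {
    "table_name": "customer_iceberg",
    "should_analyze": 1, "columns_to_analyze": None,
    "should_optimize": 1,
    "should_expire_snapshots": 1, "retention_days_snapshots": 7,
    "should_remove_orphan_files": 1, "retention_days_orphan_files": 7,
}
for statement in build_statements(row):
    print(statement)
```

Each returned string could then be passed to a cursor from the Trino Python client. Keeping statement generation separate from execution makes the logic easy to test without a running cluster.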
An example of running the Python script:
export NUM_WORKERS=10
export TRINO_HOST=tnats-aws.trino.galaxy.starburst.io
export TRINO_PORT=443
export TRINO_USER=email@example.com/accountadmin
export TRINO_PASSWORD=xxxxxxxxxxxxxxxxx
export TRINO_CATALOG=s3lakehouse
export TRINO_SCHEMA=demo_tpch
/usr/bin/python3 -m trino_iceberg_maintenance
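For readers curious how a script might consume these environment variables, here is a small sketch. The variable names come from the example above, but the function load_settings, the returned dictionary keys, and the default values are assumptions for illustration. The actual script may read its configuration differently.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect connection settings from environment variables.

    Assumption: TRINO_PORT defaults to 443 and NUM_WORKERS to 1 when unset.
    A missing required variable raises KeyError.
    """
    env = os.environ if env is None else env
    return {
        "host": env["TRINO_HOST"],
        "port": int(env.get("TRINO_PORT", "443")),
        "user": env["TRINO_USER"],
        "password": env["TRINO_PASSWORD"],
        "catalog": env["TRINO_CATALOG"],
        "schema": env["TRINO_SCHEMA"],
        "num_workers": int(env.get("NUM_WORKERS", "1")),
    }


# Example with an explicit mapping instead of the real environment.
settings = load_settings({
    "TRINO_HOST": "example.trino.galaxy.starburst.io",
    "TRINO_USER": "email@example.com/accountadmin",
    "TRINO_PASSWORD": "secret",
    "TRINO_CATALOG": "s3lakehouse",
    "TRINO_SCHEMA": "demo_tpch",
    "NUM_WORKERS": "10",
})
print(settings["port"], settings["num_workers"])
```

With the Trino Python client, values like these would typically be passed to trino.dbapi.connect, using trino.auth.BasicAuthentication for the user and password over HTTPS.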
An example of the statements running in Starburst Galaxy:
Since our maintenance table is also an Iceberg table, we can easily make modifications to this table based on our needs. For example, if we wanted to keep 14 days of customer history, we would write a simple update to our table:
update iceberg_maintenance_schedule set retention_days_snapshots = 14 where table_name = 'customer_iceberg';
Now our automated process will keep 14 days of snapshot history for the customer table, so we can time travel or roll back to any point in that window if needed. The time travel and rollbacks post in this series has more information.
Scheduling the execution of the Python script can be done in a variety of ways using a tool of your choice. Usually I use something like Airflow or a similar orchestration tool, but I found a neat little utility called Cronitor, which provides a very easy way to monitor cron jobs. Installation is simple, and for each cron job it provides a nice-looking dashboard of executions as well as alert targets such as email and Slack.
To get started, follow these steps:
- Visit https://cronitor.io and sign up for a free account
- Install the cronitor program on your Linux VM or Mac
curl https://cronitor.io/install-linux?sudo=1 -H "API-KEY: <UniqueKeyTheyGive You>" | sh
- Now, simply run “cronitor discover” and it will go through any existing crontabs you have and ask you to name each one.
- From there, you will get a nice dashboard showing the runs for each cron job, and you can set up additional alert targets such as Slack.
Screenshot showing my Iceberg MX cronjob dashboard:
My crontab -l: (notice, when I ran cronitor discover, it comments out my old job and adds a new one)
# cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh
0 0 * * * cronitor exec bOHpJi cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh
My script runs once a day, but you can adjust it to run weekly, etc.
Now you have an automated way of performing maintenance on all of your Iceberg tables! Just insert a row for each new Iceberg table as part of your table creation process, and you get to enjoy all the benefits of this fully featured table format while knowing your tables are kept at peak performance by this automated maintenance process.
Do you have some suggestions or feedback? Please feel free to contact me or Michiel.