
Automated maintenance for Apache Iceberg tables in Starburst Galaxy

Last Updated: February 16, 2024

This post is part of the Iceberg blog series. Read the entire series:

  1. Introduction to Apache Iceberg in Trino
  2. Iceberg Partitioning and Performance Optimizations in Trino
  3. Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
  4. Apache Iceberg Schema Evolution in Trino
  5. Apache Iceberg Time Travel & Rollbacks in Trino
  6. Automated maintenance for Apache Iceberg tables in Starburst Galaxy
  7. Improving performance with Iceberg sorted tables
  8. How to migrate your Hive tables to Apache Iceberg

    One of the great things about Trino and Iceberg is the ability to perform database-type functions right inside your object storage. There are some routine maintenance tasks that need to be performed on these tables to ensure optimal performance and to remove old snapshots.

    In this blog, we’ll show you how easy it is to create a table-driven maintenance process for Iceberg tables. An Iceberg table will be created to store a list of tables, and a Python script will be used to optimize them and clean up old snapshots. The script can be executed using any scheduling/orchestration tool of your choice.

    Table Definition

    First, we’ll build an Iceberg table to hold our table information with flags for each table on how we want to handle the maintenance on them:

    CREATE TABLE IF NOT EXISTS iceberg_maintenance_schedule (
        table_name VARCHAR NOT NULL,
        should_analyze INTEGER,
        last_analyzed_on TIMESTAMP(6),
        days_to_analyze INTEGER,
        columns_to_analyze ARRAY(VARCHAR),
        should_optimize INTEGER,
        last_optimized_on TIMESTAMP(6),
        days_to_optimize INTEGER,
        should_expire_snapshots INTEGER,
        retention_days_snapshots INTEGER,
        should_remove_orphan_files INTEGER,
        retention_days_orphan_files INTEGER
    )
    WITH (
       type = 'ICEBERG'
    );

    *** Note: the script below will create this table if it doesn’t exist ***

    Next, we’ll populate the table with our initial list of tables that we want to manage:

    insert into iceberg_maintenance_schedule values
    ('customer_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
    insert into iceberg_maintenance_schedule values
    ('orders_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
    insert into iceberg_maintenance_schedule values
    ('lineitem_iceberg',1,NULL,7,NULL,1,NULL,7,1,7,1,7);
    insert into iceberg_maintenance_schedule values
    ('iceberg_maintenance_schedule',1,NULL,7,NULL,1,NULL,7,1,7,1,7);

    Now we have our table populated with our three Iceberg tables. Notice our own maintenance table is listed in there as well, so it gets maintained right along with the others.

    As new Iceberg tables are created, they get inserted into this table along with their maintenance options. This makes it very easy to add and remove Iceberg tables from the maintenance process.

    Note: This blog is a deliberately simple example; each table could have different timings for each of the operations, different schedules per table, and so on, but we wanted to show something straightforward that you can take and build upon.

    Next, we use the Trino Python client to write a script that reads the rows in the iceberg_maintenance_schedule table and executes optimize, expire snapshots, remove orphan files, and analyze operations on each table based on the values in the table above.

    Python Script

    The Python script included in this blog post handles the following:

    1. Read the iceberg_maintenance_schedule table and process one row at a time
    2. For each row, execute the following based on the parameters from the table:
      1. Optimize
      2. Expire older snapshots
      3. Remove orphan files
      4. Analyze
    3. Update the metadata columns (last_optimized_on, last_analyzed_on, etc.)

    Python script github: https://github.com/mdesmet/trino-iceberg-maintenance
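
    The full script linked above adds worker threads, error handling, and other niceties; the snippet below is only a minimal sketch of the per-table loop to show the shape of the idea. The helper names run_maintenance and _is_due are made up for this example, it assumes a DB-API cursor from the trino Python client (a connection sketch follows the run example below), and it only updates one of the bookkeeping timestamps. The ALTER TABLE ... EXECUTE and ANALYZE statements are the standard Trino Iceberg maintenance commands.

    from datetime import datetime, timedelta

    def run_maintenance(cursor):
        """Sketch: read the maintenance table and run the flagged operations."""
        cursor.execute("""
            SELECT table_name,
                   should_optimize, days_to_optimize, last_optimized_on,
                   should_expire_snapshots, retention_days_snapshots,
                   should_remove_orphan_files, retention_days_orphan_files,
                   should_analyze, days_to_analyze, last_analyzed_on
            FROM iceberg_maintenance_schedule
        """)
        now = datetime.now()
        for (table_name,
             should_optimize, days_to_optimize, last_optimized_on,
             should_expire, retention_snapshots,
             should_remove_orphans, retention_orphans,
             should_analyze, days_to_analyze, last_analyzed_on) in cursor.fetchall():

            # Compact small files when the table is flagged and the interval has elapsed.
            if should_optimize and _is_due(last_optimized_on, days_to_optimize, now):
                cursor.execute(f"ALTER TABLE {table_name} EXECUTE optimize")
                cursor.fetchall()  # drain results so the statement completes
                cursor.execute(
                    f"UPDATE iceberg_maintenance_schedule "
                    f"SET last_optimized_on = localtimestamp(6) "
                    f"WHERE table_name = '{table_name}'"
                )
                cursor.fetchall()

            # Drop snapshots older than the configured retention window.
            if should_expire:
                cursor.execute(
                    f"ALTER TABLE {table_name} EXECUTE "
                    f"expire_snapshots(retention_threshold => '{retention_snapshots}d')"
                )
                cursor.fetchall()

            # Delete files no longer referenced by any snapshot.
            if should_remove_orphans:
                cursor.execute(
                    f"ALTER TABLE {table_name} EXECUTE "
                    f"remove_orphan_files(retention_threshold => '{retention_orphans}d')"
                )
                cursor.fetchall()

            # Refresh table statistics for the cost-based optimizer.
            if should_analyze and _is_due(last_analyzed_on, days_to_analyze, now):
                cursor.execute(f"ANALYZE {table_name}")
                cursor.fetchall()
                # (a matching UPDATE of last_analyzed_on is omitted for brevity)

    def _is_due(last_run, interval_days, now):
        """True when the operation has never run or its interval has passed."""
        return last_run is None or now - last_run >= timedelta(days=interval_days or 0)

    In practice you would also want parameterized queries and some retry logic; see the linked repository for the full implementation.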

    An example of running the Python script:

    export NUM_WORKERS=10
    export TRINO_HOST=tnats-aws.trino.galaxy.starburst.io
    export TRINO_PORT=443
    export TRINO_USER=tnats@starburstdata.com/accountadmin
    export TRINO_PASSWORD=xxxxxxxxxxxxxxxxx
    export TRINO_CATALOG=s3lakehouse
    export TRINO_SCHEMA=demo_tpch
    /usr/bin/python3 -m trino_iceberg_maintenance
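
    For reference, those environment variables map onto a trino Python client connection roughly like this (a sketch only; the linked script's actual wiring may differ slightly):

    import os

    import trino
    from trino.auth import BasicAuthentication

    # Build a connection to Starburst Galaxy from the environment variables above.
    conn = trino.dbapi.connect(
        host=os.environ["TRINO_HOST"],
        port=int(os.environ.get("TRINO_PORT", "443")),
        user=os.environ["TRINO_USER"],
        http_scheme="https",
        auth=BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWORD"]),
        catalog=os.environ["TRINO_CATALOG"],
        schema=os.environ["TRINO_SCHEMA"],
    )
    cursor = conn.cursor()

    # Hand the cursor to the maintenance loop sketched earlier.
    run_maintenance(cursor)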

    Screenshot showing an example of the statements running in Starburst Galaxy:

    Making Changes

    Since our maintenance table is also an Iceberg table, we can easily make modifications to this table based on our needs. For example, if we wanted to keep 14 days of customer history, we would write a simple update to our table:

    update iceberg_maintenance_schedule set retention_days_snapshots = 14 where table_name = 'customer_iceberg';

    Now, our automated process will keep 14 days of history for time travel, and we can revert back to any of those days if needed. More information is available in the time travel blog post linked in the series above.

    Scheduling

    Scheduling the execution of the Python script can be done in a variety of ways using a tool of your choice. Usually I use something like Airflow or a similar orchestration tool, but I recently found a neat little utility called Cronitor that provides a very easy way to monitor cron jobs. Installation is quick, and for each cron job it gives you a nice-looking dashboard as well as different alert targets such as email and Slack.

    To get started, you follow these steps:

    1. Visit https://cronitor.io and sign up for a free account
    2. Install the cronitor program on your Linux VM or Mac

      curl https://cronitor.io/install-linux?sudo=1 -H "API-KEY: <UniqueKeyTheyGiveYou>" | sh
    3. Now, simply run “cronitor discover” and it will go through any existing crontabs you have and ask you to name each one. 
    4. From there, you will get a nice dashboard showing the different runs for each cron job, and you can set up additional alert targets such as Slack.

    Screenshot showing my Iceberg MX cronjob dashboard:

    My crontab -l (notice that when I ran cronitor discover, it commented out my old job and added a new one):

    # cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh

    0 0 * * * cronitor exec bOHpJi cd /home/tnats/scripts/trino-iceberg-maintenance;./run.sh

    My script runs once a day, but you can adjust the schedule to run weekly (for example, 0 0 * * 0) or at whatever interval suits your tables.

    Wrapping Up

    Now you have an automated way of performing maintenance on all of your Iceberg tables! Just insert new Iceberg tables into the maintenance table as part of your table creation process, and you get to enjoy all of the benefits of this fully featured table format, knowing your tables will keep performing at their peak thanks to this automated maintenance process.

    Do you have some suggestions or feedback? Please feel free to contact me or Michiel. 

    tom@starburst.io

    michiel.desmet@starburst.io
