Apache Iceberg is an open-source table format that adds data warehouse-level capabilities to a traditional data lake. One of the Apache Software Foundation’s open-source projects, the Iceberg table format enables more efficient petabyte-scale data processing by creating an abstracted metadata layer that describes the files in a data lake’s object storage.
Choosing Iceberg as the table format for an open lakehouse provides several benefits. To see where it fits, consider that an open data lakehouse comprises four elements: commodity storage, open file formats, open table formats, and high-performance query engines.
Commodity storage: You can build a lakehouse on storage platforms like Amazon S3. These cloud services provide efficient, cost-effective ways to store different data types at scale.
Open file formats: Various open file formats like Avro, Parquet, and ORC let you optimize how you collect and store data in your lake.
Open table formats: Iceberg is the open table format of the data lakehouse architecture. Its rich metadata files and analytics-optimized structure allow query engines to run more efficiently.
Query engines: High-performance query engines like Apache Spark and Trino (formerly PrestoSQL) are optimized for big data analytics.
Iceberg tables use metadata, snapshots, and manifests to track individual data files. Any changes to the table are made to these components rather than the data itself. This approach gives Iceberg more robust functionality than predecessors like Apache Hive.
Like a database’s SQL tables, Iceberg’s metadata files describe the table’s schema, partitioning, and other information. They also track snapshots of the table’s data files. Iceberg generates a new snapshot any time the table’s state changes. As a result, these tables retain a complete history of state changes.
A manifest file describes the table’s data files. A snapshot may contain multiple manifest files it documents in a manifest list. Iceberg’s approach to manifest files and lists reduces overhead and makes queries more efficient.
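This metadata-to-manifest-to-data-file hierarchy can be sketched in a few lines of code. The sketch below is a hypothetical, simplified model for illustration only: real Iceberg stores table metadata as JSON files and manifests as Avro files, and the class and field names here are not the Iceberg API.

```python
# Simplified, illustrative model of Iceberg's metadata hierarchy.
# Real Iceberg uses JSON metadata files and Avro manifests; all names
# here are hypothetical and exist only to show the structure.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ManifestFile:
    """Describes a group of data files (paths plus per-file stats)."""
    data_files: List[str]

@dataclass
class Snapshot:
    """One immutable view of the table: a manifest list pointing at manifests."""
    snapshot_id: int
    manifest_list: List[ManifestFile]

@dataclass
class TableMetadata:
    """Top-level metadata: schema, partition spec, and snapshot history."""
    schema: dict
    partition_spec: List[str]
    snapshots: List[Snapshot] = field(default_factory=list)

    def current_snapshot(self) -> Snapshot:
        return self.snapshots[-1]

# Build a tiny table: one snapshot whose manifest list tracks two manifests.
table = TableMetadata(
    schema={"id": "long", "ts": "timestamp"},
    partition_spec=["day(ts)"],
)
table.snapshots.append(
    Snapshot(
        snapshot_id=1,
        manifest_list=[
            ManifestFile(data_files=["s3://bucket/data/a.parquet"]),
            ManifestFile(data_files=["s3://bucket/data/b.parquet"]),
        ],
    )
)

# A query engine walks metadata -> snapshot -> manifest list -> manifest
# -> data files, never listing the object store directly.
files = [f for m in table.current_snapshot().manifest_list for f in m.data_files]
print(files)
```

The key point the sketch captures: queries resolve data files by walking metadata, so planning cost depends on metadata size rather than on expensive object-store directory listings.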
Related viewing: A talk by Ryan Blue, Co-Creator of Apache Iceberg
Related reading: Shopify’s story of moving to Iceberg from Hive with Trino, and their performance gains; Inspecting Trino on Ice
“The time it took from reporting Trino + Iceberg bugs to deployed fixes was very fast and demonstrated the commitment Trino has to being a leader in Iceberg adoption across the various compute engines.”
Marc Laforet, Shopify
Companies with petabyte-scale data ecosystems are the primary users of Iceberg. Large datasets place enormous demands on information architectures. Iceberg’s use cases simplify big data management on a data lake.
Data lakes can store unstructured and structured data, making them better suited to modern data analytics demands. However, data lakes cannot match a data warehouse platform’s full suite of analytics capabilities. As a result, companies often layer data warehouses on top of the data lakes. Besides the added complexity, this approach increases costs and the risks associated with data moves and duplication.
Iceberg’s open table format adds rich metadata and query engine compatibility to blend a warehouse’s analytics capabilities with a lake’s storage efficiency. This simpler architecture eliminates data warehouses and turns a lake into the enterprise’s central analytics resource.
Once generated, data takes time to settle. For example, data associated with customer orders can change anywhere from their initial creation to the end of a return window. Regulated personal data must be purged at intervals set by compliance policies. Frequent, small changes to large datasets place enormous workloads on data systems.
Iceberg’s design allows these granular changes to occur without imposing performance penalties.
Enterprise applications and users often need access to the same data simultaneously. However, allowing concurrent access can be risky. When users of a dataset read and write at the same time, the resulting inconsistencies may contaminate downstream analysis.
Iceberg isolates the lake’s raw data through metadata abstraction, instead giving users access to unique snapshots of the data table. Changes result in a new snapshot, but the users can continue using the original snapshot to preserve consistency and repeatability.
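The snapshot mechanism behind this isolation can be illustrated with a short sketch. This models the idea only, under the assumption that every write produces a new immutable snapshot; the `Table` class and its methods are hypothetical, not the Iceberg API.

```python
# Illustrative sketch of snapshot isolation: a reader pins the snapshot that
# was current when its query began, while a writer commits a new snapshot.
# Hypothetical classes for illustration; this is not the Iceberg API.

class Table:
    def __init__(self):
        self.snapshots = [tuple()]  # snapshot 0: empty table

    def current(self):
        return len(self.snapshots) - 1

    def scan(self, snapshot_id):
        return self.snapshots[snapshot_id]

    def append(self, rows):
        # Writes never mutate an existing snapshot; they add a new one.
        new = self.snapshots[self.current()] + tuple(rows)
        self.snapshots.append(new)

table = Table()
table.append(["order-1", "order-2"])

reader_snapshot = table.current()   # reader pins snapshot 1

table.append(["order-3"])           # concurrent writer commits snapshot 2

# The reader still sees a consistent view of snapshot 1...
print(table.scan(reader_snapshot))
# ...while new queries see the latest state.
print(table.scan(table.current()))
```

Because the reader's snapshot is never modified, its results stay consistent and repeatable even while writers commit, which is exactly the property the paragraph above describes.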
Many table formats can group data by common properties. This partitioning lets queries skip irrelevant data and return results faster at a lower cost. However, formats like Hive force users to have a deep understanding of table structure and partitioning in order to prevent errors or inaccurate query results.
Iceberg hides aspects of partitioning from users by, for example, automating the creation of partition values and avoiding irrelevant partitions. Hidden partitioning lets Iceberg partitions evolve without affecting queries.
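A minimal sketch of how a partition transform works follows, assuming an Iceberg-style `day()` transform that derives partition values from an ordinary timestamp column; the function and row layout are illustrative, not Iceberg's implementation.

```python
# Illustrative sketch of hidden partitioning: partition values are derived
# from a regular column by a transform (here, an Iceberg-style day()),
# so users filter on the column itself and never manage a partition column.
from datetime import datetime, date

def day_transform(ts: datetime) -> date:
    """Derive the partition value from the timestamp column."""
    return ts.date()

rows = [
    {"order_id": 1, "ts": datetime(2024, 3, 1, 9, 30)},
    {"order_id": 2, "ts": datetime(2024, 3, 1, 17, 5)},
    {"order_id": 3, "ts": datetime(2024, 3, 2, 8, 0)},
]

# Writers bucket rows by the derived value; a reader's filter on ts can
# then be converted to partition pruning automatically.
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["ts"]), []).append(row)

# A query for March 1st touches only one partition's files:
hit = partitions[date(2024, 3, 1)]
print([r["order_id"] for r in hit])  # [1, 2]
```

Since the partition value is computed rather than stored as a user-visible column, the table can later switch transforms (say, from daily to hourly) without rewriting queries.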
Related reading: Iceberg Partitioning and Performance Optimizations in Trino
| Apache Hudi | Delta Lake | Apache Iceberg |
| --- | --- | --- |
| Original table format, created for time/event series data | Associated with Databricks | Associated with Trino |
| Great for streaming use cases | Open source doesn’t support concurrent writers | Metadata tree is more performant using Avro |
| Copy on write & merge on read | Only supports Parquet | Supports Parquet, ORC, and Avro |
| Partition columns must be part of the table | Can’t change partitioning | Partition and table evolution |
| Relies heavily on the metastore | Writes a checkpoint every 10 commits, so every 10th write is slower | Table evolution, compaction, etc. |
Data lakehouses are still relatively recent developments, with solutions based on Apache’s Iceberg or Hudi projects or Databricks’ Delta Lake format. These three options offer similar functionality, but the devil is in the details.
Developers at Netflix created Iceberg to address the challenges of using Apache Hive on the streaming service’s extensive data infrastructure.
Hive uses a subsystem called a metastore that points to a table’s data. However, it points only to the folders containing the data files, not the files themselves. That may be acceptable in a structured environment like a Hadoop-based data warehouse, but using this approach with object storage imposes stiff performance penalties.
Another performance hit comes from how Hive interacts with Hadoop, which relies on Java-based MapReduce jobs to interact with data. Few data consumers have the specialized skills in Java+MapReduce needed to query Hadoop data stores. Hive implements HiveQL to create an SQL-like approach for generating Hadoop queries. However, the Hive approach means every query command requires a translation step between HiveQL and Java.
Netflix’s developers set out to create a new table format that addressed these and other issues Hive creates when analyzing petabytes of data. Eventually, Netflix handed the project over to the Apache Software Foundation, where it has flourished.
Although Iceberg and Parquet are both Apache open source projects, they address different aspects of the data lakehouse architecture. Whereas Iceberg is an open table format, Parquet is an open file format for creating column-oriented data files on a data lake. This structure compresses more efficiently than a row-oriented format like Avro, which reduces overall storage costs. In addition, Parquet files help speed queries by, for example, providing metadata that queries can use to skip irrelevant data.
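The advantage of column orientation can be shown with a toy sketch. This is not Parquet's actual encoding; it merely illustrates, with JSON as a stand-in serialization, why a query that needs one column can read far fewer bytes from a columnar layout than from a row-oriented one.

```python
# Toy illustration of row vs column orientation (not Parquet's real format):
# in a columnar layout, a query that needs one column reads only that
# column's bytes instead of scanning every full row.
import json

rows = [{"country": "US", "amount": i % 10} for i in range(1000)]

# Row-oriented layout: whole records stored together.
row_layout = json.dumps(rows).encode()

# Column-oriented layout: each column stored contiguously.
col_layout = {
    "country": json.dumps([r["country"] for r in rows]).encode(),
    "amount": json.dumps([r["amount"] for r in rows]).encode(),
}

# Summing `amount` from the columnar layout touches only that column's bytes.
touched = len(col_layout["amount"])
full_scan = len(row_layout)
print(touched < full_scan)  # True

total = sum(json.loads(col_layout["amount"]))
print(total)  # 4500
```

Storing a column's values contiguously also groups similar data together, which is one reason columnar files compress well in practice.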
Iceberg’s open table format lets you connect your data lakehouse to any query engine. Starburst’s modern data lake analytics platform lets you connect to any data source. Using Starburst to power the analytics of your Iceberg-based data lake makes data more manageable, optimizes compute and storage investments, and speeds time to insight for more effective decision-making.
Starburst is based on the Trino open source project’s massively parallel query engine, with added optimizations designed to maximize the features of Iceberg data tables, including schema evolution, time travel, and partitioning.
Since some workloads work best with different table formats, we created Great Lakes, a connectivity feature of Starburst Galaxy. Great Lakes abstracts the details of a data lake’s table formats and file formats to simplify accessing tables, whether based on Iceberg, Hudi, or Delta Table. Starburst’s Great Lakes lets data teams optimize their data lake architectures for different use cases. Data consumers can run SQL queries without having to know the details of each table’s format.
Related resources: tutorials, a Trino and Iceberg cheat sheet, and migrating Hive tables
In this exciting exploration, we’re delving into the powerful combination of Apache Iceberg and Trino, two dynamic tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare its use to AWS Athena.