Fully managed in the cloudStarburst GalaxySelf-managed anywhereStarburst Enterprise
- Start Free
Fully managed in the cloud
This allows data to be managed and optimized more efficiently. At the same time, the increased structure allows for additional features. Hive introduced the first data lake table format. However, this format had a number of functionality limitations owing to its architecture. In recent years, new table formats have pushed data lakes in new directions, increasing both functionality and performance. Importantly, these table formats allow data lakes to achieve some of the efficiencies and compliance standards more typically associated with data warehouses or databases, while retaining the versatility of a data lake.
This includes enhanced ACID compliance, the ability to record transactional data efficiently, improved scalability, and the ability to update or delete records. These advancements are so significant that it is often said that data lakes using these technologies become more like data lakehouses, mixing the versatility of data lakes to handle raw and semi-structured data with the ability to process transactional workloads like a data warehouse.
Let’s explore how newer table formats have improved the versatility of data lakes by adding additional functionality. Particular attention will be paid to improvements in transactional data collection within data lakes.
Modern table formats are designed to provide enhanced performance and compliance capabilities for data lakes using cloud-based object storage.
While each table format varies, they all extend the features of a data lake in similar ways.
Data lakes use either HDFS or object storage. Both hold data in an immutable format. These have not typically provided an easy way to update files incrementally. If you consider that a database typically includes the ability to Create, Read, Update, or Delete (CRUD), a data lake has often only included the first two. Modern table formats help fix this by allowing the ability to update and delete records.
Data lakes tend to easily grow in size, and many are very, very large. With this, comes the need to scale their analytic capabilities to match. Newer table formats allow increased scalability when compared to Hive by introducing a new way of recording data at the file level. This is a marked improvement over Hive’s record keeping approach, which organized data by folder. This means that if a query requires data on a specific subset of data, it can search the specific files containing the information rather than searching the whole folder. This vastly improves performance and efficiency.
With the inclusion of ACID capabilities in table formats like Iceberg and Delta Lake, users can now achieve a level of transactional awareness within a data lake. This does not necessarily mean that a data lake would be a replacement for an OLTP system. However, it does ensure that groups of updates transactionally complete together or are rolled back if they cannot be completed. This is helpful with some of today’s evolving ETL pipelines.
Apache Iceberg is an open source table format used to structure the data held in data lakes. Like the other table formats listed, it was developed to solve the challenges of performance, data modification, and CRUD operations in the data lake. It can be used with either HDFS or any object-based cloud platform, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.
Notably, Iceberg also offers schema evolution, schema partitioning, and time travel. This allows users to apply and update schemas, apply and update partitions, and enact version control to roll back changes to a system to a previous state. All of these adaptations push the data lake to a new level of functionality and create new use-cases for data lakes.
Delta Lake is an open source framework developed by Databricks. Like other modern table formats, it employs file-level listings, improving the speed of queries considerably compared to the directory-level listing of Hive.
Like Iceberg, Delta Lake offers enhanced CRUD operations, including the ability to update and delete records in a data lake which would previously have been immutable. It is ACID compliant and often used in transactional systems. This use-case makes data lakes a viable replacement for traditional transactional databases, while retaining the cost and storage benefits of other data lakes.
Delta Lake can be used with Starburst via the Delta Lake connector.
Apache Hudi is another table format, which is used less often than Iceberg or Delta Lake. It addresses some of the same problems discussed above.
Architecturally, modern table formats are composed of a set of hierarchically structured metadata files. These files capture changes in the state of the data in the data lake. You can think of a table format as a kind of database transaction log, outlining all of the changes over the life of the data lake. This metadata is stored in a structured, readily accessible format.
The image below shows how metadata tracks the changes to the dataset. The files held in the Data layer are captured by the metadata files held in the Metadata layer. As the files change, the metadata files attached to them track these changes.
To achieve this, modern table formats create records pointing to individual metadata file locations. This file is known as a manifest file, and includes metadata containing information about the table at a given point in time. This acts as a kind of snapshot of the table, detailing the points at which a change is made. Multiple manifest files are stored in Manifest lists.
In the image below, changes in the Data layer have been detected in the Metadata layer. A new Manifest file and corresponding Manifest list has been created to capture these changes.
Manifest files and Manifest lists provide the ability to record an accurate, up-to-date account of the changes that have occurred over time. This includes inserts, deletions, updates, schema migrations, and partition changes. The changes themselves are stored in Metadata files known as Snapshots. Each snapshot is like a slice in time, allowing the dataset to be queried as it was multiple instances or rolled back to a previous state.
The image below shows how the changes in the Data layer have created a new Snapshot file, Snapshot 1. The original Snapshot file, Snapshot 0, is also retained. This creates a series of snapshots, each mapping changes to the data and recording those changes in the Metadata layer.
Up to $500 in usage credits included
Up to $500 in usage credits included