Turning to open source lets companies build an affordable yet performant analytics infrastructure that keeps pace with data’s rapid growth.
This article will introduce the open data lakehouse concept and its use cases, as well as explain how Apache Iceberg and Trino combine to form a powerful analytics resource.
An open data lakehouse is a data analytics architecture that combines a data lake’s cost-effective storage with a data warehouse’s robust analytics. These lakehouses combine open-source table formats, file formats, and query engines on commodity cloud services like AWS and Azure to make big data analytics scalable and accessible.
Artificial intelligence and machine learning algorithms have become critical sources of competitive advantage. Leveraging these data science tools generates innovative products and streamlines business processes. Increasingly, AI/ML projects run into the limitations of traditional data warehouses and data lakes.
Data scientists need more varied datasets than warehouses can provide. A reliance on structured data misses key patterns and insights within vast quantities of a company’s unstructured data.
Data lakes can hold this varied data, but the objects they store have limited metadata compared to the files in a warehouse. Furthermore, lakes can’t match a data warehouse’s rich exploration and discovery resources.
Whether pulling data from other sources to compensate for a warehouse’s limitations or extracting data from a data lake’s repository, a data science project will require complex ETL pipelines and extensive data engineering resources.
An open data lakehouse provides a high-performance solution for big data processing. Scientists can easily explore structured and unstructured datasets thanks to rich metadata and powerful query engines. These accessible tools reduce the reliance on over-tasked data teams, speeding the development of new artificial intelligence applications.
An open data lakehouse platform builds upon commercial cloud object storage services but otherwise draws from the open-source ecosystem to construct a scalable, performant analytics solution. The three open-source components are file formats, table formats, and compute engines.
Amazon S3, Azure Blob Storage, and other cloud platforms provide commodity data storage at petabyte scales. By decoupling storage from compute, data teams can optimize the costs and performance of each independently. These cloud platforms are also more flexible, scalable, and affordable than on-premises storage infrastructure.
Open file formats define how a lakehouse writes and reads data. Columnar file formats structure data in ways that enhance query performance, providing detailed metadata and indexes that queries use to skip irrelevant data. Examples of open file formats include Apache Parquet and ORC.
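To make the data-skipping idea concrete, here is a minimal pure-Python sketch in the spirit of Parquet/ORC row-group statistics. The structure and names are invented for illustration; real columnar formats encode far richer metadata.

```python
# Illustrative sketch of statistics-based data skipping, modeled loosely on
# Parquet/ORC row-group metadata (layout and names invented, not the real format).

# Each "row group" stores a column's values plus min/max statistics.
row_groups = [
    {"stats": (10, 45),  "price": [10, 22, 45]},
    {"stats": (50, 90),  "price": [50, 61, 90]},
    {"stats": (95, 120), "price": [95, 101, 120]},
]

def scan(groups, lo, hi):
    """Read only row groups whose [min, max] range can overlap [lo, hi]."""
    results, groups_read = [], 0
    for rg in groups:
        g_min, g_max = rg["stats"]
        if g_max < lo or g_min > hi:   # statistics prove no match: skip the group
            continue
        groups_read += 1
        results.extend(v for v in rg["price"] if lo <= v <= hi)
    return results, groups_read

# A query for price BETWEEN 55 AND 100 touches only two of the three row groups.
values, read = scan(row_groups, 55, 100)
```

The first row group's maximum (45) proves it cannot satisfy the predicate, so the reader never opens it. This is the same principle a real query engine applies using the statistics embedded in Parquet or ORC files.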
Open table formats add an abstraction layer to a data lake’s sparse metadata, creating warehouse-like storage with structured and unstructured data. Table formats define the schema and partitions of every table and describe the files they contain. By providing a separate reference for this information, tables let queries avoid opening every file’s header and instead go to the most relevant files. Delta Lake and Apache Iceberg are commonly used open table formats.
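A hedged sketch of that separate metadata reference, simplified well below real Iceberg or Delta Lake internals: the table format keeps a manifest of data files with their partition values, so the planner selects files before opening any of them.

```python
# Illustrative sketch (not actual Iceberg/Delta internals): a table format keeps
# a manifest describing each data file's partition, so a query planner can pick
# the relevant files from metadata alone. Paths are invented for the example.

manifest = [
    {"path": "s3://lake/orders/date=2024-01-01/f1.parquet", "partition": "2024-01-01"},
    {"path": "s3://lake/orders/date=2024-01-02/f2.parquet", "partition": "2024-01-02"},
    {"path": "s3://lake/orders/date=2024-01-03/f3.parquet", "partition": "2024-01-03"},
]

def plan_files(manifest, wanted_partition):
    """Prune the scan to matching files using metadata only, opening no file headers."""
    return [f["path"] for f in manifest if f["partition"] == wanted_partition]

files = plan_files(manifest, "2024-01-02")
```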
Open compute engines are the lakehouse components that elevate big data analytics far beyond a conventional warehouse’s capabilities. Designed for massive parallelization in cloud environments, these compute engines can process large datasets quickly while balancing compute costs. As a result, lakehouses can support streaming ingestion and near real-time analytics, at the same time giving data consumers access to database-like functionality. Frequently used open compute engines include Trino and Apache Spark.
Many organizations use Trino with their data lakehouses because its massively parallel, distributed SQL query engine provides a unique combination of performance, cost-effectiveness, and accessibility. Facebook originally developed Trino (under the name Presto) to improve query performance in the Hadoop ecosystem, but it now works with multiple data platforms.
Common Trino use cases include:
Unlike the proprietary SQL dialects of data warehouses, Trino uses ANSI-standard SQL to maximize analytics accessibility. SQL-compatible business intelligence applications like Tableau can easily return data from the lakehouse. Exploration and data extraction become easier when scientists can write standard SQL statements in Python or other programming languages. Engineers can quickly develop dashboards and other data products for the least technical users. Up and down the organization, Trino lets data consumers conduct interactive analysis of large datasets.
Trino connectors eliminate data silos by federating data sources across the company so a single Trino query can access data lakes, relational databases, and streaming sources. Besides streamlining query development, this federation lets data teams optimize lakehouse storage for frequently accessed data without isolating potentially valuable data held in other systems.
Parallelization at scale and query optimizations make Trino an ideal solution for performing big data analytics on object storage. Trino can push queries down to source systems to reduce compute costs and leverage the source’s indexes. Dynamic filtering lets Trino skip data that the query would end up filtering. And a cost-based optimizer distributes compute loads to balance cost with performance.
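Dynamic filtering can be sketched in a few lines of plain Python (a simplification of what Trino's engine actually does): keys collected from the small "build" side of a join become a filter that lets the engine skip whole partitions of the large "probe" side.

```python
# Illustrative sketch of dynamic filtering (simplified; not Trino's implementation):
# keys from the small side of a join are used to skip data on the large side
# before it is fully scanned. All data values here are invented.

customers = [  # small build side: we want customers in region "EU"
    {"id": 1, "region": "EU"}, {"id": 2, "region": "US"}, {"id": 3, "region": "EU"},
]
orders_partitions = {  # large probe side, partitioned by customer-id range
    (1, 2): [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}],
    (3, 4): [{"cust": 3, "amt": 30}, {"cust": 4, "amt": 40}],
    (5, 6): [{"cust": 5, "amt": 50}, {"cust": 6, "amt": 60}],
}

# The build side runs first; its key set becomes the dynamic filter.
build_keys = {c["id"] for c in customers if c["region"] == "EU"}  # {1, 3}

scanned, joined = 0, []
for (lo, hi), rows in orders_partitions.items():
    if not any(lo <= k <= hi for k in build_keys):  # partition cannot match: skip it
        continue
    scanned += 1
    joined.extend(r for r in rows if r["cust"] in build_keys)
```

Only two of the three partitions are scanned; the third is proven irrelevant by the filter built at runtime, which is data the query would otherwise have read and then discarded.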
Data ingestion and other workflows that require complex ETL pipelines typically run in overnight batches because they take so long and consume significant resources. Trino accelerates and simplifies these pipelines by letting engineers use standard SQL statements within a single system to query multiple data sources. Besides streamlining pipelines, faster batch processing speeds an ad hoc research project’s time to insight.
Apache Iceberg is a highly performant open table format that brings data warehousing functionality to a data lake’s object storage and integrates with modern query engines like Trino.
Developers at Netflix created Iceberg’s table format to address the challenges of working with Hive. Basic data management, like deletes and updates, had become increasingly difficult, as had meeting data governance demands. A single change required overwriting huge datasets.
Open table formats like Iceberg can better meet modern analytics needs than older technologies like Hive. Some benefits include:
An Iceberg table’s catalog provides a central starting place for queries to find metadata without accessing files individually. The catalog points to table metadata files where the query can find schema, file metadata, and other information.
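That chain of references can be sketched as a simple lookup. The structure below mirrors the general shape of Iceberg's metadata hierarchy (catalog, metadata file, manifests, data files), but the file names and contents are invented and the real format carries much more detail.

```python
# Simplified sketch of Iceberg's metadata chain (shape only; names invented):
# catalog -> table metadata file -> manifests -> data files.

catalog = {"db.orders": "metadata/v3.metadata.json"}  # catalog maps table to metadata
metadata_files = {
    "metadata/v3.metadata.json": {
        "schema": ["order_id: long", "price: double"],
        "manifests": ["manifests/m1.avro"],
    },
}
manifests = {
    "manifests/m1.avro": ["data/f1.parquet", "data/f2.parquet"],
}

def resolve(table):
    """Follow the metadata chain to find a table's schema and its data files."""
    meta = metadata_files[catalog[table]]
    files = [f for m in meta["manifests"] for f in manifests[m]]
    return meta["schema"], files

schema, files = resolve("db.orders")
```

A query engine walks this chain once per query, so it never has to list object storage or open individual file headers just to discover what the table contains.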
Open table formats like Iceberg leave access control to the open compute engines, which leverage table metadata to secure access, protect data, and apply privacy rules.
Open table formats help eliminate vendor lock-in and make data more portable. They work with multiple compute engines, so companies are not tied to one vendor’s product.
Open table formats are also less sensitive to how data changes over time. Schema evolution lets these tables evolve without requiring massive — and expensive — rewrites.
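A minimal sketch of why schema evolution avoids rewrites, loosely modeled on Iceberg's field-ID approach (the mechanics here are illustrative, not Iceberg's actual implementation): columns are resolved by a stable ID rather than by position, so adding a column only changes table metadata, and readers fill it with null for files written earlier.

```python
# Illustrative sketch of metadata-only schema evolution (in the spirit of
# Iceberg field IDs; simplified). Old data files are never rewritten.

old_schema = {1: "order_id", 2: "price"}
new_schema = {1: "order_id", 2: "price", 3: "discount"}  # column added later

old_file_rows = [{1: 100, 2: 9.99}]          # written before the schema change
new_file_rows = [{1: 101, 2: 5.00, 3: 0.5}]  # written after the schema change

def read(rows, schema):
    """Resolve columns by field ID; fields absent from a file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]

table = read(old_file_rows, new_schema) + read(new_file_rows, new_schema)
```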
Trino does more than power the analytics capabilities of data lakehouses. By federating data sources beyond the lakehouse’s object storage, Trino makes an enterprise’s entire data infrastructure part of the data lakehouse to create a virtual repository anyone can access.
Starburst Galaxy uses Trino to improve modern data lakehouse architectures, becoming a single point of access to all data formats stored in any data source. Galaxy’s enhancements, like smart indexing and caching, accelerate query performance. Granular access controls provide the fine-grained governance enforcement needed to deliver a self-service model that makes data both more accessible and more secure.
Starburst Galaxy lets companies implement Trino on their data lakehouses without worrying about the open-source software’s operational aspects.