Unlike text-based formats, binary big data file formats such as Avro, ORC, and Parquet optimize reading and writing workloads for distributed petabyte-scale datasets while reducing storage costs in cloud-based data lakes and open data warehouses. Open file formats significantly impact query performance and the effectiveness of enterprise analytics.
Last updated: January 29, 2024
This article will introduce the three primary big data file formats — Avro, ORC, and Parquet — and explain how the right format drives efficient, performant open data warehouses and data lakehouses.
An open file format is a specification for the way data gets written to storage. Designed for large-scale, distributed storage systems, an open file format stores data in ways that reduce reading or writing overhead while providing efficient data storage. With open source, companies can avoid vendor lock-in and make their data more portable.
Unlike human-readable text formats like JSON or CSV, open file formats create machine-readable binary files that are more efficient to store and process. The way a file format records its data determines how efficiently it uses storage and the data’s accessibility.
To help understand how file formats store data, let’s consider a table of baseball stats:
A comma-delimited CSV file would store the data like this:
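Reconstructing the values from the row-based example shown later in the article, the CSV contents might look like the following (the column names other than "Hits" are assumptions for illustration):

```csv
Player,Name,Games,At Bats,Runs,Hits
1,John Doe,110,432,98,134
2,Alan Buck,85,261,34,72
3,George Deere,32,57,20,21
```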
Text files like these are easy to create and manage for small datasets, such as when transferring data to and from an Excel spreadsheet. However, storing data this way becomes untenable as data volumes grow.
Transactional data processing systems record financial transactions, web traffic, and other operational activity. These systems must reliably read and write data in real time. For that, they require a file format optimized for record processing.
The data associated with each record is stored next to each other. In the case of our baseball stats, row-based storage might look like this:
1,John Doe,110,432,98,134,2,Alan Buck,85,261,34,72,3,George Deere,32,57,20,21
Although harder for us to read, this approach streamlines record retrieval. The system can go to the storage location of the record’s first data element and read the rest sequentially.
The data labels are typically kept separately in the file’s header or footer, along with other metadata that define the file’s schema.
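To make this concrete, here is a minimal pure-Python sketch of a row-oriented binary layout — not any real format's actual encoding. Each record's fields are packed next to each other, and the schema (illustrative field names) lives in a JSON footer whose length is written at the very end so a reader can find it first:

```python
import json
import struct

def write_rows(rows):
    # Pack each record contiguously: a 2-byte name length, the UTF-8 name
    # bytes, then four 32-bit integers (games, at-bats, runs, hits).
    body = b""
    for name, *stats in rows:
        encoded = name.encode("utf-8")
        body += struct.pack("<H", len(encoded)) + encoded
        body += struct.pack("<4i", *stats)
    footer = json.dumps(
        {"fields": ["name", "games", "at_bats", "runs", "hits"]}
    ).encode()
    # The footer length at the end lets a reader locate the schema first.
    return body + footer + struct.pack("<I", len(footer))

def read_rows(buf):
    # Read the schema from the footer, then scan records sequentially.
    (footer_len,) = struct.unpack("<I", buf[-4:])
    schema = json.loads(buf[-4 - footer_len:-4])
    rows, pos = [], 0
    end = len(buf) - 4 - footer_len
    while pos < end:
        (n,) = struct.unpack_from("<H", buf, pos); pos += 2
        name = buf[pos:pos + n].decode("utf-8"); pos += n
        stats = struct.unpack_from("<4i", buf, pos); pos += 16
        rows.append((name, *stats))
    return schema, rows

data = write_rows([("John Doe", 110, 432, 98, 134),
                   ("Alan Buck", 85, 261, 34, 72)])
```

Reading one record means jumping to its first byte and consuming the rest sequentially — fast for record retrieval, but every field of every record must be touched to answer a single-column question.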
Although writing and reading row-based files happen incredibly quickly, they do not work as well in an analytics context since the system must load every record to run a query. For example, calculating a team’s total number of hits requires reading each player’s record into memory and extracting that player’s hit data. That workload is trivial at the scale of a baseball team, but imagine instead you are analyzing a month of financial transactions at a global bank. Poor query performance is the main reason companies use data warehouses rather than transactional databases for business analysis.
Data warehouses can’t match transactional databases for raw speed, but they make up for it in query efficiency. Their file formats contribute to this efficiency by using a column-oriented format to store data for each column next to each other like this:
John Doe, Alan Buck, George Deere
Queries usually process data within a table’s columns. Queries can locate and read data quickly when files use a columnar format. To calculate the average number of hits in our team’s data, a query simply extracts the “Hits” column and does the math.
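A simple sketch shows why this helps. With the table held column-by-column, an aggregate over "Hits" never touches names or any other column (the values come from the article's example data):

```python
# Column-oriented layout: each column's values are stored together.
columns = {
    "name": ["John Doe", "Alan Buck", "George Deere"],
    "hits": [134, 72, 21],
}

# A query touching only "hits" reads a single contiguous column.
total_hits = sum(columns["hits"])
average_hits = total_hits / len(columns["hits"])
```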
Rows may hold text strings, integers, boolean values, and other data types within the same record. That diversity limits the available compression techniques and how much space they can squeeze from the records. Columnar storage formats, on the other hand, group data of the same type, allowing more efficient compression schemes.
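You can observe this effect with a general-purpose compressor. In this sketch (synthetic data, zlib standing in for a format's codec), the same table is serialized once row-by-row and once column-by-column; grouping identical values together typically yields a smaller compressed size:

```python
import zlib

# Synthetic table: a low-cardinality "team" column interleaved with ids.
rows = [(i, ["Cubs", "Mets", "Reds"][i % 3]) for i in range(1000)]

# Row-oriented serialization interleaves types within each record;
# column-oriented serialization groups all values of a column together.
row_bytes = "\n".join(f"{i},{team}" for i, team in rows).encode()
col_bytes = (
    ",".join(str(i) for i, _ in rows)
    + "\n"
    + ",".join(team for _, team in rows)
).encode()

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
# The columnar ordering tends to compress smaller because the repetitive
# "team" values sit adjacent to one another.
```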
Data engineering teams must evaluate their use cases before choosing a file format. For example, what is the frequency of reads and writes? A warehouse’s data gets written once upon arrival through ETL data pipelines. After that, data may get read frequently but rarely gets re-written. By contrast, an operational system’s relational database must handle extreme read and write volumes.
As we’ve discussed, the way columnar file formats store similar data together makes them inherently query-friendly. But how a file gets written to storage is not the only way a format speeds queries. The design of modern file formats helps screen irrelevant data so the query can process only the data it needs.
File formats include metadata that detail the file schema, data structure, storage locations, and other information about the data they hold. Self-describing files provide queries with the information needed to retrieve the data they contain without referencing a separate database.
Query engines like Apache Spark can use this file metadata in predicate pushdown filters to reduce the data they must process. A query looking for data generated within a specific date range would use predicate pushdown to ignore files whose contents fall outside that range.
How files are written can further enhance a query’s ability to skip irrelevant data. Sorting data before writing will group it in ways that speed pushdown queries. For example, sorting by date will make it easier to ignore out-of-range data.
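The mechanics can be sketched in a few lines. This is an illustrative model, not any engine's actual API: each file's metadata records the min and max of a date column, and a date-range query skips every file whose range cannot overlap the predicate:

```python
import datetime as dt

# Hypothetical per-file metadata, as a file footer might record it.
files = [
    {"path": "events_01", "min_date": dt.date(2024, 1, 1), "max_date": dt.date(2024, 1, 31)},
    {"path": "events_02", "min_date": dt.date(2024, 2, 1), "max_date": dt.date(2024, 2, 29)},
    {"path": "events_03", "min_date": dt.date(2024, 3, 1), "max_date": dt.date(2024, 3, 31)},
]

def files_to_scan(files, start, end):
    # Predicate pushdown: keep a file only if its [min, max] range can
    # overlap the query's date predicate; everything else is never read.
    return [f["path"] for f in files
            if f["max_date"] >= start and f["min_date"] <= end]

selected = files_to_scan(files, dt.date(2024, 2, 10), dt.date(2024, 2, 20))
```

Sorting by date before writing is what makes these min/max ranges narrow and non-overlapping, so more files can be eliminated.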
After finding and reading the data, a query must decompress it before processing. Columnar file formats can use the most efficient encoding schemes for each column of data, allowing query engines to extract the data quickly.
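Dictionary encoding is a common example of such a per-column scheme. This simplified sketch (real formats like Parquet implement this natively) stores each distinct value once and replaces the column's values with small integer codes:

```python
def dict_encode(values):
    # Store each distinct value once; emit an integer code per row.
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    # Decoding is a cheap lookup, so queries extract values quickly.
    return [dictionary[c] for c in codes]

teams = ["Cubs", "Mets", "Cubs", "Reds", "Mets", "Cubs"]
dictionary, codes = dict_encode(teams)
```

A low-cardinality string column shrinks to a short dictionary plus a stream of small integers, which compresses far better than the raw strings.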
Finally, open file formats are designed for today’s distributed cloud storage systems. They can be split across many disks, allowing query engines to use parallel processing at massive scales.
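A toy illustration of that parallelism: because a splittable file's chunks can be scanned independently, each worker aggregates its own split and the partial results are combined at the end (the split size here is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a numeric column spread across four independent splits.
hits_column = list(range(1, 101))
splits = [hits_column[i:i + 25] for i in range(0, 100, 25)]

def scan_split(split):
    # Each worker aggregates only its own split.
    return sum(split)

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(scan_split, splits))
```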
Apache Avro is a row-based file format best suited for write-intensive operations or when data formats may change over time. Avro serializes data in binary, allowing for cost-effective storage of large amounts of data. However, this self-describing format uses human-readable JSON to store its schema, making Avro files easier to manage in Python and other languages with JSON libraries.
Of the three main file formats, Avro has the strongest support for schema evolution. If data changes over time, schema changes like adding or modifying fields are easy to implement.
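Avro does this by resolving the writer's schema against the reader's schema at read time. The following is a simplified pure-Python sketch of one resolution rule — a new field with a declared default — not Avro's actual implementation (libraries such as fastavro or the official avro package handle the full rules):

```python
# Records written before the "stolen_bases" field existed (hypothetical field).
writer_record = {"name": "John Doe", "hits": 134}

# The reader's schema adds a field with a default, so old records stay readable.
reader_schema = {
    "fields": [
        {"name": "name"},
        {"name": "hits"},
        {"name": "stolen_bases", "default": 0},  # new field, defaulted
    ]
}

def resolve(record, schema):
    # Fill any field missing from the written record with its declared default.
    return {f["name"]: record.get(f["name"], f.get("default"))
            for f in schema["fields"]}

resolved = resolve(writer_record, reader_schema)
```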
Avro’s superior write performance makes it useful for ingesting data into a data lake and for stream-processing platforms like Kafka.
Optimized Row Columnar (ORC) replaced Hadoop’s original Record Columnar File (RCFile) format to improve Hive’s SQL-like query service. This column-based format supports Hive’s data types but is not dependent on the Hive Metastore. As a result, ORC is a default choice for data warehouses that use Hive for analytical querying. ORC also supports ACID transactions and snapshot isolation.
The ORC file format is often the better choice when compression is critical. Its lightweight compression techniques write compact files, which libraries like Snappy can make even smaller.
The columnar Apache Parquet file format is another member of the Hadoop ecosystem. Not wanting to play favorites, Parquet’s developers designed a storage format that would work with any Hadoop processing framework. Parquet supports complex nested data structures, schema evolution, as well as efficient data compression and encoding schemes.
Parquet is better at handling complex data structures, including wide table formats and nested data.
With highly efficient data compression, Parquet files are better in an analytics system’s write-once, read-many conditions.
ORC was designed to enhance Hive data warehouses. Parquet is more flexible, so engineers can use it in other architectures.
The choice of file format becomes most relevant when breaking free from proprietary data warehouse solutions and developing an open data warehouse on a data lake’s cost-effective object storage. This approach eliminates vendor lock-in and provides a more affordable path toward scalable big data analytics based on three open-source elements: a file format (Parquet), a table format (Iceberg), and a query engine (Trino).
A data analytics stack based on Parquet, Iceberg, and Trino creates a more robust, flexible, and affordable big data analytics platform than conventional data warehouses.
Trino queries use ANSI standard SQL, making data directly accessible to advanced and less technical users and reducing their reliance on data engineering teams.
Trino also federates data from multiple enterprise sources, so queries are not limited to the structured data of a conventional warehouse. Besides providing more optionality for data consumers, using Trino SQL queries in a federated architecture streamlines data engineering workloads by replacing many ETL pipelines entirely with SQL queries.
Starburst includes everything you need to install and run Trino on a single machine, a cluster of machines, or even your laptop.