×
×

Understanding open file formats for data analytics

An open file format is one of the cornerstones of a modern data analytics architecture.

Unlike text-based formats, these binary formats optimize reading and writing workloads of distributed petabyte-scale datasets while reducing storage costs in cloud-based data lakes and open data warehouses. Open file formats significantly impact query performance and the effectiveness of enterprise analytics. 

Data Lake BlogsIceberg and Trino

Modern Data Lakes For Dummies

Data Mesh Book Cover

Get your free copy

Last updated: January 29, 2024

This article will introduce the three primary big data file formats — Avro, ORC, and Parquet — and explain how the right format drives efficient, performant open data warehouses or an open data lakehouse.

What are open file formats? 

An open file format is a specification for the way data gets written to storage. Designed for large-scale, distributed storage systems, an open file format stores data in ways that reduce reading or writing overhead while providing efficient data storage. With open source, companies can avoid vendor lock-in and make their data more portable.

Columnar format vs. row-based | Row-oriented vs. column-oriented file formats

Unlike human-readable text formats like JSON or CSV, open file formats create machine-readable binary files that are more efficient to store and process. The way a file format records its data determines how efficiently it uses storage and the data’s accessibility.

To help understand how file formats store data, let’s consider a table of baseball stats:

Line Player Name Games At-Bats Runs Hits
1 John Doe 110 432 98 134
2 Alan Buck 85 261 34 72
3 George Deere 32 57 20 21

A comma-delimited CSV file would store the data like this

Line,Player Name,Games,At-Bats,Runs,Hits

1,John Doe,110,432,98,134

2,Alan Buck,85,261,34,72

3,George Deere,32,57,20,21

Text files like these are easy to create and manage for small datasets, such as when transferring data to and from an Excel spreadsheet. However, storing data this way becomes untenable as data volumes grow.

Row-based formats

Transactional data processing systems record financial transactions, web traffic, and other operational activity. These systems must reliably read and write data in real-time. For that, they require a file format optimized for record processing.

The data associated with each record is stored next to each other. In the case of our baseball stats, row-based storage might look like this:

1,John Doe,110,432,98,134,2,Alan Buck,85,261,34,72,3,George Deere,32,57,20,21

Although harder for us to read, this approach streamlines record retrieval. The system can go to the storage location of the record’s first data element and read the rest sequentially.

The data labels are typically kept separately in the file’s header or footer, along with other metadata that define the file’s schema.

Although writing and reading row-based files happen incredibly quickly, they do not work as well in an analytics context since the system must load every record to run a query. For example, calculating a team’s total number of hits requires reading each player’s record into storage and extracting that player’s hit data. That workload is trivial at the scale of a baseball team, but imagine instead you are analyzing a month of financial transactions at a global bank. Poor query performance is the main reason companies use data warehouses rather than transactional databases for business analysis.

Column-based formats

Data warehouses can’t match transactional databases for raw speed, but they make up for it in query efficiency. Their file formats contribute to this efficiency by using a column-oriented format to store data for each column next to each other like this:

John Doe, Alan Buck, George Deere

110,85,32

432,261,57

98,34,20

134,72,21

Queries usually process data within a table’s columns. Queries can locate and read data quickly when files use a columnar format. To calculate the average number of hits in our team’s data, a query simply extracts the “Hits” column and does the math.

Row groups may hold text strings, integers, boolean values, and other data types within the same record. That diversity limits the available compression techniques and how much space they can squeeze from the records. Columnar storage formats, on the other hand, group data of the same type, allowing more efficient compression schemes.

How to choose a file format?

Data engineering teams must evaluate their use cases before choosing a file format. For example, what is the frequency of reads and writes? A warehouse’s data gets written once upon arrival through ETL data pipelines. After that, data may get read frequently but rarely gets re-written. By contrast, an operational system’s relational database must handle extreme read and write volumes.

How file formats influence query speed

As we’ve discussed, the way columnar file formats store similar data together makes them inherently query-friendly. But how a file gets written to storage is not the only way a format speeds queries. The design of modern file formats helps screen irrelevant data so the query can process only the data it needs.

File formats include metadata that detail the file schema, data structure, storage locations, and other information about the data they hold. Self-describing files provide queries with the information needed to retrieve the data they contain without referencing a separate database. 

Query engines like Apache Spark can use this file metadata in predicate pushdown filters to reduce the data it must process. A query looking for data generated within a specific date range would use predicate pushdown to ignore files whose contents are outside that range.

How files are written can further enhance a query’s ability to skip irrelevant data. Sorting data before writing will group it in ways that speed pushdown queries. For example, sorting by date will make it easier to ignore out-of-range data.

After finding and reading the data, a query must decompress it before processing. Columnar file formats can use the most efficient encoding schemes for each column of data, allowing query engines to extract the data quickly.

Finally, open file formats are designed for today’s distributed cloud storage systems. They can be split across many disks, allowing query engines to use parallel processing at massive scales.

Avro open file format

Apache Avro is a row-based file format best suited for write-intensive operations or when data formats may change over time. Avro data serialization is binary, allowing for cost-effective storage of large amounts of data. However, this self-describing format uses easily-ready JSON to store its schema, making Avro files easier to manage in Python and other languages with JSON libraries.

Of the three main file formats, Avro has the strongest support for schema evolution. If data changes over time, schema changes like adding or modifying fields are easy to implement.

Avro’s superior write performance makes it useful for ingesting data into a data lake and for stream-processing platforms like Kafka.

ORC open file format

Optimized Row Columnar (ORC) replaced Hadoop’s original Record Columnar File (RCFile) format to improve Hive’s SQL-like query service. This column-based format supports Hive’s data types but is not dependent on the Hive Metastore. As a result, ORC is a default choice for data warehouses that use Hive for analytical querying. ORC also supports ACID transactions and snapshot isolation.

The ORC file format is often the better choice when compression is critical. Its lightweight compression techniques write compact files, which libraries like Snappy can make even smaller. 

Parquet open file format

The columnar Apache Parquet file format is another member of the Hadoop ecosystem. Not wanting to play favorites, Parquet’s developers designed a storage format that would work with any Hadoop processing framework. Parquet supports complex nested data structures, schema evolution, as well as efficient data compression and encoding schemes.

What are the advantages of using Parquet over other file formats?

Parquet is better at handling complex data structures, including wide table formats and nested data.

With highly efficient data compression, Parquet files are better in an analytics system’s write-once, read-many conditions.

ORC was designed to enhance Hive data warehouses. Parquet is more flexible, so engineers can use it in other architectures.

Data warehouse vs. open data warehouse with a data lake

The choice of file format becomes most relevant when breaking free from proprietary data warehouse solutions and developing an open data warehouse on a data lake’s cost-effective object storage. This approach eliminates vendor lock-in and provides a more affordable path toward scalable big data analytics based on three open-source elements:

  • Open query engines like Trino.
  • Open table formats like Iceberg.
  • Open file formats like Parquet.

These elements provide the framework for building an analytics architecture on commodity object storage services like Amazon S3 or Azure Blob Storage.

Open file formats, Iceberg, and Trino for data engineers

A data analytics stack based on Parquet, Iceberg, and Trino creates a more robust, flexible, and affordable big data analytics platform than conventional data warehouses. 

Trino queries use ANSI standard SQL, making data directly accessible to advanced and less technical users and reducing their reliance on data engineering teams.

Trino also federates data from multiple enterprise sources, so queries are not limited to the structured data of a conventional warehouse. Besides providing more optionality for data consumers, using Trino SQL queries in a federated architecture streamlines data engineering workloads by replacing many ETL pipelines entirely with SQL queries.

Get started with Starburst

 

Install anywhere

Starburst includes everything you need to install and run Trino on a single machine, a cluster of machines, or even your laptop.

Download Free

Cloud-native, frictionless, and fully managed. The fastest path from big data to better decisions.

Start Free

Marketplace offerings

Try Starburst in your preferred marketplace

 

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.