
What is the Hive connector, and what is its purpose? On the surface, it's simple: the Hive connector connects Starburst clusters to data organized in the Hive format, which differs from other table formats such as Apache Iceberg. This article unpacks the Hive connector, explains why it still matters even in the age of AI, and answers the questions you might have.
Where should you start with the Hive connector?
One source of confusion stems from the complex Hive model and the connector’s overlapping use cases. Starburst is rightfully associated with Apache Iceberg. Many users come to Starburst because they are trying to transition from Hive to Iceberg or are moving away from Hadoop, Spark, or Hive infrastructure in some way. In fact, the genesis of our core technology, Trino (formerly PrestoSQL), stemmed from slow Hive queries at Facebook back in 2012.
So when you learn that Starburst and Trino have a Hive connector, it can sometimes be confusing. Isn’t Hive the thing you’re moving away from? The answer to this is that Starburst is first and foremost built on choice. And part of that choice involves Hive. The truth is, although we believe in Iceberg most of all and built our Icehouse architecture around it, Hive still runs many workloads. This is true, even today, in the agentic era.
Hive is legacy tech, but it isn’t dead just yet. Let’s have a look.
Hive architecture explained
To understand why Hive might still be used for some workloads today, let's look at the origins and inner workings of the Hive connector. It helps to first know a few high-level components of the Hive architecture.

Hive architecture consists of four components
There are four components to the Hive architecture. Let’s look at each of them in turn.
Runtime
The runtime contains the query engine’s logic that translates SQL-like Hive Query Language (HQL) into MapReduce jobs that run over files stored on the filesystem.
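As an illustration (the `orders` table and its columns are invented for this sketch), an HQL query reads almost exactly like standard SQL; the difference is what the runtime does with it:

```sql
-- HQL reads like SQL, but Hive's runtime compiles it into one or
-- more MapReduce jobs that scan the files under the table's directory.
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;
```

That compilation step is precisely what makes classic Hive queries slow: even a simple aggregation pays the startup and shuffle costs of a batch MapReduce job.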
Storage
The storage component is just that: it stores files in various formats, along with index structures that allow them to be recalled. The file formats range from simple ones such as JSON and CSV to columnar formats such as ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed File System (HDFS).
As cloud-based options became more prevalent, object storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage increasingly replaced HDFS as the storage component.
Metastore
For Hive to process these files, it must have a runtime mapping from SQL tables to files and directories in the storage component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to the metastore, to manage metadata about files, such as table columns, file locations, and file formats.
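To make this concrete, here is a hedged sketch of a Hive DDL statement (the table name, columns, and bucket path are all invented). Everything in it except the rows themselves is exactly the kind of metadata the HMS tracks:

```sql
-- The metastore records this schema, the ORC format, the partition
-- keys, and the storage location; the data files themselves stay put.
CREATE EXTERNAL TABLE orders (
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
PARTITIONED BY (order_date STRING)
STORED AS ORC
LOCATION 's3://example-bucket/warehouse/orders/';
```

When a query references `orders`, the engine asks the metastore which directories and file formats to read; it never scans the object store blindly.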
Data organization specification
The last component not included in the image is Hive’s data organization specification. The documentation for this element exists only in the Hive code and has been reverse-engineered for use by other systems, such as Trino, to ensure compatibility.
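The best-known part of that unwritten specification is the key=value partition directory layout. A table partitioned by an `order_date` column, for example, lays its files out roughly like this (bucket and file names are invented):

```
s3://example-bucket/warehouse/orders/
├── order_date=2024-01-01/
│   └── 000000_0.orc
└── order_date=2024-01-02/
    └── 000000_0.orc
```

Any engine that understands this convention can prune partitions and read the data directly, with no Hive runtime involved.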
Trino reuses all of these components except the runtime. This is the same approach most compute engines take when dealing with data in object stores, including Spark, Drill, and Impala. When you think of the Hive connector, think of a connector capable of reading data organized according to the unwritten Hive specification.
Starburst runtime replaces Hive runtime
There’s also a runtime aspect, and this is where Starburst and Trino overlap with Hive. In the early days of big data systems, long query turnaround times were expected because of the high volume of unstructured data in ETL workloads. The primary goal of early iterations of these systems was simply high throughput over large volumes of data with fault tolerance. Today, most businesses want fast, interactive queries over their data rather than jobs that take hours and may produce undesirable results, and many also want to run AI workloads. In both cases, companies often have petabytes of data and metadata in their data warehouse. All that data, all that AI context, is important and needs to be handled properly.
Data in storage is cumbersome to move, and the data in the metastore takes a long time to repopulate in other formats. Since only the runtime that executes Hive queries needs replacement, the Trino engine uses the existing metastore metadata and files in storage, and the Trino runtime effectively replaces the Hive runtime that analyzes the data.
Unpacking the Trino Architecture

The Hive connector nomenclature
Notice that the only change in the Trino architecture is the runtime. The HMS still exists, along with the storage. This is no accident: the design addresses a common problem faced by many companies by simplifying migration from Hive to Trino. Regardless of the storage component used, the runtime relies on the HMS, which is why this connector is called the Hive connector.
The confusion tends to come from searching for a connector named after the storage system you want to query. You may not even be aware that the metastore exists, let alone that it is a necessity. Typically, you look for an S3, GCS, or MinIO connector, but all you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.
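As a minimal sketch (the host name and region are placeholders, and exact property names vary across Trino and Starburst versions, so check the documentation for yours), a Hive catalog file points the connector at the metastore rather than at the object store:

```properties
# etc/catalog/hive.properties -- a minimal, hypothetical example
connector.name=hive
hive.metastore.uri=thrift://metastore.example.com:9083
fs.native-s3.enabled=true
s3.region=us-east-1
```

With a catalog like this in place, queries address tables as hive.schema.table; there is no separate S3 or GCS connector to install.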
The Hive Metastore Service
The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. It is actually a simple service that exposes a binary API over the Thrift protocol and persists metadata in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are also compatible drop-in replacements for the HMS, such as AWS Glue.
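For example, switching the same catalog from a Thrift HMS to AWS Glue is a small configuration change rather than a data migration (the region is a placeholder; property names may differ across versions):

```properties
# etc/catalog/hive.properties using Glue instead of a Thrift HMS
connector.name=hive
hive.metastore=glue
hive.metastore.glue.region=us-east-1
```

The tables, files, and partition layout in storage are untouched; only the source of the metadata changes.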
One last thing: while the Starburst and open-source Hive connectors share many similarities, Starburst has added several important features, such as Apache Ranger integration for role-based access control, and is tested against multiple Cloudera builds. See the Starburst documentation for more details.
Accessing the context trapped in legacy
All of this answers why Hive still matters in the age of AI. Essentially, it all comes down to context. We know that AI is only as effective as the data it can access and understand. For many of our largest customers, the most valuable business context, historical signals, deep operational logs, and multi-year customer trends live in Hadoop or Amazon S3 environments managed by the Hive Metastore. You cannot simply flip a switch and move that much data overnight.
The Starburst Hive connector is a critical tool for AI readiness because it allows you to activate that data where it lives. By replacing the slow Hive runtime with Starburst, you turn a stagnant archive into a live source of truth for your AI models. You get the speed of modern computing without the multi-year migration tax.
Legacy technology like Hive is a bridge, not a barrier
We exist in a state of transition. While we believe the Icehouse architecture is the ultimate destination for Enterprise Intelligence, the Hive connector is the bridge that ensures no context is left behind.
It allows you to maintain your existing metadata and storage structures while giving your agents and analysts the interactive performance they need today. In the race for context, you win by using everything you have. Starburst gives you the optionality to leverage your Hive foundation while you build toward the future of AI. This ensures that your data strategy is grounded in reality rather than just aspiration.
This article was originally posted on the Trino blog.
