Published: May 17, 2023
Author: Evan Smith, Technical Curriculum Developer, Starburst
Technical reviewer: Tom Nats, Modern Data Lakes For Dummies author
Apache Hive is a data warehouse system built on top of Apache Hadoop that enables data consumers to query and analyze large datasets stored in Hadoop’s distributed file system (HDFS) and other file systems.
Hive also provides a high-level query language called Hive Query Language (HiveQL), which closely resembles SQL. Hive was designed to make it easier for users familiar with SQL to work with big data by providing a familiar, SQL-like interface.
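To illustrate just how familiar HiveQL feels, here is a minimal query sketch. The table and column names (`web_logs`, `country`, `event_date`) are hypothetical, but the syntax is standard HiveQL:

```sql
-- Count page views per country from a hypothetical web_logs table.
-- An analyst who knows SQL can read and write this immediately;
-- behind the scenes, Hive compiles it into MapReduce jobs.
SELECT country, COUNT(*) AS views
FROM web_logs
WHERE event_date = '2023-05-01'
GROUP BY country;
```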
In this article, we will look at a few of the ways in which data lake technology has been enhanced using the Hive framework. Special attention will be paid to the way in which Hive is built on top of the Hadoop framework. We will also explore how the Hive server, Hive metastore, and Hive table format all work together to make Hadoop more accessible.
Finally, we will explore the limitations of Hive, and the reasons that it has largely been replaced by more modern query engines like Starburst. Let’s start by taking a closer look at Hive’s history.
The Hadoop framework brought the ability to distribute large computing jobs using parallel processing. Combined with the advent of cloud-based object storage, a technological revolution was under way.
But there was a problem. Hadoop was complex, especially for analytical tasks.
Creating MapReduce jobs required an intricate knowledge of Java that many users lacked. This gap would give birth to a new technology, Hive, which allowed users to interact with Hadoop by controlling MapReduce using SQL syntax. This was a game changing step as it opened up data lake analytics to a new audience and helped drive its adoption.
Apache Hive is a distributed framework built on top of Hadoop. It makes use of Hadoop’s processing engine, MapReduce, by providing a more user-friendly layer on top of it.
Hive enables Hadoop data lakes to be queried using a SQL-like interface. This allows data analysts familiar with SQL to easily query large datasets, even when the underlying data itself is not structured relationally.
To achieve this, data is queried using Hive, but held in HDFS. This data can be further segmented using partitioning and bucketing where appropriate. In this way, very large datasets are divided into smaller batches, allowing them to be processed in parallel using both Hive and Hadoop together.
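Partitioning and bucketing are both declared when a table is created. The sketch below uses hypothetical names (`web_logs`, `user_id`, `event_date`) to show the idea:

```sql
-- Hypothetical table partitioned by date and bucketed by user ID.
-- Each partition becomes its own HDFS directory; each bucket becomes a
-- file within it, so queries can prune partitions and process the
-- remaining data in parallel as smaller batches.
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING,
  country STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

A query that filters on `event_date` then only needs to read the matching partition directories, rather than the whole table.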
Importantly, Hive allows both structured and semi-structured data to be queried using a variation on SQL syntax known as HiveQL. To do this, queries written in SQL-like syntax are translated into Hadoop MapReduce jobs and executed in the same way as any other MapReduce workload. You can think of Hive as adding an extra layer of abstraction on top of the Hadoop architecture underlying it. This abstraction allows SQL to be used in lieu of hand-written Java MapReduce code.
This approach also enables structured and semi-structured data to be brought together using SQL, while taking advantage of Hadoop architecture. Because SQL has existed for many years, and is widely known and used across many industries, this increases accessibility and widens the pool of data consumers capable of working with the data.
Hive is able to analyze data primarily using the Hive server. This service takes user SQL input written in a specific form of SQL, known as HiveQL, and compiles it into a series of MapReduce jobs using Java. This approach allows the Hadoop systems underlying Hive to be controlled using SQL commands, improving user experience.
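One way to see this translation step at work is HiveQL’s `EXPLAIN` statement, which shows the plan Hive compiles instead of running the query (table name hypothetical):

```sql
-- Ask Hive to display the execution plan it would compile for a query.
-- For a GROUP BY like this, the plan contains a map stage (partial
-- counts computed per input split) followed by a reduce stage
-- (partial counts merged into final totals).
EXPLAIN
SELECT country, COUNT(*) AS views
FROM web_logs
GROUP BY country;
```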
At the same time, the process is inherently slow, as the translation of one command into another requires a large processing overhead. For this reason, Hive queries are often considered acceptable for large batch jobs, but particularly unsuitable for small, interactive queries.
Hive stores metadata in a service known as the Hive metastore. This separate subsystem ensures that Hive is able to locate the data held in a data lake. Importantly, because it is a separate system, the Hive metastore is often used independently of the Hive server as a way of storing and organizing metadata. Such an approach is used in Starburst, though it does not use the Hive server.
Importantly, the Hive metastore only stores folder names, not the names of the files inside them. This approach works well for HDFS, but is more time consuming when applied to object storage. This architectural choice represents one of the key drawbacks of Hive when using object storage.
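The metastore’s view of a table can be inspected directly from HiveQL. In this sketch the table name and path are hypothetical:

```sql
-- List the partitions the metastore knows about. Each partition maps to
-- a folder such as .../web_logs/event_date=2023-05-01/ in storage;
-- the individual files inside that folder are not tracked.
SHOW PARTITIONS web_logs;

-- Show the metadata the metastore holds for the table: storage
-- location, file format, SerDe, partition keys, and so on.
DESCRIBE FORMATTED web_logs;
```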
A table format specifies the way in which a table’s schema is mapped onto the actual files that store its data. Data held in Hive is structured according to the Hive table format.
Today, data lakes support more than one table format. The Hive table format is a legacy format; more recently, the Iceberg and Delta Lake table formats have seen increasing adoption. The table format is what distinguishes these technologies from one another.
In addition to a table format, the files that actually store the data have a format as well. The Hive table format supports many file formats. Two of the best-performing are ORC and Parquet, because they are columnar formats designed for analytical queries. However, other file formats like JSON and CSV are also supported.
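In HiveQL, the file format is chosen per table with the `STORED AS` clause. The table names below are hypothetical:

```sql
-- Columnar formats, well suited to analytical queries:
CREATE TABLE events_orc     (id BIGINT, payload STRING) STORED AS ORC;
CREATE TABLE events_parquet (id BIGINT, payload STRING) STORED AS PARQUET;

-- Text-based formats are also supported, e.g. CSV-style delimited text:
CREATE TABLE events_csv (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```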
Hive is popular and this popularity has had a large impact on its use in data lake solutions. Like the Hadoop system it is built upon, Hive has had a massive impact on the adoption of data lakes over the last decade. Despite this, Hive presents certain challenges.
Over time, a broad ecosystem of additional tools and components grew up around Hive. Although each of these was intended to extend user capabilities, navigating this ecosystem was often both confusing and complex for many users.
These limitations arise from the need to compile SQL queries into MapReduce jobs using Java. Although the use of HiveQL allows users to write queries in a familiar language, the act of translating that language into a machine-readable form adds an unavoidable processing overhead to every Hive query. This makes Hive particularly unsuitable for ad hoc queries, which are often necessary for many organizations.
The Hive table format does not allow query engines to query specific files individually without also querying all of the other files in that folder. This is also true when querying object stores. Although you can query specific objects using Hive, finding those objects requires listing all of the objects in the object store. This limitation was inherited from Hadoop, and it creates a significant processing overhead, as listing the contents of a large object store can take a long time.
So far, we’ve seen how Hive was built on top of HDFS to provide SQL-like query functionality. This approach had many limitations owing to the compilation process needed to turn HiveQL into MapReduce.
Starburst presents an alternative to Hive. Starburst is a data lake analytics platform that uses standard SQL to achieve high performance in the data lake, enabling a platform-independent, single point of access. Notably, both Starburst Galaxy and Starburst Enterprise support multiple data sources, and datasets can be housed in data lakes, data warehouses, or databases. Queries can be federated across multiple sources, providing a best-of-all-worlds approach.