Published: May 17, 2023
Author: Evan Smith, Technical Curriculum Developer, Starburst
Technical reviewer: Tom Nats, Modern Data Lakes For Dummies author
Apache Hadoop is an open source framework used to store and process large datasets. Its development was critical to the emergence of data lakes, and its widespread adoption helped drive the rise of big data as we know it today.
Although Hadoop-only installations have largely been superseded by newer technologies, many of those technologies are built on top of Hadoop's underlying framework. Hive, for example, is built entirely on top of Hadoop, and others make use of Hadoop components in different ways. For this reason, understanding Hadoop is important to understanding data lakes.
Here, we will learn why Hadoop was developed, and how its architecture operates. Special attention will be given to how Hadoop processes data within a data lake, and how this paved the way for subsequent processing engines.
Modern data lakes were made possible by the rise of Hadoop. As the need to store and process extremely large datasets grew, existing platforms and tools could not keep up. Data was changing, and the data lake, built on Hadoop, was about to play a central role in that evolution.
At its core, Hadoop allows complex queries to be executed in parallel using multiple processing units. This helps to increase efficiency by radically expanding the scale of available processing power.
Hadoop includes a number of key features that allow users to store and process very large datasets across distributed systems.
One of the key features of Hadoop, and what makes it ideal for data lake usage, is its ability to store and process datasets in large, distributed systems. To achieve this, Hadoop uses a technique known as parallel processing: a large workload is split into smaller operations, which are then executed simultaneously.
Related reading: Foundations of parallel processing
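To make the split/apply/combine idea concrete, here is a minimal, illustrative sketch in plain Java rather than Hadoop itself: a parallel stream divides a workload across the available cores, each piece is processed independently, and the partial results are combined into a single answer. Hadoop applies the same pattern, but across many machines instead of the cores of one.

```java
import java.util.List;

public class ParallelSketch {
    public static void main(String[] args) {
        // A stand-in "workload": a list of text records to be measured.
        List<String> records = List.of("alpha", "beta", "gamma", "delta");

        // The parallel stream splits the workload across available cores,
        // applies the per-record operation independently, then combines
        // the partial results -- the split/apply/combine idea behind Hadoop.
        int totalLength = records.parallelStream()
                .mapToInt(String::length)
                .sum();

        System.out.println("Combined result: " + totalLength);
    }
}
```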
Importantly, Hadoop offers both a storage and a compute solution. To achieve this, the system is composed of several parts, the two most important of which are HDFS and MapReduce.
Files are stored in Hadoop using the Hadoop Distributed File System (HDFS). It is designed to provide an efficient, reliable, and resilient way of accessing large volumes of data across multiple sources.
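As a rough illustration of how an application interacts with HDFS, the sketch below uses Hadoop's Java FileSystem API to copy a local file into the cluster and list a directory. The NameNode address and the paths are hypothetical placeholders; in a real deployment they would come from the cluster's configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the distributed file system.
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/data/raw/events.log"));

        // List what is stored under a directory; each file is split into
        // blocks that HDFS replicates across the cluster's data nodes.
        for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```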
Analysis is performed in Hadoop using a computational framework known as MapReduce. This Java-based framework executes workloads against data stored in HDFS. You can think of HDFS as the storage and MapReduce as the compute. To execute queries, complex MapReduce jobs must be written in Java and compiled before they can run on the cluster.
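The canonical example of such a job is word count, sketched below: the map phase emits a (word, 1) pair for every token in its input split, and the reduce phase sums the counts for each word. The input and output paths here are hypothetical; in practice the class would be compiled, packaged into a JAR, and submitted to the cluster with the hadoop jar command.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper reads a split of the input and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together and are summed.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the job's input and output.
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```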
Hadoop represented a major shift in industry practice, and it came with many advantages. These include:
Hadoop clusters could be deployed at a large scale to meet the growing data needs of organizations. More than anything else, this cemented its role in the industry.
Hadoop clusters were inexpensive, reducing cost as a barrier to implementation.
Hadoop was able to handle a wide range of data sources and formats. This versatility paralleled the advantage of data lakes more generally.
Hadoop could be deployed in many different scenarios and use cases.
Despite its many benefits, Hadoop also has certain limitations. These include:
Historically, Hadoop itself was considered inexpensive because it ran on commodity-class servers and storage rather than enterprise-class hardware. However, those servers still combined compute and storage to achieve performance gains, which often meant that a portion of the resources sat unused at any given time.
Today, cloud vendors allow businesses to separate compute and storage, so you pay only for the storage or compute resources actually consumed at a given moment. As a result, cloud data lakes are often considered less expensive than legacy Hadoop clusters.
Hadoop was developed primarily for on-premises installations, with cloud deployments added as an afterthought. As a result, it struggled to adjust to the rise of cloud computing and could not match the rapid scaling and elasticity of cloud-native platforms.
Related reading: See how cloud computing disrupted Hadoop as a technology and led to further innovation on top of it.
Hadoop is a powerful system, but it is also complex and demands deep expertise from its users. For instance, it required MapReduce jobs to be written by hand in Java. As a result, substantial training is needed to deploy, manage, and operate a cluster well, which often adds considerable cost to the business.
MapReduce struggled with certain analytic computing tasks, offering poor concurrency in many scenarios.
Where does all of this leave us? Hadoop should be understood as the foundational data lake technology. It created the distributed infrastructure necessary to store large amounts of raw data and analyze that data in new ways, and it set the direction for the industry, even as later platforms improved on its design by separating compute and storage.
Some subsequent technologies did not replace Hadoop, but built on top of it, often making use of the same underlying processes and frameworks.
Related reading: See how Hive was developed to sit on top of Hadoop as a query engine and interact with it in a more user-friendly way.
After a decade of running Hive queries on their data lakes, many companies are astonished at the speed at which they can query their existing Hive tables simply by replacing Hive with Starburst.
Object storage contrasts with HDFS, and should be understood as an alternative technology. Importantly, object storage is not a file system.
This is where Starburst has begun to play a valuable role. In some cases, our customers use Starburst to query data in Hadoop and in a data warehouse such as Oracle or Teradata at the same time.