Apache Hadoop is an open-source framework for distributed storage and processing of large datasets.

Its development was critical to the emergence of data lakes, and its widespread adoption helped drive the rise of big data as we know it today. Although Hadoop-only installations have now been superseded by newer technologies, many of those technologies were built on top of Hadoop's underlying framework. For instance, Hive is built entirely on top of Hadoop, and other technologies make use of Hadoop in other ways. For this reason, understanding Hadoop is essential to understanding data lakes.

Here, we will learn why Hadoop was developed and how its architecture operates. Special attention will be given to how Hadoop processes data within a data lake, and how this paved the way for subsequent processing engines.

What is Hadoop and why is it used?

History of the technology

The modern data lake was made possible by the rise of Hadoop. As the need to store and process extremely large datasets grew, existing platforms and tools could not meet demand. Data was changing, and the data lake, built on Hadoop, was about to play a central role in its evolution.

Hadoop features

At its core, Hadoop allows complex queries to be executed in parallel using multiple processing units. This helps to increase efficiency by radically expanding the scale of available processing power. 

Hadoop includes a number of key features, allowing users to:  

  • Collect large amounts of raw data in varying structures. 
  • Store this data in a single storage repository, creating a data lake for the first time. 
  • Easily scale storage technology at a low cost. 
  • Query data using a processing engine, completing complex operations using multiple nodes. 

Hadoop architecture

Parallel processing

One of the key features of Hadoop, and what makes it ideal for data lake usage, is its ability to store and process datasets in large, distributed systems. To achieve this, the system makes use of a processing technique known as parallel processing. In essence, parallel processing takes large workloads, splits them into smaller processing operations, and processes the workload in parallel. 
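The split-process-combine pattern behind parallel processing can be sketched in a few lines. This is an illustrative toy, not Hadoop itself: Hadoop distributes the chunks across many machines in a cluster, while this sketch uses a local thread pool simply to show the flow.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each worker handles one slice of the workload; here, a simple sum.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the large workload into smaller chunks...
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...process the chunks in parallel (Hadoop would spread these
    # across cluster nodes; a thread pool stands in here)...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # ...then combine the partial results into a final answer.
    return sum(partial_results)

print(parallel_sum(list(range(1000))))  # 499500, same as sum(range(1000))
```

The same answer comes back regardless of how many workers are used; only the wall-clock time changes, which is exactly the property Hadoop exploits at cluster scale.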

Related reading: Foundations of parallel processing

Hadoop structure

Importantly, Hadoop offers both a storage and compute solution. To achieve this, the system is composed of several parts. The two most important of these are: 

1. Hadoop Distributed File System (HDFS)

Files are stored in Hadoop using a file system known as HDFS. It is designed to provide an efficient, reliable, and resilient way of accessing large volumes of data across multiple sources. 
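HDFS achieves this resilience by splitting each file into fixed-size blocks and replicating every block across several machines, so a single failed disk does not lose data. The sketch below is a hypothetical simulation of that idea, not the HDFS API: real HDFS blocks default to 128 MB and placement is rack-aware, while this toy uses tiny blocks and simple round-robin placement.

```python
def split_into_blocks(data, block_size=8):
    # Split a file's bytes into fixed-size blocks (HDFS defaults to
    # 128 MB per block; 8 bytes here purely for illustration).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin.
    # (Real HDFS placement is rack-aware; this is a simplification.)
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

blocks = split_into_blocks(b"hello hadoop distributed file system")
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
# With three replicas, losing any one node still leaves two copies
# of every block, and the original file can be reassembled in order.
```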

2. MapReduce

Analysis is performed in Hadoop using a computational framework known as MapReduce. This Java-based framework executes workloads on data stored in HDFS. You can think of HDFS as the storage and MapReduce as the compute. To execute queries, developers had to write complex MapReduce jobs in Java and compile them into a machine-readable form.
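The classic MapReduce example is counting words. In Hadoop this would be written as a Java job against the MapReduce API; the Python sketch below only illustrates the three conceptual phases, map, shuffle, and reduce, which the framework runs in parallel across the cluster.

```python
from collections import defaultdict

def map_words(line):
    # Map phase: emit a (key, value) pair for every word in a record.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all values by key (in Hadoop, the framework
    # does this between the map and reduce stages).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    # Reduce phase: combine the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big storage", "big compute too"]
pairs = [pair for line in lines for pair in map_words(line)]
counts = reduce_counts(shuffle(pairs))
print(counts["big"])  # 3
```

Because each map call and each reduce call is independent, Hadoop can run them on different nodes, which is how the same pattern scales from two lines of text to petabytes.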

4 Strengths of Hadoop

Hadoop represented a major change to industry standards and it came with many advantages. These include: 

#1 Scalability

Hadoop clusters could be deployed at a large scale to meet the growing data needs of organizations. More than anything else, this cemented its role in the industry. 

#2 Cost

Hadoop clusters were inexpensive, reducing cost as a barrier to implementation. 

#3 Versatility

Hadoop was able to handle a wide range of data sources and structures. This advantage paralleled the advantage of data lakes more generally.

#4 Adaptability

Hadoop could be deployed in many different scenarios and use-cases. 

4 Limitations of Hadoop

Despite its many benefits, Hadoop also has certain limitations. These include: 

#1 Cost

Historically, Hadoop itself was considered inexpensive because it ran on commodity-class rather than enterprise-class servers and storage. However, these servers still coupled compute and storage to achieve performance, which often meant that some resources sat unused at any given time.

Today, cloud vendors allow businesses to separate compute and storage. This means that you only pay for the storage or compute resources being consumed at a given moment. As a result, data lakes are often considered to be less expensive than legacy Hadoop clusters. 

#2 Rigidity

Hadoop was developed primarily for on-premises installations, with cloud installations added as an afterthought. As a result, it struggled to adjust to the rise of cloud computing, being unable to match the rapid scaling and elasticity provided by those systems.

Related reading: See how cloud computing disrupted Hadoop as a technology and led to further innovation on top of it. 

#3 Complexity

Hadoop is a powerful system, but it is also extremely complex and demands significant expertise from its users. For instance, it required MapReduce jobs to be written manually in Java. As a result, significant user training is needed to optimize deployment, management, and operations, which can impose substantial costs on the business.

#4 MapReduce

MapReduce struggled with certain analytic workloads, particularly iterative and interactive queries, because each job writes its intermediate results to disk. It also offered poor concurrency in many scenarios.

Hadoop is the foundational data lake technology

Where does all of this leave us? Hadoop should be understood as the foundational data lake technology. It created the distributed infrastructure necessary to store large amounts of raw data and analyze that data in new ways, setting the direction for the industry and paving the way for the later separation of compute and storage.

Some subsequent technologies did not replace Hadoop, but built on top of it, often making use of the same underlying processes and frameworks. 

Related reading: See how Hive was developed to sit on top of Hadoop as a query engine and interact with it in a more user-friendly way.
