Hadoop's development was critical to the emergence of data lakes, and its widespread adoption helped drive the rise of big data as we know it today. Although Hadoop-only installations have since been superseded by newer technologies, many of those technologies were built on top of Hadoop's underlying framework. For instance, Hive is built entirely on top of Hadoop, and other technologies make use of Hadoop in other ways. For this reason, understanding Hadoop is essential to understanding data lakes.
Here, we will learn why Hadoop was developed, and how its architecture operates. Special attention will be given to how Hadoop processes data within a data lake, and how this paved the way for subsequent processing engines.
Modern data lakes were made possible by the rise of Hadoop. As the need to store and process extremely large datasets grew, existing platforms and tools were unable to keep pace. Data was changing, and the data lake, built on Hadoop, was about to play a central role in its evolution.
At its core, Hadoop allows complex queries to be executed in parallel using multiple processing units. This helps to increase efficiency by radically expanding the scale of available processing power.
Hadoop includes a number of key features that make this possible.
One of the key features of Hadoop, and what makes it ideal for data lake usage, is its ability to store and process datasets across large, distributed systems. To achieve this, the system relies on a technique known as parallel processing: a large workload is split into smaller tasks, and those tasks are executed simultaneously, as sketched below.
Related reading: Foundations of parallel processing
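To make the idea concrete, here is a minimal sketch of parallel processing in plain Java (Hadoop's own language), summing a large array by dividing it among worker threads. The worker count, array size, and class name are illustrative assumptions, not anything Hadoop-specific; Hadoop applies the same split-and-combine idea across machines rather than threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int workers = 4; // number of parallel processing units (illustrative)
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();
        int chunk = data.length / workers;

        // Split the large workload into smaller operations, one per worker
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = (w == workers - 1) ? data.length : start + chunk;
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                return sum; // partial result for this slice
            }));
        }

        // Combine the partial results into the final answer
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}
```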
Importantly, Hadoop offers both a storage and compute solution. To achieve this, the system is composed of several parts. The two most important of these are:
Files are stored in Hadoop using a file system known as HDFS (the Hadoop Distributed File System). It is designed to provide an efficient, reliable, and resilient way of accessing large volumes of data spread across many machines.
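As a brief illustration, the sketch below uses the HDFS Java client API to land a local file in the distributed file system, a typical first step in filling a data lake. The NameNode address and file paths are hypothetical placeholders, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (placeholder address)
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; its blocks are replicated across DataNodes
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/data/raw/events.log"));
        fs.close();
    }
}
```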
Analysis is performed in Hadoop using a computational framework known as MapReduce. This Java-based framework executes workloads against data stored in HDFS. You can think of HDFS as the storage, and MapReduce as the compute. To execute a query, a MapReduce job has to be written in Java and compiled into an executable form before it can run on the cluster.
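To show what writing a MapReduce job in Java looked like in practice, here is a condensed version of the canonical word-count example from the Hadoop documentation: the map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts for each word. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every whitespace-separated token in the line
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for this word
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this simple job has to be compiled, packaged into a JAR, and submitted to the cluster, which hints at why later query engines set out to hide this complexity.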
Hadoop represented a major change to industry standards and it came with many advantages. These include:
Scalability: Hadoop clusters could be deployed at large scale to meet the growing data needs of organizations. More than anything else, this cemented its role in the industry.
Cost: Hadoop clusters were inexpensive to build, reducing cost as a barrier to implementation.
Versatility: Hadoop was able to handle a wide range of data sources, an advantage that parallels that of data lakes more generally.
Flexibility: Hadoop could be deployed across many different scenarios and use cases.
Despite its many benefits, Hadoop also has certain limitations. These include:
Inefficient resource use: Historically, Hadoop itself was considered inexpensive because it used commodity-class rather than enterprise-class servers and storage. However, those servers coupled compute and storage on the same machines to boost performance, which often meant that some resources sat unused at any given time.
Today, cloud vendors allow businesses to separate compute from storage, so you pay only for the storage or compute resources actually being consumed at a given moment. As a result, data lakes are often considered less expensive than legacy Hadoop clusters.
Cloud readiness: Hadoop was developed primarily for on-premises installations, with cloud support added as an afterthought. As a result, it struggled to adapt to the rise of cloud computing and could not match the rapid scaling and elasticity those platforms provide.
Related reading: See how cloud computing disrupted Hadoop as a technology and led to further innovation on top of it.
Complexity: Hadoop is a powerful system, but it is also extremely complex and demands significant expertise from its users; MapReduce jobs, for instance, had to be hand-written in Java. Substantial training is required to deploy, manage, and operate it well, which often imposes significant costs on the business.
Concurrency: MapReduce struggled with certain analytic workloads, offering poor concurrency in many scenarios.
Where does all of this leave us? Hadoop should be understood as the foundational data lake technology. It created the distributed infrastructure needed to store large amounts of raw data and analyze it in new ways, and its limitations pointed the industry toward the separation of compute and storage, setting the direction for what came next.
Some subsequent technologies did not replace Hadoop, but built on top of it, often making use of the same underlying processes and frameworks.
Related reading: See how Hive was developed to sit on top of Hadoop as a query engine and interact with it in a more user-friendly way.