Published: May 3, 2023
Cloud computing makes extensive use of object storage, which offers many advantages in cost, speed, and scalability. Object storage contrasts with HDFS and should be understood as an alternative technology. Importantly, object storage is not a file system. Unlike HDFS, it does not store data in files arranged in a directory hierarchy. Instead, it stores data as objects in a flat namespace, each identified by a unique key. Although each object behaves somewhat like a file, the two approaches are architecturally distinct. A data lake built on object storage records its data as a large collection of objects, and this affects the kind of information that can be kept about the data: each object carries its own metadata alongside its contents.
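The flat key/value model described above can be sketched in a few lines. This is a toy in-memory illustration only, not any real service's API: actual object stores such as Amazon S3 work over HTTP, but the core idea is the same. An object is a blob of bytes plus its own metadata, addressed by a unique key.

```python
class ToyObjectStore:
    """Illustrative sketch of an object store's flat key/value model."""

    def __init__(self):
        self._objects = {}  # flat namespace: key -> (bytes, metadata)

    def put_object(self, key, data, metadata=None):
        # Keys may *look* like file paths ("sales/2023/q1.parquet"),
        # but there is no real directory tree -- the slash is simply
        # part of the key string.
        self._objects[key] = (data, dict(metadata or {}))

    def get_object(self, key):
        # Returns the object's bytes together with its per-object metadata.
        return self._objects[key]


store = ToyObjectStore()
store.put_object("sales/2023/q1.parquet", b"raw bytes",
                 metadata={"content-type": "application/parquet"})
data, meta = store.get_object("sales/2023/q1.parquet")
```

The key names and metadata fields here are invented for illustration; the point is that metadata travels with each object rather than living in a separate file-system hierarchy.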
Object storage is significantly cheaper than the alternatives, leading to massive savings for organizations. It is often the most economical way to house large amounts of data, creating a strong economic incentive for businesses to make the shift to object storage. Although HDFS seemed inexpensive in its time, cloud object storage represented a paradigm shift that affected the industry as a whole.
The ability of object storage to hold vast quantities of data also makes it ideal for workloads that read large amounts of data concurrently across a distributed system. This mirrors the distributed storage seen with HDFS, but improves on it considerably. Object storage offers good concurrency because it allows multiple storage servers to serve reads in parallel, which makes it well suited to parallel processing applications. Query engines like Starburst make processing the data held in object storage faster and more efficient, and the ways data lakes work with object storage, including their use of metadata, have improved in recent years with new technological paradigms.
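The parallel-read pattern can be illustrated with a minimal sketch: several workers fetch disjoint byte ranges of one object concurrently, much as a distributed query engine splits a large file across many readers. The "object" here is just an in-memory bytes value standing in for a stored object; in a real system each range read would be a separate HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor

# A ~1 MB stand-in for an object held in storage.
OBJECT = bytes(range(256)) * 4096


def read_range(start, end):
    # In a real object store this would be an HTTP GET for a byte range,
    # which is what lets many readers work on one object in parallel.
    return start, OBJECT[start:end]


def parallel_read(size, chunk=256 * 1024, workers=4):
    """Fetch [0, size) in fixed-size chunks using a pool of workers."""
    ranges = [(off, min(off + chunk, size)) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: read_range(*r), ranges))
    parts.sort(key=lambda p: p[0])  # reassemble in offset order
    return b"".join(data for _, data in parts)
```

The chunk size and worker count are arbitrary illustration values; real engines choose splits based on file layout and cluster size.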
The three largest providers of cloud-based object storage technologies are:

- Amazon S3 (AWS)
- Azure Blob Storage (Microsoft)
- Google Cloud Storage (Google)
Data lakes constructed using cloud storage technology operate differently than their HDFS-based counterparts. These differences reflect the nature of object storage itself, and the cloud environment in which these systems run. Below we outline some of the advantages that cloud object storage has over HDFS installations.
Cloud systems allow resources to be increased to meet peak demand as needed. For example, imagine that a system exceeds its storage capacity. In an on-premises installation, additional resources would need to be purchased, hardware shipped and configured, and physical space allocated within one’s own organization. Cloud computing solves this problem by hosting vast numbers of servers around the world. Additional storage is added automatically, within seconds, in the background.
This is also true of compute resources. If additional processing power is required for analysis, this can be purchased and added to a cluster within seconds.
This elasticity also works in reverse. If additional storage or compute resources are no longer needed, they can easily be released. This allows resources to be managed precisely, helping to control costs while preserving the ability to scale up again when demand surges.
Both object storage and cloud computing further enhance the separation between compute and storage. Originally the two were not separated at all: both ran on the same HDFS nodes. The shift to object storage introduced that separation, because there was no longer any need to run jobs directly on the storage nodes themselves.
Cloud storage enforces the separation further because it was never designed to combine compute with storage. Services like Amazon S3, Azure Blob Storage, and Google Cloud Storage are pure storage technologies; they do not include a query engine. This decoupling means you pay separately, and only for what you use, for computation and for storage, leading to significant cost savings. It also makes robust query engines, like Starburst, an essential component of working with cloud object storage.
With the emphasis on cloud computing often seen today, it's important to remember that on-premises installations are still an important part of the industry. This approach uses in-house servers and dedicated computing resources to establish the infrastructure for the data lake. Deployments of this nature have traditionally been based on HDFS, though object storage is used increasingly and, in many businesses, has overtaken HDFS.
Because all servers are located outside of the cloud, scaling is limited to the resources an organization can acquire in-house. Any additional compute and storage capacity must be added manually, which can be time-consuming and cumbersome to implement, and is often seen as a limiting factor for on-premises data lake installations.
Apache Iceberg is a table format, originally created at Netflix, that provides database-like functionality on top of object stores such as Amazon S3.
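The idea behind a table format can be sketched conceptually. This is not the real Iceberg specification, and the bucket and file names are invented for illustration: the table is just a metadata document, itself stored as another object, that records snapshots, and each snapshot lists the data-file objects it contains. A query engine reads the metadata first, then fetches only the data files belonging to the snapshot it needs, which is what enables database-like features such as consistent reads over a plain object store.

```python
# Conceptual sketch only -- simplified far beyond the actual Iceberg spec.
metadata = {
    "table": "sales",
    "current-snapshot-id": 2,
    "snapshots": [
        {"id": 1, "files": ["s3://example-bucket/sales/a.parquet"]},
        {"id": 2, "files": ["s3://example-bucket/sales/a.parquet",
                            "s3://example-bucket/sales/b.parquet"]},
    ],
}


def files_for_current_snapshot(meta):
    # Resolve the current snapshot, then return the data files it lists.
    current = meta["current-snapshot-id"]
    snapshot = next(s for s in meta["snapshots"] if s["id"] == current)
    return snapshot["files"]


current_files = files_for_current_snapshot(metadata)
```

Because older snapshots remain in the metadata, a reader can also query the table as it existed at an earlier point, one of the database-like capabilities a table format layers over immutable objects.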