Data lake storage: Unstructured data, schema, metadata

Here, we explore the three types of data held in a data lake (structured, semi-structured, and unstructured) and how data lakes make use of each. We also cover the operational mechanisms that keep a data lake efficient: schema management, data organization, and metadata storage.

Structured data

Structured data conforms to a predefined schema, typically rows and columns. It makes up a significant share of the data stored in most systems, and data lakes are no exception. Although data lakes are best known for holding raw data of any type, they can and do store structured data alongside other formats.

Semi-structured data

Semi-structured data has some structure, but not enough to fit neatly into a relational database or other traditional data storage system. Examples include JSON, XML, and CSV files.

This is where data lakes are helpful: they can store semi-structured data alongside structured data, so both types coexist in the same system.
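
To make "some structure, but not enough" concrete, here is a minimal sketch in Python. The two JSON records are invented for illustration. They share a few fields, but each also carries fields the other lacks, which is what makes them awkward to load into a single fixed relational table.

```python
import json

# Two hypothetical JSON records: same kind of entity, different shapes.
raw = [
    '{"id": 1, "name": "ada", "tags": ["vip"]}',
    '{"id": 2, "name": "bob", "address": {"city": "Leeds"}}',
]

records = [json.loads(line) for line in raw]

# Fields present in every record versus fields seen in at least one record.
common_fields = set.intersection(*(set(r) for r in records))
all_fields = set.union(*(set(r) for r in records))

print(sorted(common_fields))  # fields a fixed table could rely on
print(sorted(all_fields))     # fields a fixed table would have to anticipate
```

A data lake can store both records as they are, leaving the decision about which fields matter to whoever reads them later.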

Unstructured data

Unlike structured and semi-structured data, unstructured data does not conform to any preset schema or format. Examples include images, audio, video, and free text. As such, unstructured data is poorly suited to storage in a relational database and is a natural fit for a data lake.

Schema-on-read vs schema-on-write

Schemas define the structure of a dataset.

There are two main methods of organizing schemas that impact the storage of data in a data lake: schema-on-write and schema-on-read.

Schema-on-write

Schema-on-write is a data management construct where data schemas are created before the data is written to the database. When data later enters the system, it must comply with this schema from the outset. Data that does not fit the schema is rejected by the system.

This construct is closely tied to relational database management systems, and is useful in cases where the data in question already fits a specific, predictable format known in advance. Although used in some data lakes, this approach is more often associated with data warehouses.
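
The idea can be sketched in a few lines of Python. This is an illustration only, not how any particular database implements it; the `ORDER_SCHEMA` columns and the `SchemaOnWriteTable` class are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical schema, declared before any data arrives: column -> required type.
ORDER_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the declared schema."""


@dataclass
class SchemaOnWriteTable:
    schema: dict
    rows: list = field(default_factory=list)

    def write(self, record: dict) -> None:
        # Validate before storing: the field set and every type must match.
        if set(record) != set(self.schema):
            raise SchemaError(
                f"expected fields {sorted(self.schema)}, got {sorted(record)}"
            )
        for name, expected in self.schema.items():
            if not isinstance(record[name], expected):
                raise SchemaError(f"{name!r} must be {expected.__name__}")
        self.rows.append(record)


table = SchemaOnWriteTable(schema=ORDER_SCHEMA)
table.write({"order_id": 1, "customer": "ada", "amount": 9.5})

try:
    # order_id is a string here, so the write is refused.
    table.write({"order_id": "2", "customer": "bob", "amount": 3.0})
    rejected = None
except SchemaError as err:
    rejected = str(err)
```

The second record never reaches storage. The cost of that guarantee is that the schema has to be known, and agreed, before the first write.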

Schema-on-read

Schema-on-read is a data management construct where a schema is applied when the data is read rather than when it is written. No schema validation takes place as data is written to the data lake, so data is stored in its original form and interpreted against a schema only at query time. Schema-on-read is often associated with data lakes.
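
As a rough sketch of the same idea in Python: raw lines are stored untouched, and a schema chosen by the reader is applied only when the data is read. The `READ_SCHEMA` mapping and the `read_with_schema` helper are hypothetical, and turning missing or unconvertible values into `None` is one possible policy among several.

```python
import json

# Raw lines land in storage exactly as they arrived; nothing is checked on write.
raw_lines = [
    '{"order_id": 1, "customer": "ada", "amount": "9.5"}',
    '{"order_id": 2, "customer": "bob"}',
    "not json at all",
]

# Hypothetical schema chosen by the reader at query time: column -> converter.
READ_SCHEMA = {"order_id": int, "customer": str, "amount": float}


def read_with_schema(lines, schema):
    """Yield rows shaped by `schema`; missing or unconvertible values become None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the bad line is discovered at read time, not at write time
        if not isinstance(record, dict):
            continue
        row = {}
        for name, convert in schema.items():
            try:
                row[name] = convert(record[name])
            except (KeyError, TypeError, ValueError):
                row[name] = None
        yield row


rows = list(read_with_schema(raw_lines, READ_SCHEMA))
```

Note that the malformed third line was accepted on write and only surfaces as a problem when a reader tries to interpret it. That is the trade-off schema-on-read makes in exchange for flexible ingestion.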

Indexing, partitioning, and bucketing

Data lakes do not traditionally offer built-in indexing capabilities in the same way as relational databases. Some vendors, including Starburst, are addressing that gap with software enhancements.

By their nature, data lakes store large volumes of data. Without an index or another way to narrow the search, a SQL query may have to read every row of every table it references. This makes query run times long and hampers overall system efficiency.

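Partitioning and bucketing address this by dividing large datasets into smaller groupings, so a query can skip the groupings it does not need. Partitioning typically splits a dataset by the value of a column, such as a date, and bucketing distributes rows across a fixed number of groups by hashing a column. You can think of each grouping as a sub-directory in a larger directory system.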
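
The sketch below shows the directory analogy in Python. It is a simplified model, not any engine's actual layout: the `key=value` directory naming follows the common Hive-style convention, while the bucket count, the file naming, and the use of `zlib.crc32` as the hash are assumptions made for the example (real engines use their own hash functions).

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count


def partition_path(table: str, record: dict, partition_key: str) -> str:
    """One sub-directory per distinct value of the partition column."""
    return f"{table}/{partition_key}={record[partition_key]}"


def bucket_for(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the bucketing column, reduced to a bucket number."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def file_path(table: str, record: dict, partition_key: str, bucket_key: str) -> str:
    bucket = bucket_for(str(record[bucket_key]))
    return f"{partition_path(table, record, partition_key)}/bucket_{bucket:05d}.jsonl"


records = [
    {"order_date": "2024-01-01", "customer": "ada", "amount": 9.5},
    {"order_date": "2024-01-01", "customer": "bob", "amount": 3.0},
    {"order_date": "2024-01-02", "customer": "ada", "amount": 7.25},
]

# Simulated storage layout: file path -> records written to that file.
layout = {}
for rec in records:
    path = file_path("orders", rec, "order_date", "customer")
    layout.setdefault(path, []).append(rec)


def scan_partition(layout: dict, table: str, partition_key: str, value: str) -> list:
    """Read only the files under the matching partition directory."""
    prefix = f"{table}/{partition_key}={value}/"
    return [rec for path, recs in layout.items() if path.startswith(prefix) for rec in recs]


jan_first = scan_partition(layout, "orders", "order_date", "2024-01-01")
```

A filter on `order_date` touches only one directory, and rows for the same customer always land in the same bucket number, which is what lets an engine skip files rather than read everything.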

Storing metadata

The data inside a data lake can be stored in a number of different formats, and the system needs a way to keep track of these differences. Data that describes other data in this way is called metadata, and the storage of metadata is an integral aspect of any data lake.

Here, we’ll learn how metadata is stored, and why it is critical in data lakes.

Structuring metadata

Metadata contains information showing how all of the data files in the data lake are organized. For this reason, even though the data lake itself is not structured, the metadata about that data is always structured and held in a separate repository.

To query data in a data lake, we need to understand how the data is structured. Since structure wasn’t imposed on the data when it was brought into the data lake, we must supply this metadata via a metastore before the data can be effectively queried.

Metastores

A metastore is a special, dedicated repository used to store metadata relating to the data held in the data lake. It operates as a separate datastore that keeps track of the metadata for the system and fields requests about the structure of a given dataset. Typical metadata includes information relating to the storage system such as the file format, directory structure, and location of data within its files.

Two popular metastores are the Hive Metastore and the AWS Glue Data Catalog. These services act as an intermediary between the user’s request and the datasets held in the data lake. When a request is made, information about the structure of the dataset in question is retrieved from the metastore. This information is then used to retrieve the source data from the data lake.
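
The request flow described above can be sketched with a toy in-memory metastore in Python. This does not use or imitate the Hive Metastore or AWS Glue APIs; `ToyMetastore`, `TableMetadata`, `plan_scan`, and the `s3://example-bucket/orders` location are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    location: str          # where the table's files live
    file_format: str       # how those files are encoded
    columns: tuple         # (name, type) pairs
    partition_keys: tuple  # directory-level keys, in order


class ToyMetastore:
    """In-memory stand-in for a metastore: table name -> structured metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, name: str, metadata: TableMetadata) -> None:
        self._tables[name] = metadata

    def describe(self, name: str) -> TableMetadata:
        return self._tables[name]


def plan_scan(metastore: ToyMetastore, table: str, partition_filter: dict) -> dict:
    """Step 1: ask the metastore for structure. Step 2: work out what to read."""
    meta = metastore.describe(table)
    parts = []
    for key in meta.partition_keys:
        if key not in partition_filter:
            break  # deeper directories cannot be pinned down without this key
        parts.append(f"{key}={partition_filter[key]}")
    path = "/".join([meta.location.rstrip("/")] + parts)
    return {
        "path": path,
        "format": meta.file_format,
        "columns": [name for name, _ in meta.columns],
    }


metastore = ToyMetastore()
metastore.register(
    "orders",
    TableMetadata(
        location="s3://example-bucket/orders",
        file_format="parquet",
        columns=(("customer", "varchar"), ("amount", "double")),
        partition_keys=("order_date",),
    ),
)

plan = plan_scan(metastore, "orders", {"order_date": "2024-01-01"})
```

The query layer never inspects the files to learn their structure. It asks the metastore first, then reads only the location and format the metastore reports.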

Data Lake Storage

Data lake storage houses a wide variety of data types, including structured, semi-structured, and unstructured data. Each of these data types serves a specific purpose and brings unique value to the data ecosystem within a data lake.
