Cookie Notice
This site uses cookies for performance, analytics, personalization and advertising purposes.
For more information about how we use cookies please see our Cookie Policy.
Manage Consent Preferences
These cookies are essential in order to enable you to move around the website and use its features, such as accessing secure areas of the website.
These are analytics cookies that allow us to collect information about how visitors use a website, for instance which pages visitors go to most often, and if they get error messages from web pages. This helps us to improve the way the website works and allows us to test different ideas on the site.
These cookies allow our website to properly function and in particular will allow you to use its more personal features.
These cookies are used by third parties to build a profile of your interests and show you relevant adverts on other sites. You should check the relevant third party website for more information and how to opt out, as described below.
Published June 29, 2023
Unstructured, semi-structured, and structured data have very different data formats, scalability, and real-time data analysis with business implications for data-driven decision making and interpretation.
With the growth of big data and enterprise data, scalability has also become a critical factor in data analysis. Organizations need scalable infrastructure, such as a modern data analytics architecture or cloud-based platforms, to process and store large volumes of data efficiently.
Additionally, real-time data analysis is a competitive advantage. Organizations that are able to collect and process data immediately or near-instantaneously as it is generated will have an edge.
Let’s take a closer look at how unstructured, semi-structured and structured data impact an organization.
Unstructured data is not conformed to any preset schema or format. Traditionally, unstructured data was rare, but this has evolved due to the rise of data from online sources. It is an abundant data type and represents a significant challenge due to its complexity and lack of organization.
As such, it is fundamentally impossible to store unstructured data in a relational database and must be stored in a data lake. The need to house and analyze this type of data was one of the main reasons for the data storage development of data lakes.
Examples of unstructured data and their file formats include:
Imagine a photo album on your phone. It houses many individual files. However, each photo isn’t hierarchically ordered into neatly-structured columns and rows because that’s not the best way to store this kind of data.
Analyzing datasets of this type can often yield rich insights. However, traditional analysis would require massive transformation of the entire dataset upfront.
Unstructured data is often difficult to store and analyze, but it can be very valuable to data analytics and data management. Use cases for unstructured data might be:
Structured data conforms to a regular, predefined schema. Usually, this pattern is set out in advance before any individual data is ever collected. You can think of this structure as a blueprint that governs the way that data within the dataset is stored.
Unstructured data lacks a predefined format and organization, such as text, images, or social media posts. Structured data, on the other hand, has a predefined structure and follows a consistent schema, like spreadsheets or databases.
Examples of structured data include:
Structured data is particularly suitable for quantitative data analysis and data mining techniques, as it enables straightforward aggregation, filtering, and manipulation.
Structured data is recorded as a series of columns and rows. Each row represents a different instance, a different individual record. At the same time, each column represents a different attribute of those instances. Structured data makes up a significant amount of the data stored in any database, including a data lake.
The image below shows an example of structured data. Notice how each element in the database is positioned in relation to the other elements. This positioning captures and represents the relation between elements in the real world, and can be adjusted as necessary.
Structured data is powerful and is typically stored in a data warehouse in a tabular format. The data in a data warehouse is typically stored in a relational database management system (RDBMS).
Structure provides certain minimum standards that govern the type of data collected. By recording that data in predefined fields, you can ensure that as much of the data collected as possible is correctly tagged and listed in fields. These divisions can later be used to compare data, or to perform operations. In this sense, the way that you structure data today can help you perform analysis on that data tomorrow.
You create structured data all the time in your everyday life without knowing it. Data lakes make use of structured data to store information from a variety of sources. Imagine that you enter a new contact into your phone. To create the contact, you’ll need to provide some specific information which is entered into fields. This might include a first name, last name, email address, and phone number. All contacts created in the phone require the same information each time a new entry is created. This is structured data at work.
Datasets can also be semi-structured. Semi-structured does not use tables, rows, or columns. Instead, datasets of this type use tags and markers (i.e. metadata) to achieve some of the same results seen in structured datasets without the same need for a predefined schema. Data lakes are able to store semi-structured data alongside structured data, allowing data from one data type to coexist with data from the other.
Examples of semi-structured data and their data formats include:
You make use of semi-structured data every time you browse the web. Imagine that you are reading the news on a website. When you scroll through the list of articles, you’re viewing data that was retrieved from a database. This data is accessed and displayed by your browser according to predefined rules.
In this model, there is both a backend system storing the information and a frontend system displaying it. Do they speak the same language? Most likely not, but they can store information in intermediate languages like XML and JSON.
To do this, the data lake holds data as a sequence of data pairings where the data itself is held as a value, and the label for that data is recorded as a key. This structure is known as a key-value pair. Later, when you want to access a particular element’s value, you simply ask the data lake to access the unit of data that corresponds to the required key.
Key-value pairs are a powerful data model and long strings of pairings can be held. Further hierarchy can be introduced by placing one set of key-value pairs inside another, creating a nested, parent-child relationship. This mirrors the function structure found in object-oriented application languages e.g. Java, JS, C++, C#, or Python.
The image below shows a common unstructured data type known as JavaScript Object Notation (JSON). You can see a sequence of key-value pairs, each with their own relation. If a system or user wanted to access Value 3, they would need to input Key 3. Data lakes are particularly good at storing semi-structured data.
Sure, techniques like text mining, and machine learning algorithms, powered by artificial intelligence, are employed to extract meaningful information from unstructured data. However, for now, large language models can be prone to errors. We can’t have our customers making decisions based on bad answers. Your data science teams, data engineers and data scientists won’t be happy about that.
Starburst enables organizations to query data using structured query language (SQL) across various data sources, including unstructured, semi-structured, and structured data.
Here’s how Starburst facilitates querying for each of these three data types:
Unstructured data presents a unique challenge to organizations because it lacks structure, but it also holds immense potential for gaining valuable insights when processed and analyzed effectively.
Starburst can integrate with tools like Apache Hadoop and Apache Kafka, which are commonly used for storing unstructured data. Through connectors and integrations, Starburst can access these data sources and enable SQL-based querying. By defining external tables and schemas, users can leverage SQL to query the unstructured data and extract relevant information.
Starburst supports connectors to popular data systems like Apache Hive, and Apache Drill, which are designed to handle semi-structured data. These connectors enable organizations to access the semi-structured data and use SQL queries to extract and manipulate the desired information.
Finally, Starburst can connect to a wide range of structured data sources, including traditional databases like Oracle, SQL Server, MySQL, as well as cloud-native data warehouses like Amazon Redshift, Google BigQuery, and Snowflake. By establishing connections to these sources, organizations can leverage SQL to query and analyze the structured data.
With Starburst, because 80% to 90% of data generated and collected by organizations is unstructured, you can effectively go from a “data warehouse for all your data analytics” to a “data warehouse is a component of a modern data lake.”
© Starburst Data, Inc. Starburst and Starburst Data are registered trademarks of Starburst Data, Inc. All rights reserved. Presto®, the Presto logo, Delta Lake, and the Delta Lake logo are trademarks of LF Projects, LLC
Up to $500 in usage credits included