Published: May 9, 2023
Businesses and organizations generate huge amounts of business data from a wide variety of sources, such as sales, marketing, and customer interactions. To be successful, decision-makers must tap into the data they generate and use it to gain insights and make business decisions. The ability to do this successfully enables organizations of all types to continue growing and adapting, even as the landscape of their industry changes.
Accessing and leveraging data, however, is challenging. Proper data management systems require an approach that takes into account both the sources of data and how that data will be used in the future.
Data warehouses are one of several possible solutions available to businesses and organizations that want to use their data to make smart business decisions. Data warehouses have been used extensively in the data space for over 40 years.
In this post, we’ll trace the journey from data to data warehouses, define what a data warehouse is, weigh the advantages and disadvantages of using one, and explore how data warehouses store data for analytics. Finally, we’ll look at some of the most popular types of data warehouses and the systems used to create them.
To understand the journey from data to data warehouses, start where data is generated: it is often collected and processed by an Online Transactional Processing (OLTP) database.
Think of a point of sale system that logs a sale and payment information, or an ATM that allows a customer to deposit a check and withdraw money. Transactional databases are excellent for collecting information in the moment, but the same features that allow them to process transactions quickly make them inefficient for retrieving data for analysis.
Therefore, the data analysis used to make business decisions is often done using a different database system. These systems are called Online Analytical Processing (OLAP) databases because they are optimized to perform analytical functions quickly. A data warehouse is one example of an analytical database.
A data warehouse is a purpose-built database system that helps data consumers use current and historical data to generate insights that inform business decisions, including the reporting and forecasting of business activity. Data warehouses are also optimized for handling large workloads to support various types of analytics, including predictive analytics, data visualization, and decision support.
You may have noticed that this definition allows for quite a bit of flexibility: it does not stipulate any particular tools or hardware for storing data. This flexibility is precisely what makes data warehouses a powerful tool: they can be built to efficiently perform the specific functions that best suit a particular business.
Data warehouses are optimized for analytic speed and efficiency, and this efficiency is one of their key strengths. It is possible because data warehouses store structured data. This data has been transformed and loaded into the data warehouse following a specific design tailored to the types of questions a business wants to answer. Analysts know the structure of this data and can query the data warehouse easily.
However, the efficiency of data warehouses for answering specific business questions comes at a cost.
A data warehouse’s structure must be rigorously planned by data warehouse architects before the data warehouse is built. This data model is called schema-on-write because the collected data is transformed to follow a pre-existing structure as it is stored. Designing and implementing a data structure that will work for a data warehouse is time-consuming and requires expert guidance.
Additionally, once data has been transformed and loaded into the data warehouse, the way that data is structured cannot be easily modified. Any new data loaded into the data warehouse must follow the original design so that the data warehouse works as intended. Therefore, if new questions arise, the data may not be structured to answer them efficiently. This lack of adaptability can create problems as industries evolve and new types of data become available to collect.
Some types of data lend themselves well to storage within a data warehouse. For example, financial transaction data, operational data, customer relationship data, and enterprise resource planning data are typically stored in a data warehouse.
However, organizations don’t typically store all of the data they collect in a data warehouse. To do so would be cost-prohibitive in terms of both storage volume and database administrator bandwidth.
Social media data, documents, and sensor data are some examples of unstructured data that might not be stored in data warehouses because they cannot be easily consolidated or structured. Data of this type is typically handled by other technologies, such as data lakes or data lakehouses, that do not restructure data before it is stored.
Some organizations use a data warehouse as their only analytical data repository. In these organizations, data analysts would only have access to data stored in a data warehouse. This could be limiting because data warehouses might not store all of the data the organization collects.
Whether or not this is a problem depends on the questions that the organization needs to answer. If new questions need to be answered or new data becomes available, it can be difficult to adjust the data warehouse. If this is a problem, the organization might consider using a data lake alongside its data warehouse or using a data lakehouse to improve the lifecycle of their data.
Certainly, data warehouses are an important part of the data landscape. Many companies choose to use data warehouses, either on their own or alongside other data solutions. Whether or not a data warehouse is the best solution for a particular use case depends on the specifics of that use case, including the types of data the organization collects, the questions it needs to answer, and how those questions might change over time.
As we’ve learned above, data must be structured before it is moved into a data warehouse.
Now, we’ll examine how data in data warehouses is typically structured, and how this structure differs from how data is structured in transactional databases. More specifically, we’ll explore schemas used in analytical databases, focusing on how schema influences database function.
Collections of data that are organized for rapid retrieval are known as databases.
In relational databases, data is organized into a schema. Think of a schema as being similar to a blueprint. It defines both the structure of the data within the database and its relation to other data. The data within a schema is organized into tables that are connected with keys.
Relational databases can use a wide variety of schemas, and the type they use determines the functions that the database performs well. Normalized schemas used in transactional systems are best for writing data from real-time transactions.
In contrast, other schemas, such as those used in analytical systems like data warehouses, are best for storing and reading large amounts of data for analysis.
These different schemas allow for a huge array of possibilities for database design, but it is critical to understand that systems optimized for processing transactions will not function well for analysis, and vice versa.
Next, we’ll explore schemas commonly used in online analytical processing (OLAP) databases, particularly in dimensional data warehouses.
Business analysts need to use data that has been optimized for analysis, which means retrieving a huge number of records quickly and efficiently. However, these databases typically don’t need to be updated frequently because business analysts are working with aggregated data. Therefore, unlike in transactional databases, avoiding data redundancy is not required.
One way to efficiently store data in analytical databases is to structure the data following a dimensional database model. This storage structure differs from a normalized database design by intentionally denormalizing data so that related information is grouped in tables, even when that means tables contain dependencies or redundancies.
The tables in dimensional models include fact tables and dimension tables. Fact tables contain aggregated measurements and metric data that summarize the data a business has collected in a variety of ways. Essentially, they provide analysts with easy access to facts about the business. For example, if a business had provided a place for customers to subscribe to an email list, a fact table might store information on the number of subscribers per month and year, as well as the methods by which they subscribed. These aggregations might help data analysts determine that customers are more likely to subscribe to the email list in July than in January, which could help guide advertising efforts.
In contrast, dimension tables describe the facts contained in the fact table by providing context about those facts. Continuing the email list example from above, the dimension tables might group customers in specific ways, or provide customer location or the website pages through which customers subscribed to the email list.
In data warehouses that use a dimensional model, data is stored in both dimension and fact tables. Storing data in these tables makes it easy for business analysts to query data that is relevant to business performance and look at historical trends because the data is already stored in an aggregated manner. Dimension tables have keys that link them to fact tables so they can provide the context around the facts. Fact tables contain the majority of the data in data warehouse schemas, so they occupy most of the storage space in the data warehouse.
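To make the fact/dimension split concrete, here is a minimal sketch in Python using the standard library’s sqlite3 module. All table and column names (subscription_facts, date_dim, channel_dim, and so on) are hypothetical, invented for the email-list example above.

```python
import sqlite3

# In-memory database standing in for a dimensional data warehouse.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Dimension tables provide context: when and how customers subscribed.
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    month    TEXT,
    year     INTEGER
);
CREATE TABLE channel_dim (
    channel_key  INTEGER PRIMARY KEY,
    channel_name TEXT
);
-- The fact table holds aggregated measurements (subscriber counts)
-- plus keys linking each fact to its dimensions.
CREATE TABLE subscription_facts (
    date_key        INTEGER REFERENCES date_dim(date_key),
    channel_key     INTEGER REFERENCES channel_dim(channel_key),
    new_subscribers INTEGER
);
""")

conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)",
                 [(1, "January", 2023), (2, "July", 2023)])
conn.executemany("INSERT INTO channel_dim VALUES (?, ?)",
                 [(1, "website form"), (2, "in-store signup")])
conn.executemany("INSERT INTO subscription_facts VALUES (?, ?, ?)",
                 [(1, 1, 120), (1, 2, 40), (2, 1, 310), (2, 2, 95)])

# The analyst's question: in which month do more customers subscribe?
for month, total in conn.execute("""
    SELECT d.month, SUM(f.new_subscribers)
    FROM subscription_facts f
    JOIN date_dim d ON f.date_key = d.date_key
    GROUP BY d.month
"""):
    print(month, total)  # e.g. January 160, July 405
```

Note that the fact table stores only keys and measures; every descriptive attribute lives in a dimension table.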
The three primary dimensional schemas used in data warehouses are star, snowflake, and galaxy schemas, which promote fast and efficient querying of large data sets.
These schemas differ in how normalized their dimension tables are. They also differ in how dimension tables are related to fact tables. These differences influence the efficiency with which data can be stored and queried.
The descriptions below outline how each of these schemas functions in practice and explore them with a simple example of credit card payments data from Burst Bank.
In a star schema, each dimension table is related directly to a fact table. Further, each dimension is represented with only one table. The dimension tables are not normalized. This schema works well when the data being stored is not very complex. Queries on star schemas are extremely efficient because they do not need to join very many tables.
In the figure below, the credit card payments fact table is related to four dimensions: location, date, credit card account, and customer group. Customer group includes estimated income and age. The fact table also includes many aggregate measures of credit card payment data, such as the percent of full payments collected and the percent of minimum payments collected. Business analysts can use these measures to answer the types of questions that will help their company make business decisions. An example of a question that they could answer with this data is: Are customers over the age of 40 more likely than customers under 40 to make full payments?
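As a hedged sketch of how that question might be answered against a star schema (the table and column names here, such as payment_facts and age_band, are assumptions for illustration, not the actual schema in the figure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: each denormalized dimension joins directly to the fact table.
CREATE TABLE customer_group_dim (
    group_key INTEGER PRIMARY KEY,
    age_band  TEXT  -- hypothetical attribute: 'under 40' or '40 and over'
);
CREATE TABLE payment_facts (
    group_key        INTEGER REFERENCES customer_group_dim(group_key),
    pct_full_payment REAL  -- aggregate measure, as in the figure
);
""")
conn.executemany("INSERT INTO customer_group_dim VALUES (?, ?)",
                 [(1, "under 40"), (2, "40 and over")])
conn.executemany("INSERT INTO payment_facts VALUES (?, ?)",
                 [(1, 0.52), (1, 0.48), (2, 0.71), (2, 0.64)])

# Are customers over 40 more likely to make full payments?
# The star shape means one join is enough.
for band, avg_pct in conn.execute("""
    SELECT g.age_band, AVG(f.pct_full_payment)
    FROM payment_facts f
    JOIN customer_group_dim g ON f.group_key = g.group_key
    GROUP BY g.age_band
"""):
    print(band, round(avg_pct, 2))
```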
In a snowflake schema, each dimension connects to a central fact table. Some of the dimension tables are normalized so that they have sub-dimensions. This normalization reduces some of the data redundancy and makes it easier to maintain and store data. However, this comes at the cost of less efficient queries because each query requires more tables to be joined together.
In the figure below, the credit card payments fact table is related to the same dimensions as in the star schema example. However, each dimension table is also connected to one or more sub-dimension tables. For example, the date dimension table has sub-dimensions for month, quarter, and year. The sub-dimension tables for month and year include abbreviations that business analysts might use to look up month or year from the database.
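To see the extra joins in practice, here is a small sketch (hypothetical names again) in which the date dimension from the example above is normalized into month and year sub-dimension tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake schema: the date dimension is normalized into sub-dimensions,
-- reducing redundancy at the cost of extra joins per query.
CREATE TABLE year_dim  (year_key INTEGER PRIMARY KEY, year_abbrev TEXT);
CREATE TABLE month_dim (month_key INTEGER PRIMARY KEY, month_abbrev TEXT);
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,
    month_key INTEGER REFERENCES month_dim(month_key),
    year_key  INTEGER REFERENCES year_dim(year_key)
);
CREATE TABLE payment_facts (
    date_key           INTEGER REFERENCES date_dim(date_key),
    payments_collected INTEGER
);
""")
conn.executemany("INSERT INTO year_dim VALUES (?, ?)", [(1, "23")])
conn.executemany("INSERT INTO month_dim VALUES (?, ?)", [(1, "Jan"), (2, "Jul")])
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)", [(1, 1, 1), (2, 2, 1)])
conn.executemany("INSERT INTO payment_facts VALUES (?, ?)", [(1, 900), (2, 1400)])

# The same monthly rollup now needs three joins instead of one.
for row in conn.execute("""
    SELECT m.month_abbrev, y.year_abbrev, SUM(f.payments_collected)
    FROM payment_facts f
    JOIN date_dim  d ON f.date_key  = d.date_key
    JOIN month_dim m ON d.month_key = m.month_key
    JOIN year_dim  y ON d.year_key  = y.year_key
    GROUP BY m.month_abbrev, y.year_abbrev
"""):
    print(row)  # e.g. ('Jan', '23', 900) and ('Jul', '23', 1400)
```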
The galaxy schema, also known as the fact constellation schema, has several fact tables that each hold the keys and aggregate data for specific parts of a business. Each fact table can connect to some of the same dimension tables, but each also connects to some dimension tables not connected to other fact tables. In this design, dimension tables cannot have sub-dimensions. This schema results in slower queries than both the star and snowflake schemas because its complexity means more tables must be joined when the data is queried. However, galaxy schemas can be useful for extremely complex data because they lower data redundancy by normalizing the data and mapping data to multiple fact tables.
In the figure below, a fact table that contains aggregated data about the merchant at which customers made purchases has been added to Burst Bank’s data warehouse. The merchant fact table is connected to a dimension table that describes merchants by the category of goods or services they sell.
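A compact sketch of that constellation shape, with the same caveat that every name here is invented: two fact tables share the date dimension, while the merchant fact table also has a category dimension of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Galaxy (fact constellation) schema: multiple fact tables.
-- A dimension shared by both fact tables:
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);

-- A dimension connected only to the merchant fact table:
CREATE TABLE merchant_category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT  -- e.g. 'groceries', 'travel'
);

-- Two fact tables, each holding the keys and aggregates for one
-- part of the business:
CREATE TABLE payment_facts (
    date_key         INTEGER REFERENCES date_dim(date_key),
    pct_full_payment REAL
);
CREATE TABLE merchant_facts (
    date_key       INTEGER REFERENCES date_dim(date_key),
    category_key   INTEGER REFERENCES merchant_category_dim(category_key),
    purchase_total REAL
);
""")
```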
Now that you’ve explored the different types of schemas used to store data in data warehouses, let’s consider how schema affects how efficiently data can be queried, including the transformation from a third normal form (3NF) schema that would be used in a transactional database to a star schema that would be used in a data warehouse.
Data warehouses typically use star, snowflake, or galaxy schemas to store data. These schemas each use a combination of fact and dimension tables to store data. Fact tables make up the bulk of the data in a data warehouse, and store data aggregations of business value, along with keys that link them to dimension tables.
Dimension tables store information about data, such as time. Unlike schemas used in transactional databases, the schemas used in data warehouses do not need to be normalized.
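As a rough sketch of that 3NF-to-star flattening (with invented names throughout), normalized customer and city tables can be collapsed into a single denormalized customer dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 3NF source: customer attributes split across related tables.
CREATE TABLE city (city_id INTEGER PRIMARY KEY, city_name TEXT, region TEXT);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    city_id     INTEGER REFERENCES city(city_id)
);
""")
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [(1, "Austin", "South"), (2, "Boston", "Northeast")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Ada", 1), (2, "Grace", 2)])

# Flatten into one denormalized star-schema dimension: city_name and
# region are intentionally copied onto every customer row.
conn.execute("""
    CREATE TABLE customer_dim AS
    SELECT c.customer_id, c.name, ci.city_name, ci.region
    FROM customer c JOIN city ci ON c.city_id = ci.city_id
""")
print(conn.execute("SELECT * FROM customer_dim").fetchall())
```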
So far we’ve explored how the schemas used in data warehouses differ from those used in transactional systems, and how these schemas improve query efficiency. Query efficiency can also be influenced by how data is stored within the tables of the schemas.
Now, we’ll explore how the format of the tables that store data influences query speed in a data warehouse. First, we’ll review the difference between row-oriented and column-oriented file formats. Then, we’ll examine how the file format used in a data warehouse influences query speed.
When data is stored in a row-oriented manner, each record or observation is recorded in a row. The attributes of each record are recorded in columns. Data is stored in a row as an object with each of its respective attributes. In contrast, when data is stored in a column-oriented manner, each attribute of the record is stored as its own object.
The figure below shows the differences between these two types of storage. This example table holds student data. Each record contains information about each student, including Student ID as the primary key, First Name, Last Name, Email, and Age. The data doesn’t change based on the storage type, but how the data is sorted and read differs.
In a row-oriented database, the data will be read on a row-by-row basis, from left to right.
In a column-oriented database, the data is stored as columns and read from top to bottom.
In the row-oriented table, you can see that the data on each student is stored together, with Student ID as the primary key and First Name, Last Name, Email, and Age as the attributes. In the column-oriented table each attribute, such as First Name, is stored with other data about that attribute. So, all Student IDs are stored in one object, all First Name values are stored in another object, all Last Name values are stored in another, and so on.
Row-oriented files collect and store data about an object and its attributes. This makes them ideal for transactional systems, where data is collected at a granular level. However, analytical queries typically focus on columns, such as student age or total sales, rather than on rows, such as individual students or customer purchases. When data stored in a row-oriented format is analyzed by a system that operates on columns, data from every attribute must be read, even if a query concerns only a few attributes. This can significantly slow processing time, which increases compute costs and delays insights.
Since analytical queries typically focus on columns, using column-oriented file formats means queries can be more efficient because analytical systems are able to read only the columns that are relevant to a particular query. Additionally, the data is stored by attribute, so it is easy to calculate aggregated data points such as min, max, average, and standard deviation on a set of columns. For these reasons, column-oriented file formats are best for analytic queries.
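The difference is easy to model in plain Python: a row-oriented layout is roughly a list of records, while a column-oriented layout keeps each attribute together, so an aggregate such as average age only needs to touch one object. A minimal sketch of the student example above (the values are made up):

```python
# Row-oriented: each record is stored together with all its attributes.
rows = [
    {"student_id": 1, "first_name": "Maya", "last_name": "Lin",  "age": 20},
    {"student_id": 2, "first_name": "Omar", "last_name": "Diaz", "age": 23},
    {"student_id": 3, "first_name": "Sam",  "last_name": "Hale", "age": 21},
]

# Column-oriented: each attribute is stored as its own object.
columns = {
    "student_id": [1, 2, 3],
    "first_name": ["Maya", "Omar", "Sam"],
    "last_name":  ["Lin", "Diaz", "Hale"],
    "age":        [20, 23, 21],
}

# Analytical query: average student age.
# Row-oriented: every record must be read, though only 'age' matters.
avg_from_rows = sum(record["age"] for record in rows) / len(rows)

# Column-oriented: only the 'age' object is read.
ages = columns["age"]
avg_from_columns = sum(ages) / len(ages)

assert avg_from_rows == avg_from_columns  # same answer, less data scanned
print(avg_from_columns)
```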
We have now explored the common schemas used in data warehouses and looked at how the file formats used to house data influence analytic query speed.
Next, it’s time to consider how data gets loaded from source systems into data warehouses and how it is accessed by data consumers. In the context of a data warehouse, data consumers are usually business analysts who use the data stored in data warehouses to make informed business decisions.
The process of moving data from capture to analysis is called a data pipeline. Data pipelines are extremely important because they influence the speed, cost, and efficiency with which business decisions can be made.
Let’s explore data pipelines and their use in data warehouses. Businesses use data pipelines in different ways, so we’ll also explore the variation in data pipeline architecture, including some common use cases for different scenarios.
A data pipeline consists of two main steps. The first step is the ETL pipeline, in which data is moved into a data warehouse from any source data system via a process known as Extract, Transform, and Load (ETL). This step is critical because the data that resides in data warehouses must follow the schemas and use the file formats chosen by the data warehouse architect to fulfill the business’ needs. This means that datasets from different sources, each containing different file structures, may need to be processed in different ways to ensure that they fit the data warehouse’s schema.
The second step in the process is data consumer usage. In this step, data stored in a data warehouse is accessed by end data consumers. This process usually involves Business Intelligence (BI) tools or other integrations. Business analysts use these tools or integrations to query and display data and to create reports and derive insights. These insights are then used by business analysts and others to answer business questions, improve decision-making, and predict future trends.
In the image below, we explore the way a data pipeline is used with a data warehouse.
The data entered into a data warehouse can be extracted from multiple sources. Most frequently, the data comes from a transactional or operational database where data is captured; for example, data might come from Internet of Things (IoT) devices, mobile applications, or online systems. This is made possible by the data model, which defines how data is organized and structured within the warehouse.
The extract step copies data from the source systems and moves it to the staging layer; the data moved into the staging layer is usually a copy of the original data.
The staging layer (or area) is where the copied data resides before it is transformed and loaded into the data warehouse. The staging layer is important because transformations run against copies rather than the source systems, and data can be reprocessed without being extracted again.
Data is transformed so that it fits the data warehouse’s pre-defined schema, which is based on business needs and the tradeoff between query time and load time.
In some cases, data might only be copied. Other transformations might include cleansing, checking for data quality, filtering, or aggregating data. Once the data has undergone all required transformations, it is loaded into the data warehouse.
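A deliberately simplified ETL sketch in Python makes the shape of the process visible; the source records, formats, and target schema below are all hypothetical, and real pipelines use dedicated tooling:

```python
from datetime import datetime

# Extract: copy raw records out of a (hypothetical) source system.
source_records = [
    {"sale_id": 101, "sold_on": "05/09/2023", "amount": "19.99"},
    {"sale_id": 102, "sold_on": "05/10/2023", "amount": "5.00"},
]
staging = [dict(record) for record in source_records]  # staging holds copies

# Transform: cleanse and conform the copies to the warehouse's
# pre-defined schema (schema-on-write): ISO dates, numeric amounts.
transformed = [
    {
        "sale_id":   r["sale_id"],
        "sale_date": datetime.strptime(r["sold_on"], "%m/%d/%Y").date().isoformat(),
        "amount":    float(r["amount"]),
    }
    for r in staging
]

# Load: write the conformed rows into the warehouse table.
warehouse_sales = []
warehouse_sales.extend(transformed)
print(warehouse_sales)  # [{'sale_id': 101, 'sale_date': '2023-05-09', ...}, ...]
```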
ELT is similar to ETL, but the transformation step is performed after the data has been loaded into the target system. This approach can be more efficient as it can use the processing power of the target system to perform the transformation and also allows for the parallelization of data processing tasks, significantly speeding up processing times.
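In an ELT sketch of the same hypothetical data, the raw copies are loaded first and the transformation runs inside the target system, here with sqlite3 standing in for the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: raw, untransformed records land in the target system.
conn.execute("CREATE TABLE raw_sales (sale_id INTEGER, sold_on TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)",
                 [(101, "05/09/2023", "19.99"), (102, "05/10/2023", "5.00")])

# Transform afterwards, using the target's own processing power (SQL):
# rewrite MM/DD/YYYY dates as ISO strings and cast amounts to numbers.
conn.execute("""
    CREATE TABLE sales AS
    SELECT sale_id,
           substr(sold_on, 7, 4) || '-' || substr(sold_on, 1, 2)
               || '-' || substr(sold_on, 4, 2) AS sale_date,
           CAST(amount AS REAL) AS amount
    FROM raw_sales
""")
print(conn.execute("SELECT * FROM sales").fetchall())
# [(101, '2023-05-09', 19.99), (102, '2023-05-10', 5.0)]
```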
Once data has been transformed and structured according to predefined schemas, it is stored in the data warehouse.
This layer makes data easier for data consumers to access by translating the names and labels of data in source systems into business-friendly language. Implementation of this layer is flexible and is often achieved through a combination of approaches. This layer may also contain common queries and further transformations that make the data more directly usable.
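One way part of a semantic layer can be implemented is with database views that rename terse source columns into business-friendly terms; a minimal sketch with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A warehouse table with terse, source-system naming (hypothetical).
conn.execute("CREATE TABLE f_pmt (cst_grp_k INTEGER, pct_fp REAL)")
conn.executemany("INSERT INTO f_pmt VALUES (?, ?)", [(1, 0.52), (2, 0.71)])

# Semantic layer: a view that exposes business-friendly names and a
# common transformation (a percentage instead of a fraction).
conn.execute("""
    CREATE VIEW credit_card_payments AS
    SELECT cst_grp_k AS customer_group,
           pct_fp * 100 AS full_payment_rate_pct
    FROM f_pmt
""")

# Data consumers query the friendly view rather than the raw table.
print(conn.execute(
    "SELECT customer_group, full_payment_rate_pct FROM credit_card_payments"
).fetchall())  # [(1, 52.0), (2, 71.0)]
```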
Data consumers use BI tools or other integrations to access data via the semantic layer, then use that data to build reports and run analyses.
Data warehouses also often provide APIs and apps that enable ad-hoc querying and reporting, enabling users to explore and analyze data as needed.
Common business intelligence tools include Tableau, PowerBI, ThoughtSpot, and Looker.
Elsewhere, you might see the data pipeline broken down differently: the ETL pipeline and data warehouse are called the bottom tier; the features, like the semantic layer, that make the data warehouse accessible to users are called the middle tier; and the BI tools or integrations used by data consumers are called the top tier.
ETL pipelines vary considerably. Each data warehouse is structured according to a particular business need. However, even at the scale of a single data warehouse, there will be variation in the ETL pipelines. This is because data from different source systems is being combined into the same schema in the data warehouse, and the steps needed to process the data from each source system might be different.
For example, a business might want to know how customers feel about their purchases. Answering this question would likely require data from at least two source systems: customer purchase data from a transactional database that shows what the customer purchased, and survey data that captures how a customer felt about their purchase. To use these sources of data to answer the question, these records need to be transformed and then combined. The way the survey system records the transaction date might differ from the way the transactional system records the date, but all dates in a data warehouse must be stored the same way.
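A hedged sketch of that date-conformance step, with two invented source formats normalized to the warehouse’s single format before the records are combined:

```python
from datetime import datetime

# Hypothetical records about the same purchase from two source systems.
purchase = {"order_id": 7, "date": "2023-05-09"}                # ISO format
survey   = {"order_id": 7, "date": "May 9, 2023", "rating": 4}  # long format

def to_warehouse_date(value: str) -> str:
    """Conform any supported source date format to the warehouse's ISO format."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# After transformation, both records share one date representation and
# can be combined on order_id and date.
print(to_warehouse_date(purchase["date"]))  # 2023-05-09
print(to_warehouse_date(survey["date"]))    # 2023-05-09
```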
An enterprise data warehouse consolidates data from across an organization. Often, though, data is analyzed at the department level. For example, within one organization, Human Resources (HR), User Experience (UX), and Marketing might each employ analysts who specialize in answering questions relevant to their department. Each of these departments has a specific business purpose and set of queries they might run to create the reports and analyses they need.
Data marts provide a solution to this problem. Some departments may build and operate their own data marts independently from the enterprise data warehouse. Alternatively, they might segment the data from a data warehouse into a data mart that holds data specific to a particular department. Data may even be transformed as it is moved from the data warehouse to the data mart so that it can be used to efficiently answer that department’s questions.
Data marts can increase the flexibility of how data in a data warehouse is used but can also increase administration costs and complexity. When organizations use data marts, they can exist between the data warehouse and semantic layer, or they can operate separately and have their own data pipeline.
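Segmenting warehouse data into a department-level mart can be as simple as a filtered, re-aggregated copy; a sketch with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Enterprise warehouse table (hypothetical) spanning many departments.
conn.execute("CREATE TABLE spend_facts (department TEXT, month TEXT, spend REAL)")
conn.executemany("INSERT INTO spend_facts VALUES (?, ?, ?)", [
    ("Marketing", "2023-01", 12000.0),
    ("HR",        "2023-01",  3000.0),
    ("Marketing", "2023-02", 15000.0),
])

# Data mart: a department-specific slice, reshaped for the queries
# Marketing runs most often.
conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT month, SUM(spend) AS total_spend
    FROM spend_facts
    WHERE department = 'Marketing'
    GROUP BY month
""")
print(conn.execute("SELECT * FROM marketing_mart").fetchall())
# [('2023-01', 12000.0), ('2023-02', 15000.0)]
```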
Data warehouses are not tied to any particular hardware or software. This means they can be implemented in a wide variety of ways. Here, we explore several different implementation styles, the pros and cons of each, and several examples of data warehouse vendors in the following sections:
Historically, data warehouses were implemented through hardware on-site at a company. This is known as on-premises and is frequently shortened to on-prem. Over the last decade, however, cloud implementations of data warehouses have evolved to become the dominant model. Organizations can choose to implement their data warehouse on-prem or in the cloud. Alternatively, they can use a hybrid approach.
Each implementation style has advantages and disadvantages. The implementation that a company chooses requires them to weigh a number of factors against each other. Although the factors an organization cares most about will differ among organizations, common considerations include speed, control, scalability, reliability, security and governance, and cost.
The table below outlines some of the advantages and disadvantages of on-prem and cloud data warehouse installations. The comparison looks at the six specific factors listed above. Read through the table. Can you see any points where one installation type or another might be more advantageous in your organization?
Factors that influence where to implement a data warehouse

| Factor | Why it matters | On-premises advantages | Cloud advantages |
| --- | --- | --- | --- |
| Speed | Requires a balance between setup time and time-to-insights once setup is complete | Quicker to obtain insights if a company is in one location | Quicker setup time because no hardware needs to be set up and fewer team members to train; quicker to obtain insights if a company is spread out and needs to transfer data among locations |
| Control | Decisions about setup, implementation, and access | Company has complete control | Some decisions are left to the cloud vendor and may not be adjustable |
| Scalability | The needs of a business change over time, seasonally, and as a company grows | No advantages over cloud data warehouses | Easier to scale up and down because no hardware is required |
| Reliability | Backing up and accessing data is critical; how often will maintenance be required? | Depends on your team | Depends on the cloud provider; some level of inherent backups or disaster recovery |
| Security and governance | Handling data, especially personally identifiable data, requires care and regulatory compliance | With a strong data access policy, on-prem is most secure; some legal or contractual requirements may not allow for cloud providers | Cloud vendors have security guarantees and can restrict access to employees |
| Cost | Whether you pay to implement your data warehouse upfront (CapEx) or as-you-go (OpEx) depends on your implementation and affects your company’s balance sheet and tax strategy | Avoids annual costs from cloud vendors; may be cheaper over time if resource procurement is carefully managed; no cost to query your own data, and your data belongs to you | Lower upfront costs because you don’t need to pay for infrastructure; potentially lower ongoing employee costs because you don’t need staff with the skills to maintain and administer on-prem data warehouses; allows you to scale storage and server usage up and down when needed, which can lower costs if done correctly; no need to buy hardware that is used only during peak times |
While these factors are important to consider, tools like Starburst can work with a wide variety of setups to help you minimize the disadvantages and capitalize on the advantages of either setup. For example, if you have an on-prem data warehouse, Starburst allows you to easily query it alongside your other data, no matter where that data is located.
Additionally, under cloud vendors’ pricing models, using and storing large amounts of data can become very expensive. When your data analytics strategy becomes cost-prohibitive, Starburst can help alleviate that burden by querying your data efficiently and separating compute costs from storage costs.
These days, most data warehouse vendors that were originally on-prem-only have added cloud offerings; these include Teradata, Oracle, and Microsoft SQL Server. Some prominent cloud data warehouse vendors include Snowflake, Amazon Redshift, and Google BigQuery.
As you’ve seen, data warehouses are an excellent solution to certain business problems and have a place in the world of data analytics even with artificial intelligence and machine learning tools. Essentially, data warehouses are most useful for housing data that has a clear and specific purpose.
What’s more, data comes in many forms and is used in many ways. For this reason, data warehouses are unlikely to serve all of the data storage needs of a business, given their limitations with file types and their inflexibility when it comes to adding new data types, real-time data, or raw data to the data warehouse. In particular, many businesses may begin collecting some types of data before they know the business purpose those data will serve.
Overall, data warehouses may be most useful when used in conjunction with other data storage solutions, such as data lakes, which can store non-standard and unstructured data without transforming it first.
One challenge of storing data in multiple types of storage is bringing the data together when both stores are needed for an analysis. Solutions like Starburst help bridge disparate data sets and simplify the work of data scientists and data engineers. Starburst works in conjunction with both data warehouses and data lakes, letting you query data from any location using SQL. This helps data consumers and business users overcome these barriers and derive accurate, data-driven insights.