

Building a single foundation for analytics and AI
  • Evan Smith

    Technical Content Manager

    Starburst Data


The quality of your data architecture has a direct impact on your business. An open, flexible architecture enables you to leverage a single, unified platform serving data analytics, data applications, and AI. On the other hand, a closed, brittle architecture can’t pivot quickly, leaving you stuck trying to solve tomorrow’s problems with yesterday’s technology.

The future is uncertain, and so are your future data needs. Businesses need a modern, flexible data architecture that can serve today’s top analytics, data application, and AI use cases, as well as tomorrow’s use cases (whatever they may be).

This architecture should also be capable of evolving over time to meet new needs as they emerge. This is particularly true for AI use cases, where the art of the possible is evolving rapidly. 

In this article, we’ll look at the fundamentals of data architecture and how it impacts your business. We’ll then discuss how you can shift towards a flexible data architecture that evolves in tandem with your business, and how this helps you manage risk and prepare for the future of data, including AI. 

The fundamentals of a good data architecture

High-quality, up-to-date data is the lifeblood of your business, enabling fast and accurate decision-making in response to shifting trends. Achieving this requires a data architecture that supports you in several key ways: data access, data collaboration, and data governance. However, not all data architectures are easy to use or maintain. That’s why crafting a comprehensive data strategy to guide your architectural decisions is critical to long-term scalability and growth.

Let’s look at each of these areas in detail before exploring the technology required in each case. 

Data architecture: Access

How data silos negatively impact your data landscape

Data access has historically been one of the largest blockers to success with new data-driven projects. Corporate data is distributed across multiple data sources in different clouds, SaaS apps, and on-premises servers. To make matters worse, data is often stored in multiple systems, under different schemas, and in different formats. All of this adds to the complexity of leveraging your data for new initiatives.

The result is that a company ends up dotted with data silos – islands of standalone data drawn from across the company. This can frustrate attempts to integrate data and make it available.

Why automatic data centralization isn’t the answer

Data centralization is powerful, but using it by default causes problems. For example, say that marketing wants to run a campaign that combines data from its CRM system with the company’s product catalog. If one team stores its data in a data warehouse while the other stores it under a different schema in a data lake, gaining insights across those data sources can quickly drain technical resources.

Historically, the solution to this problem has been to consolidate all of your organization’s data in a single location, usually a data warehouse. This drive to centralize everything often fails, however, as it requires significant time and money to implement correctly. 

Bridging the gaps with data connectors

A more flexible solution is to use a robust set of data connectors to bring together data from disparate sources. Data connectors enable a new level of flexibility in organizational data access:

  • Data engineering teams can use connectors to integrate data from various data sources via data pipelines 
  • Engineers, analysts, and business users can use a single point of entry to find data and query it via the connector

This enables a more flexible approach to centralization. Instead of centralizing everything indiscriminately, focus on centralizing only the critical datasets that require standardization and lightning-fast performance. This helps your organization to adopt a centralized, decentralized, or hybrid architecture that suits its specific data needs. 
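
As a minimal sketch of what this looks like in practice, the following Trino-style SQL joins CRM data in a warehouse with a product catalog in a data lake through connector-backed catalogs. All catalog, schema, and table names are hypothetical.

    -- Join CRM data in a warehouse with the product catalog in a data lake,
    -- in one query through connector-backed catalogs (names are hypothetical).
    SELECT p.product_name,
           COUNT(*) AS campaign_touches
    FROM warehouse.crm.campaign_contacts AS c
    JOIN lake.catalog.products AS p
      ON c.product_id = p.product_id
    GROUP BY p.product_name
    ORDER BY campaign_touches DESC;

Because both catalogs are exposed through the same SQL entry point, no pipeline has to copy the data first.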

Data architecture: Collaboration

Collaboration means that all data stakeholders—both producers and consumers—can easily find and use the data they need. Sadly, it’s not always that easy. There are several reasons for this. 

Technical obstacles to collaboration

Obstacles to collaboration are typically both technical and organizational in nature. Sharing data collaboratively often requires specialized technical skills, creating bottlenecks within your organization. 

Luckily, today’s data architecture platforms make it much easier for anyone in the organization with knowledge of SQL to locate and use data effectively. AI assistants and natural language query interfaces are rapidly eliminating even that requirement. That’s leading to an increased push for data democratization, where any data stakeholder can access and use production-ready data.

Managing data context from A to Z

But merely finding data is not enough. You also need to know its context – who owns it, what it does, where it comes from, and whether you can trust it. To that end, there are a few key architectural components that enable collaboration and self-service access to data:

  • Data products are curated, packaged datasets that can be published, discovered, and used across teams. While often created initially for one team’s use, data products are designed for consumption at scale, which increases both data collaboration and data accountability (see the sketch after this list). 
  • Data lineage shows the journey that data takes as it flows across your data estate. Using lineage, would-be data consumers can trace back the source of data, giving them increased confidence in its accuracy and veracity. 
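
To make the data product idea concrete, here is a minimal sketch of a curated dataset published as a view in Trino-style SQL, so other teams can discover and query it. The catalog, schema, and table names are assumptions for illustration.

    -- Publish a curated, documented dataset that other teams can discover
    -- and query. Catalog, schema, and table names are hypothetical.
    CREATE VIEW lake.data_products.customer_orders AS
    SELECT c.customer_id,
           c.region,
           o.order_id,
           o.order_total,
           o.ordered_at
    FROM lake.crm.customers AS c
    JOIN lake.sales.orders AS o
      ON c.customer_id = o.customer_id
    WHERE o.status = 'COMPLETE';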

Mind the organizational lift, too

Even with these tools, businesses still face organizational challenges when it comes to collaboration. Departments might be reluctant to share data. Teams may be hesitant to learn new tools. A significant part of enabling collaboration is educating data stakeholders on the available tools, as well as actively promoting the benefits of involving all stakeholders early in the data lifecycle to gather requirements and develop the appropriate product.

Data architecture: Governance

Good data governance involves monitoring the security, compliance, and quality of your data across your data estate. This becomes challenging in a distributed data architecture, as each system has its own unique security and compliance features. 

RBAC and SSO to the rescue

One of the best ways to govern access to your data is to use access controls, like role-based access control (RBAC) or attribute-based access control (ABAC). Together, these controls limit the permissions associated with a dataset, creating ground rules to govern access. Additionally, Single Sign-On (SSO) uses a single identity to facilitate access across systems, allowing access to be provisioned and deprovisioned centrally. 
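
As a minimal sketch, RBAC in a SQL engine such as Trino can look like the following. The role, user, and table names are hypothetical.

    -- Create a role, grant it read access to one dataset, and assign it
    -- to a user. Role, user, and table names below are hypothetical.
    CREATE ROLE marketing_analyst;
    GRANT SELECT ON lake.crm.campaign_contacts TO ROLE marketing_analyst;
    GRANT marketing_analyst TO USER alice;

Centralizing grants like these on roles, rather than individuals, is what makes access easy to provision and deprovision as people change teams.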

Simplified governance 

Besides making data easier to find and use, data products also simplify governance. Data producers can package high-quality, well-governed, and documented datasets into discoverable products, which data consumers can use as the building blocks for their data-driven solutions. This vastly reduces the work involved in hunting down and verifying data. 

The evolution of data architecture

Over time, the industry has evolved new data architectures to meet these needs as the demand for high-quality data has grown. This has been driven by the evolution of three data use cases: 

  • Analytics
  • Data applications
  • Artificial Intelligence (AI) and Machine Learning (ML)

Analytics 

Analytics is the original big data use case: using data to generate reports and dashboards that drive day-to-day business decision-making. Historically, it has aimed to integrate data from across an organization while providing a cost-effective way to both store and query that data.

The growing needs within the analytics space led to three waves of solutions:

Data warehouse

Provides cheap, reliable, and high-performance access to highly structured data. This wave was driven by the emergence of cloud data warehouses such as Snowflake and Redshift.

Data lakes

Provide a less expensive option for accessing semi-structured and unstructured data, using cloud object storage to minimize costs. The leading technology driving the data lake has been Apache Hive.

Data lakehouses

Combine the best of data lakes and data warehouses with an open architecture, an improved data governance layer, and better update and schema evolution capabilities.

Data applications

Data applications are independent, custom applications that derive analytical insights from large datasets, often drawn from multiple sources. They use complex processing logic to present data to business users in an intuitive, easy-to-use manner.

Since data applications often require access to unstructured and semi-structured data, they’ve driven the move from data warehouses to data lakes and lakehouses. These applications operate on data stored in cloud object storage, typically in Apache Hive format. 

AI & ML

While traditional AI, such as machine learning, has been around for a while, the current AI boom is being driven by generative AI (GenAI). This approach uses probabilistic neural networks and contextual data to generate new assets, including documents, chat responses, and even audio and video. GenAI opens up a whole new world of use cases, including customer service chatbots, interactive document search, code generation, and intelligent agents. And it’s evolving quickly. 

AI has driven the adoption of data lakehouses, which provide better performance and support for high-velocity data. Critically, data lakehouse formats like Apache Iceberg, Delta Lake, and Hudi provide better data governance capabilities thanks to their rich metadata support. 
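
As a brief illustration of what that metadata support enables, here is a hypothetical sketch of in-place schema evolution on an Apache Iceberg table, written in Trino-style SQL. The catalog, schema, table, and column names are assumptions.

    -- Create an Iceberg table, then evolve its schema in place.
    -- Iceberg records the change in table metadata, so existing readers
    -- and historical snapshots are not broken.
    CREATE TABLE lake.ml.training_events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_time TIMESTAMP(6)
    )
    WITH (format = 'PARQUET');

    ALTER TABLE lake.ml.training_events
        ADD COLUMN model_version VARCHAR;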

How AI changes the data architecture landscape

AI’s probabilistic approach to data processing works best when given access to lots of data. That makes the curation of large volumes of high-quality data more critical than ever. 

Large Language Models (LLMs)

Many emerging GenAI use cases are built around Large Language Models (LLMs). LLMs are great at generating human-like text and answering general questions. However, they don’t know anything about your business. Without relevant contextual information, they can provide out-of-date answers or even return hallucinated results.

Retrieval-augmented generation (RAG)

When it comes to AI, context is king. Retrieval-augmented generation (RAG) has emerged as an effective method for providing context, reducing hallucinations, and improving both the timeliness and accuracy of LLM responses. However, gathering the data required to drive it requires a unified approach to accessing, processing, and governing data across the enterprise. 
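
As a simplified sketch of the retrieval step only, a query like the following could pull fresh, governed context rows to include in an LLM prompt. The table and column names are hypothetical, and in practice the candidate rows would often be produced upstream by a vector similarity search over embeddings.

    -- Retrieve recent, relevant context rows to assemble an LLM prompt.
    -- Table and column names are hypothetical; relevance_score is assumed
    -- to come from an upstream similarity search over embeddings.
    SELECT chunk_text
    FROM lake.ai.candidate_chunks
    WHERE topic = 'refund_policy'
      AND updated_at > current_date - INTERVAL '30' DAY
    ORDER BY relevance_score DESC
    LIMIT 5;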

The best foundation for a comprehensive data architecture

While the data itself and its usage may differ, data for analytics, data applications, and AI share many of the same data management needs in terms of access, collaboration, and governance. What’s needed is a data architecture—a single foundation—that’s fast and flexible enough to serve all three. 

Starburst Icehouse architecture 

This is precisely why Starburst created the Icehouse architecture: an open data lakehouse architecture that works for all use cases, from BI to AI. To do this, the Icehouse architecture leverages two key technologies: 

  • Trino, an open query engine that provides fast SQL access to data sources anywhere in your organization
  • Apache Iceberg, an open table format that provides fast, easily governable, centralized data storage for critical workloads

Starburst is an open data lakehouse built on the Icehouse architecture. Using Starburst, you can fulfill all three of the pillars of a modern data architecture, whether that includes analytics, data applications, or AI.
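
As one hedged sketch of the pattern: Trino can read across federated sources and land the result in an Iceberg table that then serves BI, data applications, and AI retrieval alike. All catalog, schema, and table names below are hypothetical.

    -- Materialize a federated join into an Iceberg table on the lakehouse.
    -- Catalog, schema, and table names are hypothetical.
    CREATE TABLE iceberg.gold.customer_360
    WITH (format = 'PARQUET')
    AS
    SELECT c.customer_id,
           c.segment,
           s.lifetime_value
    FROM warehouse.crm.customers AS c
    JOIN lake.analytics.spend_summary AS s
      ON c.customer_id = s.customer_id;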

Universal access

With an open data lakehouse, you have the flexibility to centralize critical workloads while retaining access to data where it currently lives. Banco Inter, Brazil’s first digital bank, leveraged this capability to save hundreds of thousands of dollars on infrastructure costs while also accelerating time-to-insight from days to seconds. 

Easy collaboration 

Starburst supports creating new data products for better collaboration and governance. Asurion uses Starburst to reduce data quality incidents by over 50%. 

Secure governance, wherever your data lives

Building on this enhanced access and collaboration, companies can use Starburst to scale security and compliance controls across their analytics, data application, and AI initiatives. For example, Vectra used Starburst to scale security scanning with Vectra AI Investigations, improving SLA compliance and freeing up engineering teams to focus on customer-driven features.

Learn more about how Starburst and the Icehouse can power scalability and governance across all of your data use cases – contact us today.
