Why you should run frequent data architecture audits
Evan Smith
Technical Content Manager
Starburst Data


When was the last time your data engineers and data architects took a deep, hard look at your data architecture? With many organizations moving to implement Artificial Intelligence (AI) workflows, there’s never been a better time to review your entire data architecture, whether that includes analytics, data applications, or AI.
Why inertia sometimes prevails in data architecture
What’s holding you back? Likely the usual suspects. For example, you may not have felt the need to do a full data architecture audit recently because what you have “works well enough.” However, that likely means you’re not getting the full return on your investment. You may even have dozens (or hundreds) of older data pipelines that are quietly costing you money. It also probably means you’re not setting your current data architecture up for the AI revolution ahead.
Let’s be honest. Sometimes, we’re afraid to look. You might already suspect (or know) that you have room for improvement. You might also worry that going down this road will become a time and cost sinkhole. (We’ve been there.)
Data architecture audits are healthy
It doesn’t have to be that way. Done correctly, data architecture audits can help you lay the foundation for a flexible approach to data integration that changes as your business changes, without starting from scratch. They can take what you have and position you for success today and in the future, both for analytics and AI.
In this article, we’ll look at why you should run data architecture audits, how to move towards a more flexible data architecture, and what opportunities you should look for while conducting one.
Why run a data architecture audit?
Many of today’s data systems were designed for older use cases. That’s left teams attempting to adapt these old systems to more modern business intelligence use cases for which they were never designed.
For example, data engineers may be trying to support real-time analytics on a platform initially designed for batch data processing. Or your data scientists may be working with complex data models that contain semi-structured data but using a data architecture tuned to work primarily with highly structured (relational) data in tables.
In software systems, the Agile development methodology encourages flexibility and pragmatism towards software changes, identifying and shipping small improvements that address pressing business goals. A data architecture audit can bring an Agile mindset to data systems by identifying:
- Gaps between your current use cases and your architecture
- Choke points for existing workloads that appear only at scale (growth in users, data volume/velocity)
Benefits of a data architecture audit
Addressing issues identified by a data architecture audit can bring a range of benefits, including:
- Fresher, higher-quality analytics reports and data-driven applications that support faster and more accurate decision-making for business stakeholders.
- Support for more modern workloads, such as training AI or machine learning models, prepping contextual data sets for GenAI apps, and real-time scenarios like fraud detection, which can generate new revenue and increase employee productivity.
- Cost savings from optimizing underperforming queries, reducing the compute required for existing queries, and managing data assets with tools such as APIs and automation (see the sketch after this list).
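To make the query-optimization item concrete, here is a minimal sketch, assuming the open-source trino Python client and a reachable Trino (or Starburst) coordinator; the host, catalog, schema, and table names are placeholders rather than anything from this article. Pulling the plan for a suspect query with EXPLAIN is one simple way to start an audit of underperforming queries.

```python
# Sketch: inspect the plan of a suspect query with EXPLAIN before deciding how
# to optimize it. Assumes the open-source `trino` client (pip install trino);
# the coordinator host, catalog, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # placeholder coordinator
    port=8080,
    user="audit_user",
    catalog="hive",            # placeholder legacy catalog
    schema="sales",
)
cur = conn.cursor()

# EXPLAIN returns the distributed plan as text; scanning it for full table
# scans or oversized exchanges is one way to flag expensive queries.
cur.execute("""
    EXPLAIN
    SELECT region, sum(amount)
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY region
""")
for (plan_line,) in cur.fetchall():
    print(plan_line)
```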
Laying the groundwork with a flexible data architecture
Sounds good, but how do you start? We’re aware that, while this might sound appealing in theory, approaching your data architecture and data models with an “Agile” mindset can be challenging in practice.
You may be dissuaded from considering major architectural improvements using newer technology due to the time and cost involved. That’s understandable. When most engineers hear “architectural change,” they automatically think of a ground-up rework.
If you’ve been in the industry long enough, you’ve likely seen more than one project spin its wheels forever before ending in failure. When data centralization was all the rage, many centralization projects either got bogged down in cross-departmental conflicts or ran over time and budget due to technical challenges.
To avoid this and begin moving towards a more Agile mindset, you need technologies that support the shift. These include:
- Support for modern data use cases via object storage and metadata-rich open table formats, such as Iceberg, Hudi, and Delta Lake
- The ability to utilize existing systems, data models, data flows, data sources, and data assets via federation (see the sketch after this list)
- Adopting a modern approach to security and data governance using state-of-the-art security tooling and automation throughout your data lifecycle
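As a concrete illustration of the federation item above, here is a minimal sketch, assuming the open-source trino Python client and two catalogs already configured on the cluster (an Iceberg catalog and a PostgreSQL catalog, both with hypothetical names). A single query joins data across both systems without moving anything.

```python
# Sketch: federated query joining an Iceberg table with a table in an existing
# operational database. Assumes the `trino` client and two preconfigured
# catalogs; all catalog, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # placeholder coordinator
    port=8080,
    user="data_engineer",
)
cur = conn.cursor()

# One SQL statement spans both systems; the engine performs the join, so no
# pipeline is needed to copy data into a single store first.
cur.execute("""
    SELECT c.customer_id, c.segment, sum(e.amount) AS total_spend
    FROM iceberg.analytics.events AS e        -- lakehouse table
    JOIN postgresql.crm.customers AS c        -- existing operational system
      ON e.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")
for row in cur.fetchall():
    print(row)
```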
Optimizing data architecture for selective migration decisions
With these tools, you can make selective migration decisions. In other words, you can port workloads that require a more modern data architecture into formats like Iceberg while still enabling queries and joins against existing systems in your data estate.
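Here is a hedged sketch of what such a selective migration can look like in practice: a single CREATE TABLE ... AS SELECT that lands one workload’s data in an Iceberg table while the rest of the estate stays where it is. All names and the partition spec are illustrative assumptions, not prescriptions.

```python
# Sketch: selectively migrate one workload's data into an Iceberg table via
# CTAS, leaving the source system untouched and still queryable through
# federation. Assumes the `trino` client; all names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="data_engineer"
)
cur = conn.cursor()

# Create the Iceberg copy for the workload that needs modern table features.
# The partitioning property is just an example spec for this hypothetical table.
cur.execute("""
    CREATE TABLE iceberg.analytics.orders_iceberg
    WITH (partitioning = ARRAY['month(order_date)'])
    AS SELECT * FROM hive.warehouse.orders
""")
cur.fetchall()  # CTAS returns a row count; fetching completes the statement

# Downstream queries can still join the new table against systems that were
# not migrated, so the change stays incremental rather than all-or-nothing.
```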
An open data lakehouse is an approach to data management that enables this open-ended architectural evolution. It pairs a SQL query engine like Trino, which delivers fast, warehouse-like analytics across different data formats and sources, with open table formats like Iceberg, which handle more data types and provide better performance and data governance for modern data problems.
With an open data lakehouse, you can escape the “all-or-nothing” mentality that either locks teams into their current data architecture or results in the creation of data silos. The lakehouse doesn’t require a mass migration. It merely requires adding a few tools to your organization’s data stack to give it additional flexibility.
What to look for in a data architecture audit
With an open data lakehouse at your disposal, you can address latent architectural issues in a more Agile manner. We suggest, at a minimum, looking at the following areas in your audit:
- Performance improvements
- Total cost of ownership
- Data security and compliance issues
Let’s examine each area and how to leverage an open data lakehouse to improve your existing data architecture in each case.
Performance improvements
One common cause of performance hiccups in modern enterprise data architectures is a mismatch between your data velocity and your current approach to data management.
Your architecture might be too slow for your data if it was designed to handle only highly structured data sets in a data warehouse that’s updated periodically. It can’t keep up with data that’s updated frequently and whose schema may be constantly evolving.
Your data velocity might also outpace a system that can handle these scenarios in principle but is unable to scale to meet rapidly rising demand.
In both of these cases, moving the workload into an open data lakehouse can result in considerable performance improvements. An open data lakehouse can handle petabyte-scale data at high velocity, managing both semi-structured and unstructured data better than legacy solutions such as a Hive-based data warehouse.
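As a small illustration of how an open table format absorbs constantly evolving schemas, the sketch below adds a column to an Iceberg table in place, assuming the trino Python client and hypothetical table and column names.

```python
# Sketch: evolve an Iceberg table's schema in place as upstream data changes.
# Assumes the `trino` client and an Iceberg catalog named `iceberg`; the table
# and column names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="data_engineer"
)
cur = conn.cursor()

# Iceberg tracks schema changes in table metadata, so adding a column does not
# rewrite existing data files.
cur.execute("""
    ALTER TABLE iceberg.analytics.events
    ADD COLUMN device_type varchar
""")
cur.fetchall()

# New rows can populate the column immediately; existing rows read it as NULL.
```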
Total cost of ownership
How much does it cost you to maintain your existing use cases? The cost of a workload comprises a number of factors, including:
- Cloud spending on compute and data storage.
- In-house administrative costs, including personnel required to monitor and maintain the solution.
- Licensing costs.
- Hardware costs, including both capital expenses and operating expenses for on-premises workloads.
Dealing with legacy systems
For many legacy systems, ongoing administrative costs can add up. Moving services with a high administrative workload from older systems, such as a finicky Hadoop cluster, to an open data lakehouse run as a managed service can reduce the total cost of ownership. Data lakehouse formats also make more efficient use of compute resources, resulting in a lower overall solution cost.
An open data lakehouse also provides tools you can use to control costs by managing workloads to optimize price/performance tradeoffs, using techniques such as:
- Workload monitoring
- Query routing
- Caching of frequently accessed data (see the sketch after this list)
- Enhanced autoscaling automation
- APIs and other tools for workload automation
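One way the caching item above can play out, sketched with hypothetical names: define a materialized view over a hot aggregation so repeated dashboard queries read precomputed results instead of re-scanning raw data. Trino supports materialized views for Iceberg, though refresh behavior depends on your deployment, so treat this as a sketch rather than a recipe.

```python
# Sketch: precompute a frequently accessed aggregation as a materialized view
# so repeated queries avoid re-scanning raw data. Assumes the `trino` client
# and an Iceberg catalog; view, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="data_engineer"
)
cur = conn.cursor()

# Define the view once over the hot query shape.
cur.execute("""
    CREATE MATERIALIZED VIEW iceberg.analytics.daily_revenue AS
    SELECT order_date, region, sum(amount) AS revenue
    FROM iceberg.analytics.orders_iceberg
    GROUP BY order_date, region
""")
cur.fetchall()

# Refresh on a schedule, or after loads, to keep the cached results current.
cur.execute("REFRESH MATERIALIZED VIEW iceberg.analytics.daily_revenue")
cur.fetchall()
```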
Data security and compliance issues
An older data architecture focused primarily on relational data might be harder to govern if it predates modern data regulations such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA). Solutions such as data warehouses, which are focused on structured data, may not provide the tools you need to properly secure other data structures.
Moving workloads that require an additional level of data security monitoring into a more modern solution, such as the open table formats supported by an open data lakehouse, makes it easier to keep sensitive data out of the wrong hands. Open data lakehouses can offer mechanisms such as role-based access control (RBAC) and attribute-based access control (ABAC) that give users access to only the data they need based on their job functions.
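To make the RBAC point concrete, here is a hedged sketch using SQL-standard roles and grants issued through the same client. Whether these statements take effect depends on the catalog and the access-control configuration in use, and all names are hypothetical.

```python
# Sketch: role-based access control expressed as SQL-standard roles and grants.
# Whether these statements are honored depends on the catalog's access-control
# configuration; all names are hypothetical. Assumes the `trino` client.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="data_admin"
)
cur = conn.cursor()

statements = [
    # A role scoped to the analyst job function.
    "CREATE ROLE analyst IN iceberg",
    # Grant read access to only the data that function needs.
    "GRANT SELECT ON iceberg.analytics.daily_revenue TO ROLE analyst",
    # Assign the role to a user (or map it to a group in your identity system).
    "GRANT analyst TO USER alice IN iceberg",
]
for sql in statements:
    cur.execute(sql)
    cur.fetchall()
```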
Open data lakehouses also provide the flexibility to support your data governance needs. For example, you can control data access in countries with strict data sovereignty laws.
Because of its flexibility, you can change the way you manage your enterprise data in your open data lakehouse as applicable laws and regulations change. This is especially important for artificial intelligence (AI) workloads, given that the legal frameworks around generative AI apps are still evolving.
Starburst: The flexible, open data lakehouse
The data world is rapidly evolving. Your approach to data management needs to evolve along with it. This includes all data workflows: analytics, data applications, and AI.
With the right open data lakehouse, you no longer have to be stuck in the past. You can improve your existing approach to data integration in an Agile and incremental fashion while simultaneously future-proofing it against the unknown.
Starburst provides an open data lakehouse based on an Icehouse architecture—a combination of Trino and Apache Iceberg. It addresses your needs for architectural flexibility by supplying:
- Petabyte scalability for your most demanding workloads.
- An architecture built for scale that moves as your business moves and evolves alongside your data strategy.
- Easy deployment from anywhere.
- Key security and performance improvements built on top of the already stellar Trino.
Want to see how Starburst can perform on your most demanding modern workloads? Try it for free today.