Data virtualization revolutionized the data infrastructure space by serving data consumers directly on top of existing data stores, without the need to move data elsewhere. At its core, data virtualization was born to deliver flexibility and faster time-to-market to data consumers, mitigating many of the challenges that slow down data teams while making it easy to run any query at any time.
But data virtualization was not originally designed to deliver widespread enterprise performance, and it falls short of meeting the expectations of use cases that require consistent response times. Data teams often try to optimize for this, but doing so demands extensive and complex data ops.
Federated governance also turns out to be a significant challenge: inefficient brute-force scans and compute scaling result in an unpredictable cost structure that tends to spiral out of control quickly.
Recently, however, a new standard for data virtualization has been emerging.
To fully operationalize and monetize the data lake, and to avoid moving data into separate data silos, data virtualization must deliver interactive performance across numerous sources, significantly reduce the data ops required to balance price and performance, and provide the reliability that data pipeline SLAs demand.
Tackling legacy data virtualization performance issues
Unfortunately, the early data federation systems never entirely lived up to their promise. It turned out that building them was much harder than anybody expected.
Here are the primary reasons why:
Innovation in data infrastructure is driven by the need for simplicity and agility
Agility, ease of use, and performance are the driving forces of the data infrastructure space. Companies are shifting their attention to supporting high-velocity development, placing fast time-to-market and maximum flexibility among their top priorities.
This new approach requires decoupling data consumption from its preparation. The ultimate flexibility can only be achieved when you don’t need to move or prepare data at all (beyond the basic ETLs of course).
To ensure predictable and consistent performance, many enterprises compromise on accessing all their available data and settle for isolated data silos that have been prepared and modeled to enable speedy queries.
But the best platforms should automatically accelerate queries according to workload behavior. Data teams should have the ability to define business priorities and adjust performance and budgets accordingly. This will enable them to serve a wide range of use cases on a single data platform and directly on the data lake, with or without a data warehouse, and eliminate the need to build separate silos for each use case.
Trino optimizes for cost & performance
One of the early query engines developed to support high-performance data virtualization is Trino. A distributed query engine with support for a wide range of data lake platforms, Trino gives data teams the ultimate versatility. It also delivers the core benefits of data virtualization: no data duplication, centralized access controls for administrators, and a shared catalog that makes collaboration easier.
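For example, a single Trino query can join a fact table on the data lake with a dimension table that lives in an operational database, without copying either one. In this sketch, the catalog, schema, and table names (`lake`, `pg`, `sales.orders`, `crm.customers`) are hypothetical:

```sql
-- Hypothetical catalogs: "lake" (a data lake) and "pg" (a PostgreSQL database)
SELECT
    c.customer_id,
    c.region,
    sum(o.total_amount) AS lifetime_value
FROM lake.sales.orders AS o
JOIN pg.crm.customers AS c
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region
ORDER BY lifetime_value DESC
LIMIT 10;
```

From the consumer's perspective this is one ordinary SQL statement; the fact that the two tables live in entirely different systems is handled by the engine.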
Trino stands apart from other solutions because of its broad connector support, deep extensibility, and standard ANSI SQL.
Under the covers, Trino processes queries in memory without relying on slow intermediate disk-based storage. The in-memory pipelined distributed execution engine ensures highly efficient query processing. Integrated with the in-memory pipeline processing is a cost-based optimizer. Just like a data warehouse, the query optimizer evaluates different execution plans for each query and chooses the best one based on data statistics and available resources, reducing overall processing time.
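As a sketch of how this works in practice, Trino exposes statistics collection and plan inspection directly through SQL (the `lake.sales.orders` table below is hypothetical):

```sql
-- Collect table and column statistics for the cost-based optimizer
ANALYZE lake.sales.orders;

-- Inspect the statistics the optimizer consults
-- (row counts, distinct values, data sizes)
SHOW STATS FOR lake.sales.orders;

-- Review the execution plan the optimizer chose,
-- including join order and join strategy
EXPLAIN
SELECT customer_id, count(*)
FROM lake.sales.orders
GROUP BY customer_id;
```

Keeping statistics fresh with `ANALYZE` is what lets the optimizer pick good join orders and distribution strategies rather than falling back on defaults.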
Trino includes dynamic filtering, which accelerates join performance by reducing the amount of data that queries need to process when joining tables. It also includes support for full SQL passthrough to several connectors, including major RDBMS stores.
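Dynamic filtering is applied automatically (it can be toggled with the `enable_dynamic_filtering` session property), while full SQL passthrough is available through a connector's `query` table function. A sketch, using a hypothetical `pg` PostgreSQL catalog:

```sql
-- Dynamic filtering is on by default; it can be toggled per session
SET SESSION enable_dynamic_filtering = true;

-- Push an entire query verbatim to the underlying PostgreSQL database,
-- so it executes there with the source's native SQL dialect
SELECT *
FROM TABLE(
    pg.system.query(
        query => 'SELECT customer_id, region FROM crm.customers WHERE active'
    )
);
```

Passthrough is useful when a source database has capabilities (or dialect-specific functions) that are cheaper to evaluate at the source than after pulling rows into the engine.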
Powerful query engine
Though most data virtualization solutions are able to read any type of data, all of the in-memory processing is optimized around a columnar architecture, ideal for analytic queries. Combined with data sources that are stored in columnar optimized formats, platforms can optimize query execution by reading only the fields that are required for any individual query.
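To illustrate, when a table is stored in a columnar format such as Parquet, a query that touches only a few fields reads only those column chunks from storage. Catalog and table names here are hypothetical:

```sql
-- Store the table in a columnar format on the lake
CREATE TABLE lake.web.events (
    event_time timestamp,
    user_id    bigint,
    event_type varchar,
    payload    varchar
)
WITH (format = 'PARQUET');

-- Only the user_id and event_time columns are read from storage;
-- the wide payload column is never touched
SELECT user_id, count(*) AS event_count
FROM lake.web.events
WHERE event_time >= date '2024-01-01'
GROUP BY user_id;
```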
Because these advanced query processing capabilities are exposed through standard ANSI SQL and JDBC connectors, it is easy to see why data virtualization solutions have become extremely popular.
Future-proof your architecture with modern data virtualization
Legacy data virtualization vendors paved the way for the data federation industry. Like many Gen1 solutions, however, they were built decades ago with architectures not designed to scale to the data volumes, performance, and concurrency required to deliver true data federation in today's enterprises. Modern data virtualization platforms built on Trino, by contrast, deliver:
- 10–100x faster query performance than other MPP engines
- One-third the compute resources compared to Hive, Spark, and Impala
- Zero reliance on source data systems to perform joins, with the flexibility to push down processing where it makes sense to optimize performance
- Standard ANSI SQL no matter where the data originates
- Proven at 1,000+ node and 100+ PB scale
- Performant ground-to-cloud, multi-cloud, and multi-region analytics on data lakes with Starburst Stargate
- No vendor lock-in to underlying data sources, providing storage optionality
Today, the data mesh approach to managing data within an organization continues to take off, shifting the emphasis towards greater agility in data management and away from centralized control. Data warehousing is going out of style, and data is becoming even more distributed across organizations. The need for federated data management is greater than ever.