

Data virtualization pitfalls
Daniel Abadi
Computer Science Professor, University of Maryland, College Park


In our previous post, we discussed how data virtualization can be an important tool when experimenting with Icehouses and also when eventually migrating data to them. In short, experimentation with and migration to new systems both present the same type of problem: a situation in which data is stored temporarily in multiple different systems, yet still may need to be accessed together. For example, some datasets may be stored and managed by an Icehouse, and some may be stored and managed by a legacy data warehouse. Data virtualization hides the complexity of the management of datasets by different systems and provides a unified interface for applications that access data spread out across these different systems. 

In this post, we dig a little deeper into data virtualization architectures and how they accomplish the task of providing a unified interface. In doing so, we will also discuss some performance pitfalls that may arise and how to avoid them.


Push-based vs. pull-based data virtualization architectures

As described in great detail in my O’Reilly book, all existing data virtualization systems can be classified into one of two broad categories: pull-based systems and push-based systems. 

At a high level, pull-based data virtualization systems pull data out from underlying systems and perform the majority of data processing locally. In contrast, push-based data virtualization systems attempt to push down data processing into the underlying systems in which the raw data is stored. Although hybrid architectures are possible, as described in the final chapter of the above-referenced book, modern systems can all be easily classified into one of these two categories.

Tradeoffs between push- and pull-based architectures

Generally speaking, pull-based systems require more data transfer to perform a data analysis task than push-based systems. Pull-based systems need to extract the raw data required for the analysis out of the underlying data systems that store the data, and ship it over the network into the data virtualization system where the analysis task is performed. 

In contrast, push-based systems push down as much of the analysis task as possible to be performed inside the relevant underlying systems. Given that the raw input to an analysis is almost always larger than its output, pushing the analysis down reduces the amount of data that ultimately needs to be sent to the data virtualization system.

When network bandwidth between the underlying systems and the data virtualization system is a bottleneck, the push-based approach may help to alleviate this bottleneck. In truth, even pull-based systems will push some work down to the underlying systems, especially simple filters and aggregates that reduce the amount of raw data that needs to be sent over the network. Nonetheless, pull-based systems will almost always require more data transfer than push-based systems. 
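To make the tradeoff concrete, consider the following sketch (the sales table and its columns are hypothetical):

    -- Hypothetical 'sales' table with roughly a billion rows in the underlying system.
    -- Push-based: the aggregation runs inside the underlying system, so only
    -- one result row per region crosses the network.
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY region;

    -- Pull-based: at best the WHERE filter is pushed down; every matching raw
    -- row is then shipped to the virtualization engine, which performs the
    -- GROUP BY and SUM itself.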

In modern times, the pull-based approach is more frequently used

Surprisingly, most modern data virtualization systems use a pull-based approach, despite the increased data transfer requirements. This is because the push-based approach is extremely difficult to get to work correctly. Since it needs to push down query processing to underlying systems, it needs detailed knowledge of how to use the query API of each system it interfaces with. This significantly increases the complexity of the codebase, increases long-term code maintenance obligations, and decreases compatibility with new storage systems. Furthermore, cost-based query optimization is extremely difficult when running over external systems whose costs for different query plans are hidden from the data virtualization engine.

When data virtualization systems started being developed in the 1990s, they almost all used the push-based approach to avoid the network transfer bottleneck. Unfortunately, they generally failed to gain significant adoption due to their limitations in compatibility and query optimization accuracy. In modern times, where high-throughput networks are common, and especially in the cloud, where data often has to be transferred anyway between storage and compute nodes, the pull-based approach has become much more popular.

Push vs. pull: why you should care

The experience of using a push-based vs. a pull-based data virtualization system, in the context of experimenting with new data systems and migrating across them, can be quite different. We will first discuss experimentation with new systems and then migrations.

Data virtualization performance when experimenting with new systems

As described in detail in our previous post, data virtualization is an extremely powerful tool to enable organizations to experiment with new systems. They can store an experimental or overflow dataset in a new system while leaving all the rest of the data in their existing systems, and allow the data virtualization layer to hide the complexities of having data spread out across systems. 

To understand the performance considerations that emerge in this context, let’s use the following example. In this example, all data is stored in an existing system that we will call X, except for some experimental data stored in a new system called E. In reality, most organizations will have data in more than one existing system, but this does not significantly change the discussion below. In this setup, there are three basic types of requests, illustrated with sketch queries after the list:

  1. Requests to access data in X
  2. Requests to access data in E
  3. Requests to access data in both X and E (within the same request)
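
As a concrete sketch (the catalog and table names x, e, sales.orders, and sales.orders_new are hypothetical, written in the catalog.schema.table style that federated engines commonly use), the three request types might look like this:

    -- Type (1): touches only data in X.
    SELECT count(*) FROM x.sales.orders WHERE order_date = DATE '2024-05-01';

    -- Type (2): touches only data in E.
    SELECT count(*) FROM e.sales.orders_new WHERE order_date = DATE '2024-05-01';

    -- Type (3): a single request spanning both X and E.
    SELECT o.customer_id, sum(o.amount) AS total
    FROM x.sales.orders o
    JOIN e.sales.orders_new n ON o.customer_id = n.customer_id
    GROUP BY o.customer_id;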

In many cases, the organization will not use data virtualization for the first type of requests. Since they only access data in X, and existing applications are already interfacing directly with X, there is no need to rewrite these applications to reroute requests through the data virtualization system when, ultimately, they will get forwarded to X anyway. The only reason to do so is if the organization plans to ultimately migrate data out of X (more on this below) or use the data virtualization system for the long term as the primary interface to all data within that organization. 

For what types of requests should data virtualization be used?

In theory, the data virtualization system is not necessary for the requests of type (2) either. Rather, they can be sent directly to the new experimental system. However, there is more reason to use data virtualization for this scenario than for the previous one. Since E is currently an experimental system, it may be removed at any time and replaced with a new experiment. If applications are written to interface with E directly, then they will all have to be rewritten if E gets replaced. However, if they access E’s data through the data virtualization system, they do not need to be modified if the data in E ultimately gets migrated to a new system. They simply send the same requests as before, and the data virtualization layer reroutes them to the new system. However, if the data virtualization system slows down access to data in E (more on this below), then it still may make sense to access data in E directly. 
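
To see why this indirection helps, consider a minimal sketch (assuming an engine that exposes backends under stable catalog names, as Trino-style systems do; the names are hypothetical):

    -- The application only ever references the virtualization layer's name:
    SELECT event_type, count(*) FROM experiments.web.events GROUP BY event_type;

    -- If the data behind the 'experiments' catalog later migrates from E to a
    -- different system, an administrator repoints the catalog configuration;
    -- the application's query is unchanged.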

Nonetheless, the primary reason to use data virtualization is for requests of type (3). When datasets exist in separate systems, they are essentially siloed off from each other, generally requiring access via separate requests. This makes it much harder to get a holistic view of the data within that organization and can reduce the quality of machine learning algorithms that generally work better with more data and a larger variety of datasets.  

As mentioned above, data virtualization gives the illusion to downstream applications that all data is stored within a single system even when in reality it is spread out across multiple systems. By providing a single, unified interface to these applications, it enables requests to span across systems while taking care of the complexity involved in accessing and joining this data. Thus, requests of type (3) are straightforward for data virtualization systems, but generally impossible (without significant additional code at the application level) to perform otherwise.


Push-based vs. pull-based virtualization for requests over a single system

For requests of types (1) and (2), all push-based data virtualization systems work the same way — they simply forward the entire request to the underlying system. In our example above, this would entail forwarding the request to X for type (1) requests and to E for type (2). If the request is not possible to directly forward to the underlying system — for example, if the underlying system uses a different API or dialect of SQL than the data virtualization system — the push-based data virtualization system will attempt to rewrite the request so that it can be submitted via the API of the underlying system. Since push-based systems have detailed knowledge of the underlying systems they are compatible with, the complexity of this rewrite effort is included in the basic functionality of the data virtualization system. 
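
As a sketch of what such a rewrite looks like in practice (the dialect mappings here are illustrative; real systems handle many more constructs):

    -- Submitted to the virtualization layer in its own (ANSI-style) dialect:
    SELECT name FROM customers ORDER BY name FETCH FIRST 10 ROWS ONLY;

    -- Rewritten for an underlying system whose dialect uses LIMIT:
    SELECT name FROM customers ORDER BY name LIMIT 10;

    -- Rewritten for an underlying system whose dialect uses TOP:
    SELECT TOP 10 name FROM customers ORDER BY name;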

In contrast, pull-based systems vary in how they handle requests of types (1) and (2). Some will stay true to their basic pull-based architecture, pulling the raw data out of the underlying system and performing all data processing locally. Other pull-based systems will make an exception in this scenario: since requests of types (1) and (2) only access a single underlying system, the pull-based system will temporarily function like a push-based system and push the request down to the underlying system. However, this is only true if it is possible to simply forward the request. If, on the other hand, the request would require non-trivial rewriting before it could be forwarded, most pull-based systems revert to their pull-based methodology and pull the raw data out, since they generally lack the intimate knowledge of each underlying system’s API needed to perform such rewrites.
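
One way to find out which path a pull-based engine took for a given request is to inspect its query plan. A hedged sketch (Trino-style EXPLAIN; the exact plan format varies by engine and version, and the table names are hypothetical):

    EXPLAIN
    SELECT region, count(*)
    FROM x.sales.orders
    WHERE amount > 100
    GROUP BY region;

    -- If the filter (and ideally the aggregation) appears inside the table
    -- scan operator, it was pushed down to the underlying system. If it
    -- appears as a separate operator above the scan, the engine is pulling
    -- raw rows and doing that work itself.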

Performance pitfalls for single-system requests

Why do you need to know all of this? Simply put, it might affect performance and user experience. If you are using a push-based system, there is usually no performance difference between submitting the request directly to the underlying system and submitting it to the data virtualization system. Since the push-based system forwards the request along, there is no real difference in how it gets processed.

However, if you are using a pull-based system, you might see performance differences if the pull-based system chooses not to simply forward the query to the underlying system. This is because the request is now being processed by an entirely different system: the execution engine inside the data virtualization system. If that engine is more powerful than the underlying system (e.g., it has more parallel processing nodes), the request may run faster; otherwise, it may run slower.

The performance problems that may emerge

A very common reason why requests of type (1) and (2) might be slower when accessed through the data virtualization system is that there is a bottleneck in how fast the data virtualization system can extract data from the underlying system. For example, if the data virtualization system is forced to extract data via a single thread from the underlying system, it doesn’t matter how many processing nodes it has — the extraction process itself will be the performance bottleneck. 

There are tools you can use to potentially get around these bottlenecks. I plan to write a separate post to discuss some of these tools in detail. But to give one example here: sometimes, partitioning the data in the underlying system can help the data virtualization system extract data via multiple threads (one thread per partition). For example, if the underlying system uses a SQL interface, organizing the data on disk into partitions with a PARTITION BY clause can improve the performance of queries sent to it from the data virtualization layer, as in the sketch below. If you are using a pull-based system, it is important to be aware of these potential performance problems and the tools for getting around them.
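
Here is a minimal sketch of that idea using PostgreSQL-style declarative partitioning (the table and column names are hypothetical; the syntax differs across systems):

    -- Partition the table by date so that the virtualization engine can issue
    -- one extraction query, and therefore one reader thread, per partition.
    CREATE TABLE events (
        event_id   BIGINT,
        event_date DATE,
        payload    VARCHAR
    ) PARTITION BY RANGE (event_date);

    CREATE TABLE events_2024_q1 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
    CREATE TABLE events_2024_q2 PARTITION OF events
        FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');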

Although push-based systems will not generally run into these issues, they may run into a different set of issues. Specifically, if the push-based system is not compatible with a particular type of underlying system, one of the following problems may emerge:

  1. It may simply refuse to accept processing requests over data stored in that system.
  2. It may accept the request, but act as a pull-based system in that scenario, where it pulls the raw data out and does the processing locally. However, since most push-based systems are not designed to process large amounts of data locally (since they usually push the processing tasks down to the underlying systems), they often run into scalability bottlenecks in this scenario.


Push-based vs. pull-based virtualization for requests over multiple systems

For requests of type (3), there is no real choice except to use the data virtualization system, since data from multiple systems is being accessed. Although it is theoretically possible to divide such requests by hand and send them to different systems and then write some application code to do the combining, joining, or aggregating across systems, this adds a lot of complexity that data virtualization systems are designed to avoid. Furthermore, such application code is almost always slower than using data systems that are optimized for these types of workloads. 

Nonetheless, some of the performance issues that we discussed above may apply to type (3) scenarios as well. If one of the systems accessed in the type (3) requests is not configured to work with the push-based virtualization being used, it is very likely that scalability issues will emerge. And if a pull-based system is being used instead of a push-based system, some of the tricks used to increase the parallel read performance may be necessary to get optimal performance from the pull-based system. 


Bottom line

The short summary of all of this discussion is the following: of the three request types discussed above, you definitely want to use a data virtualization system for requests of type (3), since otherwise you have to write your own application code to extract, combine, and join data across systems. Furthermore, it is generally a good idea to use it for requests of types (1) and (2) as well, since it greatly facilitates migrating data across systems and plugging and playing with different storage layers.

However, since it remains an option to avoid using data virtualization for these types of requests, it is particularly important to be cognizant of performance changes that occur when routing them through the virtualization system instead of submitting them directly to the appropriate underlying systems. If performance improves by using the data virtualization system, then this is all the more reason to use it. If performance degrades and you are using a push-based virtualization solution, you generally have no recourse except to switch back to sending requests directly to the underlying systems. If you are using a pull-based virtualization system, on the other hand, there are some additional tricks that can be tried, which we will discuss in a future post.


What about Icehouses?

The entire discussion I have presented so far is true for experimentation with and migration to any new system. There will always be three types of requests, and there may be performance pitfalls that emerge. There are two additional things to be aware of when experimenting with or migrating to Icehouses in particular.

First, the Icehouse architecture consists of Iceberg tables managing open format files (e.g., Parquet or ORC), sitting in distributed, scalable storage systems (such as HDFS or S3) working in conjunction with a query processing system that processes client requests over these datasets. Although different query processing systems can be used in this architecture (e.g., Presto, Trino, or Starburst), almost all of them (including all three examples I just listed) additionally function as pull-based data virtualization systems. In other words, they are designed not only to process data in Iceberg but also in many other systems as well. Thus, the Icehouse architecture almost automatically comes with data virtualization. 
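
As a concrete illustration, here is a minimal sketch of how a Trino-style engine can expose Iceberg tables next to a legacy system (property names follow Trino’s Iceberg and PostgreSQL connectors; hostnames and credentials are placeholders):

    # etc/catalog/iceberg.properties -- Iceberg tables in object storage
    connector.name=iceberg
    iceberg.catalog.type=hive_metastore
    hive.metastore.uri=thrift://metastore.example.com:9083

    # etc/catalog/warehouse.properties -- a legacy warehouse exposed alongside
    connector.name=postgresql
    connection-url=jdbc:postgresql://warehouse.example.com:5432/analytics
    connection-user=analyst
    connection-password=example-password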

Iceberg and data virtualization 

All of these data virtualization systems are designed to work with Iceberg, and they are pull-based systems, since Iceberg itself doesn’t have its own query processing functionality that push-based systems could leverage. Indeed, using a push-based data virtualization system with Iceberg is usually a bad idea, since it will be forced to use a pull-based methodology for which it was not designed, and this will often lead to scalability issues (as described above).

Thus, when experimenting with or migrating to Iceberg, the vast majority of organizations will be using pull-based data virtualization architectures, and the discussion above regarding performance pitfalls and things to be aware of in the pull-based context is the more relevant one. Stay tuned for my future post that gives concrete advice on how to resolve some performance issues that may come up when using such systems.

In the meantime, for more details about how push-based and pull-based data virtualization systems work, please see my O’Reilly book on the subject.