Analysts are often tasked with deriving insights for business units whose data spans multiple locations. This is increasingly true today, as the need to make rapid, data-driven decisions grows and value is measured in time to market. A few example use cases that we have seen at Starburst include:
- Financial services analysts modeling anti-money laundering and financial crimes behavior require access to siloed data spanning business units in order to trigger alerts.
- Retail companies combining customer spending patterns in a data warehouse with usage patterns from the data lake to drive cross-sell and up-sell opportunities and create new revenue streams.
- Commercial banks improving risk-based decision-making with timely access to data from multiple source systems.
In each of the scenarios above, timely access to data is critical.
In addition to the business use cases above, the reality of today’s complex data landscape often gives rise to strategic technical initiatives that result in the adoption of multiple cloud or on-prem platforms. For example:
- Multi-cloud environments
- Hybrid on-prem and cloud environments
- Geographically diverse on-prem or cloud sources that arise as a consequence of data sovereignty issues or for historical reasons such as M&A activity or siloed departmental initiatives
Data lakes, whether on-premises or in the cloud, are often the storage platform of choice for analytical workloads. For practical reasons, however, it often isn’t feasible to aggregate all of the needed data in the lake. That in turn gives rise to a requirement, as noted in some of the use cases above, to access data from additional sources such as traditional enterprise data warehouses or databases, or even from data lake to data lake, in order to fulfill many business intelligence or predictive modeling initiatives. Enriching data from a low-cost data lake with data from an enterprise data warehouse or database is a common pattern that we have seen. Traditionally, solving this problem would have meant migrating data from each source to a common repository for access. “Data federation”, although not a new concept, provides an alternative in which multiple disparate sources can be combined on the fly, in an ad hoc manner, within a single query for analysis.
Starburst Enterprise achieves federation by creating a SQL-based “query fabric” over the top of these sources, as well as other systems such as NoSQL and document databases, by combining over 40 connectors with a high-performance MPP execution engine. Performance is an important factor in ensuring efficiency and a positive user experience for data analysts and scientists. Several factors contribute to good performance, including high-speed connectors to data sources, cost-based query optimization, query/predicate pushdown, and the execution engine itself, all of which are part of the Starburst Enterprise platform (SEP).
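As a sketch of what federation looks like in practice, the hypothetical query below joins a fact table in a data lake catalog with a dimension table in a relational catalog. The catalog, schema, table, and column names (`lake.sales.orders`, `pg.crm.customers`, and so on) are illustrative assumptions, not from a real deployment:

```sql
-- Hypothetical catalogs: "lake" (object storage) and "pg" (PostgreSQL).
-- A single SQL statement combines both sources on the fly.
SELECT c.customer_segment,
       sum(o.order_total) AS total_spend
FROM lake.sales.orders AS o       -- fact data in the low-cost data lake
JOIN pg.crm.customers AS c        -- dimension data in an enterprise database
  ON o.customer_id = c.customer_id
GROUP BY c.customer_segment;
```

No data is migrated ahead of time; the engine reads from each source through its connector and performs the join itself.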
In a future blog post, we will discuss Starburst Stargate, shown as part of the diagram above, which allows multiple Starburst clusters to talk to each other and enables processing data where it resides.
In this post, we discuss an additional Starburst Enterprise feature, Dynamic Filtering over relational systems and object stores, that further enhances performance, in particular for data federation use cases. Dynamic filtering can significantly reduce the amount of data transferred from the source to the Starburst cluster for processing, which in turn yields a large performance boost when data transfer is a significant portion of overall query execution time. Leveraging table statistics, dynamic filtering loads the smaller of the two tables into memory and fetches only the rows of the larger table with matching keys by prefiltering at the source. This is conceptually similar to an inner join in a traditional EDW or RDBMS, where rows with unmatched keys are dropped following the table scans. However, a traditional inner join can only be performed when selecting from a single source. By extending dynamic filtering over both data lakes and relational databases, Starburst Enterprise achieves similar performance gains for federated queries that combine data between a data lake and a data warehouse, as well as for simple queries on a single source.
The results of dynamic filtering optimization can include the following benefits:
- improved overall query performance
- reduced network traffic between Starburst and the data source
- reduced load on the remote data source
Consider the following query, which captures a common pattern: a fact table store_sales joined with a filtered dimension table date_dim. (Note: if you aren’t familiar with how joins are performed in Starburst or Trino, take a look at this episode of the Community Broadcast.)
SELECT count(*)
FROM store_sales JOIN date_dim
ON ss_sold_date_sk = d_date_sk
WHERE d_following_holiday = 'Y' AND d_year = 2000;
Without dynamic filtering, Starburst pushes predicates for the dimension table down to the table scan on date_dim, but it scans all of the data in the fact table, since there are no filters on store_sales in the query. The join operator ends up discarding most of the probe-side rows because the join criteria are highly selective.
When dynamic filtering is enabled, Starburst collects candidate values for the join condition from the processed dimension table on the right (build) side of the join. In the case of broadcast joins, the runtime predicates generated from this collection are pushed into the local table scan on the left (probe) side of the join running on the same worker.
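Conceptually, the runtime predicate behaves as if the fact-table scan had been rewritten with a list of the collected dimension keys. The rewrite below illustrates that effect; it is not literal output of the engine:

```sql
-- Effective probe-side scan once the dynamic filter is in place:
-- only fact rows whose keys match the filtered dimension are fetched.
SELECT count(*)
FROM store_sales
WHERE ss_sold_date_sk IN (
    SELECT d_date_sk
    FROM date_dim
    WHERE d_following_holiday = 'Y' AND d_year = 2000
);
```

Because the keys are collected at runtime and applied at the source, the bulk of the fact table never leaves the data source at all.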
Dynamic filtering is enabled by default and can significantly reduce query execution times. For analytics at scale in complex environments that include query federation, hybrid on-prem/cloud deployments, or multi-cloud environments, dynamic filtering is an important Starburst feature that executes automatically to improve the user experience and reduce execution times.
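Since the feature is on by default, no configuration is usually needed, but it can be toggled per session, and EXPLAIN ANALYZE can be used to inspect the dynamic filters collected for a query. The sketch below reuses the example tables from above:

```sql
-- Dynamic filtering is on by default; the session property toggles it.
SET SESSION enable_dynamic_filtering = true;

-- EXPLAIN ANALYZE reports the dynamic filters applied to each table scan.
EXPLAIN ANALYZE
SELECT count(*)
FROM store_sales JOIN date_dim
ON ss_sold_date_sk = d_date_sk
WHERE d_following_holiday = 'Y' AND d_year = 2000;
```

Comparing the scanned-row counts on store_sales with and without the session property is a quick way to see the reduction in data read from the source.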