Best-in-class SQL query functionality has always been and remains a fundamental principle that defines Starburst’s query engine. With the recent implementation of full query passthrough via polymorphic table functions, Starburst’s powerful query engine has even more flexibility to provide functionality that was previously impossible. We recently shared this at the Comcast Data Fabric Summit, and hope to share it with you all here as well.
Full SQL passthrough (available starting SEP 393), makes Starburst an even more powerful data mesh engine than before by opening up the door to full query passthrough. Using full query passthrough, you’re able to take advantage of underlying syntax of other systems (to avoid needing to migrate really-hard-to-move queries), and capitalize on the underlying performance capabilities that are unique to each system to enhance your SELECT queries.
Comparison of Full Query Passthrough vs. Pushdown
In the past, one would use Trino to query external tables. View this example querying a table in Teradata below.
SELECT l.orderkey, SUM(l.extendedprice * (1 - l.discount)) AS revenue, o.orderdate, o.shippriority FROM teradata_direct.tpch.customer c, teradata_direct.tpch.orders o, teradata_direct.tpch.lineitem l WHERE mktsegment = 'BUILDING' AND c.custkey = o.custkey AND l.orderkey = o.orderkey AND o.orderdate < date('1995-03-16') -- MDY(3, 15, 1995) AND l.shipdate > date('1995-03-16') -- MDY(3, 15, 1995) GROUP BY l.orderkey, o.orderdate, o.shippriority;
Using full query passthrough with the Teradata JDBC connector as an example, one can pass through a query as follows, therefore executing the query within the underlying system and capitalizing on performance enhancements.
SELECT * FROM table (teradata_jdbc.system.query( 'SELECT l.orderkey, SUM(l.extendedprice * (1 - l.discount)) AS revenue, o.orderdate, o.shippriority FROM tpch.customer c, tpch.orders o, tpch.lineitem l WHERE mktsegment = ''BUILDING'' AND c.custkey = o.custkey AND l.orderkey = o.orderkey AND o.orderdate < date''1995-03-16'' -- MDY(3, 15, 1995) AND l.shipdate > date''1995-03-16'' -- MDY(3, 15, 1995) GROUP BY l.orderkey, o.orderdate, o.shippriority ' ));
Comparing the execution times between each query, the Teradata full SQL query passthrough is 2x faster than Teradata connector pushdown (5 seconds v 9 seconds).
Query with full SQL query passthrough:
Query with Teradata connector pushdown:
Analyzing the query plan of querying an external table directly via Trino (screenshot below) shows why it is often faster to invoke full query passthrough. Querying external tables executes the query as follows:
Executing the query directly via Trino requires five separate fragments. In the first stage (Fragment 4), the source reads from a table and produces a hash. In the second stage (Fragment 3), Trino scans another table and creates another hash. In the third stage (Fragment 3), Trino reads from table orders and produces a hash. In the fourth stage (Fragment 2), Trino reads another table and produces another hash. In the fifth stage (Fragment 1), Trino reads a table and joins from customers table hash and orders table hash. In the final stage (Fragment 0), Trino does the final join and then runs partial and final aggregations over the results.
How to Enable Access Controls
To operationalize full query passthrough with access controls, one will need to set Ranger or SEP Built-In-Access Controls to allow query execute privileges. This enables the query to be “passed through” to the underlying system. Access controls for query passthrough are available in SEP v391 and above. Users will need to enable Execute Function privilege (as defined in Ranger or BIAC).
Query execute functionality in Ranger:
Query execute functionality in Built-In-Access:
There are two ways to secure access to what data users are able to query. First, one may grant query execute functionality to admins, who may then create views of the data that others are able to access. We suspect the more common way to use full query passthrough is by enabling user impersonation in the connector, and having underlying engines’ access controls kick in.
Example of how to enable user impersonation in Teradata connector below:
You can then enable access to data tables in the underlying engine. If access is not granted on secured datasets, then passthrough queries will fail as follows:
When is this really awesome?
We invite you to reach out to us and share scenarios where you use query passthrough. But below are common scenarios where we think you might love query passthrough:
- Forcing a join down to the data source
- Forcing a predicate pushdown down to the data source
- Non-ANSI functions: Any time a data source specific (non-ANSI) function needs to be used, this will allow the SQL to pass through the coordinator without being ANSI validated.
- Reference tables that do not conform to Starburst’s 3 part table name (catalog.schema.table) pattern. Data sources like Snowflake, BigTable, postgres, SQL Server, etc can have their own internal 3 part table names (warehouse.schema.table) that would make adding Starburst’s catalog name a total of 4 parts.
- Data source processing push downs are not happening as expected. Potentially could be due to:
- complex grouping operations such as ROLLUP, CUBE, or GROUPING SETS
- expressions inside the aggregation function call: sum(a * b)
- coercions: sum(integer_column)
- aggregations with ordering
- aggregations with filter
When to not use query passthrough:
Full query passthrough won’t be recommended in all scenarios, key instances as follows:
- If table function invalidates statistics usage to the point that the CBO yields an inefficient plan.
- If the table function invalidates the usage of data source table partitions (available in data sources like Teradata, Oracle, etc) which dramatically speed up table access.
- Low latency workloads without the need of the table function. The increased planning/scheduling time could yield an overall slower wall clock time of a query versus allowing the data source connector to operate as normal, assuming both the table function and connector logic would equally pass the same thing down (join or predicate).
- Tableau auto-generated queries will not automatically leverage full query passthrough
- Full query passthrough will only work on read (SELECT) queries. They will not work on write (INSERT INTO) queries.
Try out full query passthrough to experience this new functionality from the recent addition of polymorphic table functions. Pick your favorite connector and compare the results yourself. Full query passthrough is available in the following connectors: Teradata JDBC, Snowflake, Druid, MemSQL, mySQL, Oracle, PostgreSQL, Redshift, SQLServer, MariaDB, SingleStore, and ElasticSearch.