Parts of this blog post were co-authored by Ashish Thapliyal, Principal Program Manager, Azure HDInsight.
Starburst Data is excited to have our latest release, Starburst Presto 302e, be a part of the Azure HDInsight Application Platform.
With Starburst’s availability, HDInsight users have access to a truly enterprise-ready version of Presto that includes key security, reliability, and performance enhancements.
Presto is an open-source, fast and scalable distributed SQL query engine that allows you to analyze data anywhere within your organization. Architected for the separation of storage and compute, Presto can easily query data in Azure Blob Storage, Azure Data Lake Storage, SQL and NoSQL databases, and other data sources.
Adding Presto gives HDInsight users two things:
- A fast, scalable, interactive SQL interface to data in Azure Blob and Azure Data Lake Storage.
- An easy way to create queries that integrate data in Azure Blob and Azure Data Lake Storage with other sources by leveraging Starburst Presto’s vast portfolio of data connectors.
Starburst Presto complements other existing open source components on HDInsight such as HBase, Storm, Spark, R, Kafka, and Interactive Query. This further enables customers to use the open source tools most suited for their workloads.
Starburst Presto delivers fast performance (enabled via cost-based query optimization), enhanced security features, and integration with Azure and HDInsight services such as:
- Azure Blob Storage
- Azure Data Lake Storage
- External Hive Metastore
- Microsoft Power BI, for accessing data via Presto
Additionally, in our latest release, 302e, we’ve further improved security, connectivity, and performance.
When we first started working on Presto, enterprise security was a large focus. Over the years, we’ve added security features such as Kerberos, LDAP, encryption in transit, Apache Ranger integration, Apache Sentry integration, and more. Our commitment to security in Presto continues. In 302e, we added security auditing capabilities that log information such as the SQL query and the user who submitted it.
Additionally, users of our Apache Ranger integration can combine Apache Ranger with Solr for further security auditing. We introduced the Apache Ranger integration in mid-2018 and continue to improve it. Lastly, in 302e, we’ve added row-level filtering and the ability to enforce policies on Presto native and user-defined functions.
In 302e, we’ve introduced connectors for, and compatibility with, additional data sources, further extending Starburst’s role as your data fabric and query consumption layer.
- We introduced a generic JDBC Connector that lets you connect to JDBC data sources that are not included as named connectors in Presto. This is useful when Presto does not yet provide a dedicated connector for a source such as Netezza, DB2, Vertica, Greenplum, and many others. We’ll continue to provide updates on our roadmap for additional connectors, and we’re always interested in hearing your feedback on which connectors to add next.
- The Elasticsearch Connector provides access to Elasticsearch data from Presto. It was contributed to the Presto community, and we now officially support it.
- In addition to connectors, we also recognize the value of extending Presto’s function compatibility. In 302e, we added Oracle compatibility functions, allowing your Oracle users to use the functions they are comfortable with.
Query Performance & Scale
Presto was designed from the ground up with performance and scale in mind, and our latest release continues to improve in these areas. In 302e, we added table and column statistics collection via a new ANALYZE command. The command is native to Presto, so you no longer need a Hive installation to collect statistics. Statistics are required to take advantage of the Presto Cost-Based Optimizer, a feature we added at the beginning of 2018. Previously, you had to use the Hive runtime to collect them. Now the ANALYZE command collects them using Presto’s own execution engine, which greatly simplifies your Presto installation.
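To make the statistics workflow a little more concrete, here is a minimal Python sketch that composes the SQL text you might send to Presto. The table name `hive.default.sales` and the helper names are hypothetical. The `partitions` property is documented for the open source Hive connector, so treat its availability in 302e as an assumption.

```python
from typing import Optional, Sequence


def build_analyze_statement(
    table: str,
    partitions: Optional[Sequence[Sequence[str]]] = None,
) -> str:
    """Return the text of a Presto ANALYZE statement for `table`.

    `partitions` optionally limits collection to specific partitions,
    each given as a list of partition-key values (assumed Hive
    connector property; verify against your release).
    """
    if not partitions:
        return f"ANALYZE {table}"
    rendered = ", ".join(
        "ARRAY["
        + ", ".join("'" + value.replace("'", "''") + "'" for value in part)
        + "]"
        for part in partitions
    )
    return f"ANALYZE {table} WITH (partitions = ARRAY[{rendered}])"


def build_show_stats_statement(table: str) -> str:
    """Return a SHOW STATS statement to inspect the collected statistics."""
    return f"SHOW STATS FOR {table}"


# Hypothetical table name, used only for illustration.
full_table = build_analyze_statement("hive.default.sales")
one_day = build_analyze_statement("hive.default.sales", [["2019-01-01"]])
```

Running the generated ANALYZE statement and then the SHOW STATS statement from the Presto CLI is one way to check that statistics are in place before relying on the optimizer.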
We have also extended Presto’s spill-to-disk feature, previously used by joins and aggregations, to support SQL ORDER BY and window functions. This lets users scale these types of queries to extremely large amounts of data: intermediate results are written to disk when there is not enough aggregate memory to keep processing entirely in memory.
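The idea behind spilling an ORDER BY can be illustrated with a classic external merge sort. The Python sketch below is a conceptual illustration only, not Presto’s implementation. The function names and the `max_rows_in_memory` limit are made up for the example.

```python
import heapq
import pickle
import tempfile
from typing import IO, Iterable, Iterator, List


def _write_run(rows: List[int]) -> IO[bytes]:
    """Write one sorted run to a temporary file and rewind it."""
    spill_file = tempfile.TemporaryFile()
    for row in rows:
        pickle.dump(row, spill_file)
    spill_file.seek(0)
    return spill_file


def _read_run(spill_file: IO[bytes]) -> Iterator[int]:
    """Stream rows back from a spilled run, one at a time."""
    while True:
        try:
            yield pickle.load(spill_file)
        except EOFError:
            spill_file.close()
            return


def order_by_with_spill(
    rows: Iterable[int], max_rows_in_memory: int
) -> Iterator[int]:
    """Sort `rows`, spilling a sorted run to disk whenever the buffer fills."""
    runs: List[IO[bytes]] = []
    buffer: List[int] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_rows_in_memory:
            buffer.sort()
            runs.append(_write_run(buffer))
            buffer = []
    buffer.sort()
    # Merge the in-memory remainder with every spilled run; the merge
    # only holds one row per run at a time.
    return heapq.merge(buffer, *(_read_run(run) for run in runs))


sorted_rows = list(order_by_with_spill([5, 3, 9, 1, 7, 2, 8], max_rows_in_memory=3))
```

The trade-off is the usual one for spilling: the query can complete on inputs larger than memory, at the cost of extra disk I/O.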
Starburst Presto (302e) on Azure HDInsight can be found on the Azure Marketplace.
The rest of this post describes architecture concepts and how to get started with Starburst Presto on Azure HDInsight.
How it works
Presto is deployed as an application on Azure HDInsight and can be configured to immediately start querying data in Azure Blob Storage or Azure Data Lake Storage. The external metastore feature on HDInsight allows Presto to share metadata with other clusters, such as Hive and Spark.
Presto for HDInsight can be used with tools such as Microsoft’s Power BI and Tableau. We’ve also packaged it with the open source Apache Superset business intelligence tool, which is installed and running automatically when you choose Presto on HDInsight. It’s an easy visual way to try out Presto for the first time.
The Presto Coordinator and Worker architecture is similar to HDInsight’s head node and worker node architecture. When deploying Presto as an application on HDInsight, the Presto coordinator is deployed to one of HDInsight’s head nodes and the Presto workers are deployed on HDInsight’s worker nodes. Additionally, an edge node is deployed that contains the Presto Command Line Interface (CLI) and Apache Superset.
One incredibly useful feature is the ability to connect to an external Hive Metastore. It shares metadata between different tools such as Presto, Hive, and Spark, and it’s independent of the Presto cluster lifecycle. This allows you to shut down the Presto HDInsight cluster when not in use to save costs.
Another useful feature in HDInsight is the ability to manage clusters by scaling up or down. Presto was architected from the ground up for separation of storage and compute. This allows Presto to work seamlessly with HDInsight to elastically scale up or down depending on your business demands.
Starburst also provides a set of Script Actions for common operations such as updating the Presto configurations. For example, you may want to configure a new connector to query from Microsoft SQL Server. This allows you to run federated queries between the RDBMS and Azure Blob storage. You can read more about the script actions in our documentation.
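As a rough sketch of what such a federated setup could look like, the Python snippet below renders a SQL Server catalog file and holds an example query. The host, database, credentials, table names, and catalog names are all placeholders. The property keys follow the open source Presto SQL Server connector, and the way the file is deployed on HDInsight (for example via a script action) is not shown.

```python
from typing import Dict


def render_catalog(properties: Dict[str, str]) -> str:
    """Render catalog properties as the text of a .properties file."""
    return "".join(f"{key}={value}\n" for key, value in properties.items())


# Hypothetical contents of a catalog file named sqlserver.properties.
SQLSERVER_CATALOG = render_catalog({
    "connector.name": "sqlserver",
    "connection-url": "jdbc:sqlserver://example-host:1433;database=sales",
    "connection-user": "presto_reader",
    "connection-password": "<password>",
})

# Join order data in Azure Blob Storage (hive catalog) with customer
# records in SQL Server (sqlserver catalog) in a single statement.
# All table and column names here are made up for the example.
FEDERATED_QUERY = """\
SELECT c.region, sum(o.total) AS revenue
FROM hive.default.orders AS o
JOIN sqlserver.dbo.customers AS c
  ON o.customer_id = c.id
GROUP BY c.region
"""
```

Each catalog appears in the query as the first part of the fully qualified table name, which is what lets one statement span both systems.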
Getting started with Presto on HDInsight
Starburst Presto can be selected as an application on Azure HDInsight. Simply choose Starburst Presto and continue with the HDInsight setup.
Additionally, Starburst Presto on Azure HDInsight can be found on the Azure Marketplace, which redirects you to the Azure Portal with the Presto for HDInsight parameters pre-filled, creating an even simpler setup experience.
Once HDInsight and Presto are deployed, you can view Presto as an installed application.
Configuring Starburst Presto support for Azure Data Lake Storage and Azure blobs
Presto for HDInsight can be configured to query Azure Blob Storage and Azure Data Lake Storage. Azure Blobs are accessed via the Windows Azure Storage Blob (WASB) layer, which is built on top of the HDFS APIs and allows for the separation of storage from the cluster. This is key for scaling Presto and HDInsight independently of storage. During setup, simply choose the desired storage account and we’ll configure it automatically for you.
Our solution will automatically configure Presto to read from Blob Storage. However, if you were to configure it manually, your hive.properties configuration file would contain:
Similarly, for Azure Data Lake Storage, your hive.properties configuration file would contain:
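For illustration, the Python sketch below renders what a manually written hive.properties might look like in each case. The `hive.azure.*` keys are taken from the open source Presto Hive connector’s Azure storage settings, and whether 302e on HDInsight uses exactly these keys is an assumption. The metastore URI, account names, and credentials are placeholders.

```python
def hive_catalog_for_blob_storage(
    metastore_uri: str, storage_account: str, access_key: str
) -> str:
    """Render a hive.properties body for Azure Blob Storage (WASB) access."""
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri={metastore_uri}",
        # Assumed keys from the open source Hive connector's Azure settings.
        f"hive.azure.wasb-storage-account={storage_account}",
        f"hive.azure.wasb-access-key={access_key}",
    ]
    return "\n".join(lines) + "\n"


def hive_catalog_for_data_lake(
    metastore_uri: str, client_id: str, credential: str, refresh_url: str
) -> str:
    """Render a hive.properties body for Azure Data Lake Storage access."""
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri={metastore_uri}",
        # Assumed keys; the values come from an Azure AD service principal.
        f"hive.azure.adl-client-id={client_id}",
        f"hive.azure.adl-credential={credential}",
        f"hive.azure.adl-refresh-url={refresh_url}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values only; substitute your own account details.
blob_config = hive_catalog_for_blob_storage(
    "thrift://example-metastore:9083", "<storage-account>", "<access-key>"
)
adls_config = hive_catalog_for_data_lake(
    "thrift://example-metastore:9083",
    "<client-id>",
    "<client-secret>",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)
```

In practice the automatic setup described above makes this unnecessary. The sketch is only meant to show which kinds of settings are involved.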
Conclusions and next steps
HDInsight provides a number of open source engines. Each tool has a specific fit depending on the use case. If interactive SQL analytics are needed, Presto is the best fit. Additionally, Presto has the unique ability to federate across different data sources in Azure.
Starburst Presto contains many security integrations such as LDAP/AD, data encryption in transit and at rest, and Apache Ranger integration. In a future release, we will integrate with the HDInsight Enterprise Security Package to automatically configure security for you in a few simple clicks.
At Starburst and Microsoft, we’re committed to advancing the open source Presto project. Give Presto a try today on Azure Marketplace!