Starburst Galaxy helps data engineers manage their data pipelines, but this is just the start of what it can do! As a Developer Relations Engineer, I recently created a project to show how app developers can benefit from adding Starburst Galaxy to their workflow as well. To do this, I used Starburst Galaxy and Trino to feed the backend system of data applications (Data Apps).
This blog shows you how Starburst Galaxy integrates with common software development tools. Along the way, I’ll highlight how this tech stack improves query performance when building embedded data applications. Why does this matter? Using Starburst Galaxy in your data app unlocks a new area of value, expanding the reach of Starburst Galaxy beyond data engineers and making it appealing to app developers. So if you’re a developer who wants to use more data in your apps, or a data engineer connecting a data pipeline to an application, this blog is for you.
What are data applications?
Data applications are independent, custom applications that derive analytical insights from large datasets, often spanning multiple sources. They are typically used to help drive business decisions, and they are distinct from traditional BI applications or dashboards, which may not meet specific internal requirements around access rights or around how data sources are displayed and accessed. Essentially, data applications represent the next step beyond standard dashboards.
These applications are characterized by data visualizations and reactive user interfaces, packaged together into an application that can deliver meaningful findings and trends.
Importantly, data applications are separate from the source systems that generated the data. They make use of the data from these systems, but they remain distinct, custom applications. You can think of them as living in the ecosystem of the source system: separate from it, yet dependent on its large datasets and its output. In this sense, data applications are one of the most practical applications of big data, because they let data pipelines power applications directly enjoyed by users.
For example, imagine a custom-built application that draws in sales data from a CRM source system. This data is used to display product performance, sales metrics, and sales forecasts using regression models. This is a data application.
Starburst data applications project
This project is designed for software developers who need a data handling solution that integrates modern front-end technologies with a data-heavy backend system. Importantly, this combination creates a valuable synergy, helping to put data directly into the hands of users. By leveraging the power of Trino alongside modern languages like JavaScript and TypeScript, data app developers have a powerful new tool at their disposal.
This project focuses on developing a rideshare analytics dashboard that derives insights about rideshare and traffic data points in NYC boroughs throughout 2022, specifically analyzing how location, weather, and time variability affect these data points.
Accessing Trino with Clients
Clients and drivers act as an essential link between various applications, tools, and technologies. Put simply, compatible client drivers are what make third-party integrations possible and create a more accessible environment for the technical community to enjoy both Trino’s and Starburst Galaxy’s features.
In this project, we use a client driver to bring the power of Trino’s SQL query engine to the data requests and data management in our backend system. To create this connection, we use the trino-js-client, a tool built on top of Trino’s native REST API. In essence, we use JavaScript in a Node.js environment to connect to a Trino server configured in Starburst Galaxy, which gives us access to the data organized in Galaxy Catalogs. Please reference this page for more information on all the clients supported by Trino.
This client driver was created for the Trino Software Foundation by software developer Felipe Regadas, and it is what made the dynamic visualizations in our dashboard data application possible. I’ll cover this process in more depth in Part 2 of this blog series, so stay tuned!
Creating a data application
This project represents an exciting use case of Starburst, opening the technology to a new type of developer. For this project, I used open-source New York City rideshare data available through Kaggle. Specifically, I employed the Kaggle NYC Rideshare Analysis and Kaggle NYC Weather Data.
My approach followed the pattern below.
Using the trino-js client
To power our data application, I needed a few things. First, I needed to read data from a data lakehouse using Starburst Galaxy. Second, I needed to serve that data through our backend system.
To achieve this, I used the trino-js client. The client abstracts away the complexities of interacting directly with the Trino REST API to request and receive data, making our job easier. At the same time, I was able to take advantage of Trino’s powerful SQL engine while avoiding a manual server setup or dealing with the under-the-hood operations of the REST API itself.
Next, I followed the trino-js technical documentation and created a working Trino instance by populating an object, `const trino`, with the correct parameters. These parameters include authentication for Starburst Galaxy with a username, password, and the correct server URL.
Using Partner Connect
The server connection information used in the Trino object instance can be found in the Drivers and Clients section of the Partner connect area of Starburst Galaxy. After clicking on the Connection Info drop-down menu, make sure to select the correct, actively running cluster.
To continue setting up the connection, I used the Trino.create() method to instantiate the Trino object once it was populated with the correct parameters. As a result, this object, which resides in our backend infrastructure, has full queryable access to the data stored in the data lakehouse, allowing our application to efficiently retrieve and manage data to serve to our front end.
Our solution is included below. Importantly, this code contains the parameters needed to establish a working connection for the data app.
// Trino client connected to the Starburst Galaxy cluster.
// The trino-js client is published on npm as `trino-client`; GALAXY_USERNAME
// and GALAXY_PASSWORD are read from environment variables.
const { BasicAuth, Trino } = require('trino-client');

const auth = new BasicAuth(process.env.GALAXY_USERNAME, process.env.GALAXY_PASSWORD);

const sslOptions = {
  rejectUnauthorized: true, // reject connections whose certificates cannot be verified
};

const trino = Trino.create({
  server: 'https://emilysunaryo-free-cluster.trino.galaxy.starburst.io',
  catalog: 'nyc_uber_rides',
  schema: '',
  auth: auth,
  ssl: sslOptions,
});
Note: With the trino-js client, I needed to manually set the SSL options so that unauthorized connections are rejected.
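With the connection in place, running SQL from the backend comes down to calling query() on the client and consuming the results. Here is a minimal sketch of that pattern based on the client’s documented API; the helper name and the example query are placeholders rather than the project’s actual code.

// Minimal sketch: run a SQL statement through the client created above.
// query() returns an async iterator over result pages; each page exposes its
// rows on `data`, which we fold into a single array of rows.
async function runQuery(sql) {
  const iter = await trino.query(sql);
  return iter
    .map((page) => page.data ?? [])
    .fold([], (pageRows, acc) => [...acc, ...pageRows]);
}

// Example usage with a placeholder query:
// runQuery('SELECT 1').then((rows) => console.log(rows));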
Configuring catalogs in Starburst Galaxy
Next, it was time to configure the catalog. After creating this connection, I needed to ensure that the data was properly loaded and that the catalogs were correctly configured in Galaxy. To do this, I started by uploading the NYC rideshare analytics data to an AWS S3 bucket, which acted as our data lake source. After this, I needed to configure the catalog connection with the correct authentication to S3. My colleague Monica Miller helped me with this essential step, so many thanks to her.
You can see the catalog connected to Starburst Galaxy in the image below.
Using Starburst Galaxy schema discovery
Next, it’s time to find your data. After creating a catalog for our data, I used Starburst Galaxy’s schema discovery feature to register the data stored in our AWS environment in tabular format.
Note: For more information on this process, check out the Starburst tutorial page.
Now that our data was successfully connected, I was ready for the next step. I transformed the 11GB of rideshare data by extracting relevant tables, filtering by year, and standardizing values. For further information on Starburst’s data transformation capabilities, please reference this blog.
You can see the data landing in Starburst Galaxy below.
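To give a rough sense of what these transformations look like in practice, here is a hypothetical sketch of one such query run through the same trino client; the table and column names are placeholders, not the project’s actual schema.

// Hypothetical transformation sketch: filter the raw rideshare data down to
// 2022 and standardize column names into a cleaned-up "silver" table.
// Placeholder table and column names only.
async function buildSilverTable() {
  const iter = await trino.query(`
    CREATE TABLE silver_schema.nyc_rideshare_2022 AS
    SELECT
      CAST(request_datetime AS timestamp) AS request_datetime,
      base_passenger_fare AS passenger_fare,
      driver_total_pay,
      trip_miles AS trip_length
    FROM bronze_schema.nyc_rideshare_raw
    WHERE year(CAST(request_datetime AS timestamp)) = 2022
  `);
  // Consuming the iterator drives the statement to completion.
  for await (const page of iter) {
    void page;
  }
}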
Query testing
Next, it was time to clean up the data and write some queries. Specifically, I needed a query that returned the number of ride requests, the average passenger fare, the average driver total pay, and the average trip length for each day of the week, so that I could pick out the busiest day.
Here’s the query that I came up with.
SELECT
  CASE
    WHEN dayofweek = 1 THEN 'Sunday'
    WHEN dayofweek = 2 THEN 'Monday'
    WHEN dayofweek = 3 THEN 'Tuesday'
    WHEN dayofweek = 4 THEN 'Wednesday'
    WHEN dayofweek = 5 THEN 'Thursday'
    WHEN dayofweek = 6 THEN 'Friday'
    WHEN dayofweek = 7 THEN 'Saturday'
  END AS days_of_week,
  COUNT(*) AS ride_requests,
  ROUND(AVG(passenger_fare), 2) AS avg_passenger_fare,
  ROUND(AVG(driver_total_pay), 2) AS avg_driver_total_pay,
  ROUND(AVG(trip_length), 2) AS avg_trip_length
FROM
  silver_schema.nyc_rideshare_peak_request_times
GROUP BY
  dayofweek
ORDER BY
  ride_requests DESC
LIMIT 1;
Importantly, this query returns several key results: the number of ride requests, the average passenger fare, the average driver total pay, and the average trip length. The CASE expression maps each numeric day value to the name of its day of the week, and the ORDER BY clause sorts the days by ride requests in descending order, so the LIMIT 1 returns only the day with the highest number of ride requests.
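In the data app’s backend, a query like this is executed through the trino-js client and reshaped into labeled objects the dashboard can chart. The sketch below shows the general idea with hypothetical names; the real implementation is the subject of Part 2.

// Hypothetical sketch: run the day-of-week query shown above and label each
// row with its column names so the front end can consume it as JSON.
// `dayOfWeekQuery` is assumed to hold the SQL string from this section.
async function getBusiestDay(dayOfWeekQuery) {
  const iter = await trino.query(dayOfWeekQuery);
  let columnNames = [];
  const rows = [];
  for await (const page of iter) {
    if (page.columns) {
      columnNames = page.columns.map((column) => column.name);
    }
    for (const row of page.data ?? []) {
      rows.push(Object.fromEntries(row.map((value, i) => [columnNames[i], value])));
    }
  }
  // With LIMIT 1 the result is a single object, for example:
  // { days_of_week: '...', ride_requests: ..., avg_passenger_fare: ..., ... }
  return rows[0];
}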
Developing a codebase: The next step
With those steps complete, the data was now in the desired format. The next step was to develop the main codebase for the application. But that, as they say, is a story for another day.
To learn more about the actual implementation of the trino-js client within a React Application, stay tuned for my next blog.
~ Special thanks to Felipe Regadas for leading development on the trino-js client and graciously donating it to the Trino Software Foundation. ~
Open Source Datasets
You can check out the datasets I used below: the Kaggle NYC Rideshare Analysis and the Kaggle NYC Weather Data.