Background

Across multiple industries, organizations are choosing Starburst Galaxy to power a range of data applications. From real-time cyber threat prediction software to AI-driven recommendation engines, Starburst Galaxy makes it easy to access your data wherever it lives. Whether you're serving your first one hundred users or your next thousand, teams can scale without incurring prohibitive operating costs or sacrificing growth.

Starburst Galaxy

In this tutorial, we will walk you through the process of creating a customer analysis application using Starburst Galaxy and PyStarburst. In this architecture, Starburst Galaxy will act as the engine for collecting, analyzing, governing, and distributing data to users.

PyStarburst and Ibis

Meanwhile, PyStarburst brings Python functionality to Starburst, with Ibis serving as a toolkit for expressing analytical queries.

OpenAI - ChatGPT

Finally, we've integrated OpenAI so that the data app can answer basic questions asked in natural language, bringing ChatGPT-powered AI to the data served by Starburst Galaxy.

Scope of tutorial

In this tutorial, all of the required code is provided for you. To make use of it, you will need to clone the GitHub repository. You will also need to edit the code to make it unique to your environment. Finally, as this tutorial uses AI, you will also need to generate an OpenAI API key to allow connectivity between the data application and ChatGPT.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

Background

Before you jump into the tutorial, it's important to understand the functionality of the customer analysis app that you're going to build.

This data application will provide several core functions.

Core functions

Overall Goal

The ultimate goal of the data app is to understand customer segmentation by state and risk appetite.

Background

It's time to get started. You're going to begin by cloning the GitHub repository. It contains all of the code you need for this tutorial.

Step 1: Clone pystarburst-examples repository

The pystarburst-examples GitHub repo contains all of the code needed to complete this tutorial. You are going to clone it to gain access to this code.

git clone https://github.com/starburstdata/pystarburst-examples.git

Step 2: Rename .env.template file

Next, you need to change to the pystarburst-examples/apps/gradio/customer_360_ml directory so that you can rename the .env.template file. In this step, you'll remove the .template suffix from the file name.

Later in this tutorial, you'll edit this file and add your own environment details.

cd pystarburst-examples/apps/gradio/customer_360_ml
mv .env.template .env

Background

Now that you've cloned the pystarburst-examples repository, we strongly suggest that you review the contents of the repo to understand the purpose of the code inside it. To help you, all code files contain comments for your reference.

Review architecture

Understanding the proposed architecture is an important step when building anything. The same is true of the data app you're building in this tutorial.

The following diagram shows the high-level architecture of the app.

Review this diagram and the corresponding notes below for more details about the main components of the app.

Gradio

Gradio is a Python library that allows you to quickly create a user interface (UI) for machine learning models. It provides a simple way to build web-based applications that interact with your models, letting users input data, make predictions, and see the results in real time.

Review the app.py file to see how Gradio is being used. You can find more information about Gradio and its API on gradio.app.
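As a rough illustration (not the repo's actual app.py), a minimal Gradio interface wires a Python function to web inputs and outputs; the function name and labels below are hypothetical:

import gradio as gr

# Hypothetical example; the real UI lives in app.py
def customer_summary(state):
    # In the real app this would query data via PyStarburst
    return f"Showing customer segments for {state}"

demo = gr.Interface(
    fn=customer_summary,                  # Python function the UI calls
    inputs=gr.Textbox(label="State"),     # Web form input
    outputs=gr.Textbox(label="Summary"),  # Rendered result
    title="Customer 360 demo",
)

demo.launch()  # Serves the interface locally in your browser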

PyStarburst

PyStarburst brings the power and flexibility of Python to Starburst Galaxy. You are going to use PyStarburst to tie together all of the components of your data app.

To prepare for this, review the dataModels.py file to understand how PyStarburst will be used to build your data app. Take note of how the app uses Python DataFrames: PyStarburst expresses the transformations, but all of the heavy lifting is pushed down to Starburst Galaxy.

Importantly, the code also includes a custom handler to manage the application's data. A class encapsulates the most common tasks so they can be reused in other applications, with the main goal of separating the front-end logic from the back-end logic.
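To give a feel for the PyStarburst DataFrame API, here is a minimal sketch of connecting to a cluster and pushing a query down to Starburst Galaxy. The connection values and table name are placeholders (the real app reads its settings from the .env file), and dataModels.py is more elaborate:

import trino
from pystarburst import Session
from pystarburst.functions import col

# Placeholder connection details; in the app these come from the .env file
session = Session.builder.configs({
    "host": "free-cluster-demo.galaxy.starburst.io",
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("user@demo.com/accountadmin", "password"),
}).create()

# DataFrame operations are translated to SQL and executed by Starburst Galaxy
customers = session.table("tpch.sf1.customer")  # Illustrative sample table
(customers
    .filter(col("acctbal") > 1000)
    .select("name", "mktsegment", "acctbal")
    .show())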

OpenAI

OpenAI's ChatGPT is used to add natural language analysis to the application, providing an AI-driven interface to your data app.

Review the mlModels.py file for more information on how ChatGPT will be used.
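As a rough sketch of the kind of call involved (mlModels.py may use a different model or an older version of the OpenAI client), a natural-language question can be sent to ChatGPT like this; the prompt text is purely illustrative:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # Key loaded from the .env file

# Illustrative prompt; the real prompts are defined in mlModels.py
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful data analyst."},
        {"role": "user", "content": "Which states have the most high-risk customers?"},
    ],
)

print(response.choices[0].message.content)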

Background

Time to get started in Starburst Galaxy. This section walks you through the process of locating your Starburst Galaxy cluster user and host URL.

Later in this tutorial, you will edit the .env file to include your Starburst Galaxy cluster user and host URL. You will also add your OpenAI API key.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, create one first.

Step 2: Set your role

Starburst Galaxy separates users by role. Your current role is listed in the top right-hand corner of the screen.

Setting up a data app with Starburst Galaxy will require access to a role with appropriate privileges. Today, you'll be using the accountadmin role.

Step 3: Record cluster connection information

The cluster connection information can be found in the Clusters section of Starburst Galaxy. This tutorial uses the built-in free-cluster.

Background

Next, you'll need an OpenAI secret key to integrate your application with ChatGPT. This section outlines the steps needed to generate a new key.

Step 1: Sign into OpenAI

If you don't already have an OpenAI account, you can easily sign up for a free one.

Step 2: Generate new API key

OpenAI lets you generate an API key, which will allow you to connect ChatGPT functionality to your data app.

Step 3: Save key

It's time to give your key a name and save it in a safe place. We recommend using a password vault.

Background

You've gathered all the information that you need. Now it's time to set your environment variables in the .env settings file. To do this, you'll edit the file and add the values that you just recorded.

Step 1: Edit .env file

You can use your preferred text editor to edit the file. These instructions show how to use the nano text editor.

nano .env
# This file is a template for the .env file that should be created in the root of the project
# It contains the settings used by our application using Python's dotenv package
HOST="free-cluster-demo.galaxy.starburst.io" # Replace with your cluster's hostname
SB_USER="user@demo.com/accountadmin" # Replace with your Starburst Galaxy username (email) and role name (accountadmin)
OPENAI_API_KEY="sk-addyourkeyhere" # Obtain an OpenAI API key from https://platform.openai.com/signup and replace this value (it will look like sk-****)
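For reference, the application can then read these settings with Python's dotenv package along the following lines (a sketch of the general pattern, not necessarily the exact code in the repo):

import os
from dotenv import load_dotenv

load_dotenv()  # Reads the .env file in the current directory into environment variables

host = os.getenv("HOST")
user = os.getenv("SB_USER")
openai_api_key = os.getenv("OPENAI_API_KEY")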

Background

A Python virtual environment is a self-contained directory containing both a specific Python interpreter and its associated libraries and scripts. This allows you to work on a Python project with its own set of dependencies, isolated from the system-wide Python interpreter and from other projects.

Virtual environments are particularly useful for managing dependencies and ensuring that your project runs correctly. They also make it easier to share and reproduce your project's environment, as you can create a requirements.txt file listing all dependencies. This can be used to recreate the environment on another machine.

Step 1: Setup Python Virtual Environment

It's time to set up the Python virtual environment. To do this, begin by navigating to the pystarburst-examples/apps/gradio/customer_360_ml directory.

Instructions for macOS & Linux users

cd pystarburst-examples/apps/gradio/customer_360_ml
python3 -m venv .venv
. .venv/bin/activate

Instructions for Windows users

cd pystarburst-examples/apps/gradio/customer_360_ml
python3.exe -m venv .venv
# Windows command prompt
.venv\Scripts\activate.bat
# Windows PowerShell
.venv\Scripts\Activate.ps1

Step 2: Install dependencies

Now it's time to install the dependencies listed in the requirements.txt file. This will install all of the packages needed to run the app.

Unlike the previous step, this command is the same regardless of operating system.

pip install -r requirements.txt

Background

You've completed all the setup requirements. Now it's time to see the customer data application in action!

Step 1: Open data application

At this point, you should still be in your Python virtual environment.

python app.py

Note: You will be prompted to sign into your Starburst Galaxy account first.

Step 2: Use data app

The data app's primary purpose is to slice data and display it through a BI-like interface. It also lets you explore customer segmentation by state and risk appetite, the main objective in building the app.

Step 3: Explore natural language feature

The app's integration with ChatGPT allows you to ask questions about the data set in plain English. A sample question has already been included with the app.

Tutorial complete

Congratulations! You have reached the end of this tutorial, and have successfully built and deployed a customer analysis application using Starburst Galaxy, PyStarburst, and ChatGPT.

Next steps

Want to see more Starburst-powered data applications in action? See how Vectra is paving the way in cybersecurity with their AI-driven threat detection and prevention platform, powered by Starburst Galaxy.

Other Tutorials

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
