Last Updated: 2024-08-14
PyStarburst is a library that brings Python DataFrames to Starburst. It is designed to make building complex data pipelines easy and to allow Python developers to easily migrate their existing PySpark and Snowpark workloads to Starburst.
This tutorial will walk you through setting up PyStarburst on your local machine and running some basic PyStarburst code against a Starburst Galaxy cluster.
Once you've completed this tutorial, you will be able to:
As the name PyStarburst suggests, you need Python installed. You'll also need pip
, the Python package manager. These instructions will use brew
for installation.
brew install python
python3 –version
pip
pip
and check the installation:python3 -m ensurepip
python3 -m pip install –upgrade pip
pip –version
Now that you've got pip
installed, you can use it to install PyStarburst.
pip install pystarburst
Starburst Galaxy provides a connection file for easy connection to your preferred cluster. All you need to do is download the file from your Starburst Galaxy account and edit the credentials to make it unique to you.
tpch
catalog, as we will be using that catalog throughout this tutorial.main.py
will be downloaded.The file you just downloaded has your user and host information prefilled. You just have to supply the password associated with your Starburst Galaxy account. You'll also add some code to import required libraries and run a simple command to show the tpch.tiny.region
table.
main.py
file you downloaded.
with your Starburst Galaxy account password.from pystarburst import functions as f
from pystarburst.functions import col
session.table("tpch.tiny.region").show()
It's time to test the boilerplate code in the connection file. You can execute this code right in your terminal window.
python3 main.py
You just used the API to run a SELECT
statement on the tpch.tiny.region
table. In fact, if you take a look at Query insights in your Starburst Galaxy account, you'll see that the query was run.
Thus far you've connected to your Starburst Galaxy cluster and used the PyStarburst API to run a simple SELECT * on sample tpch
data. In this section, you'll continue exploring the API library and become familiar with some of the commands available to you. Please note that all of the code you write will go in the main.py
file.
You already saw how to use the show()
function to select a table in the last section of this tutorial. This time you're going to also use the table()
function to first grab the tpch.tiny.customer
table before using the show()
function to display it. Note: if no argument is supplied to the show()
function, it defaults to listing ten rows.
main.py
file, then save.custDF = session.table("tpch.tiny.customer")
custDF.show()
python3 main.py
As promised, we can use the select()
method to limit the number of columns returned from the DataFrame. There is a compensatory drop()
function that would be better if we wanted to keep most of the columns and only remove a few.
projectedDF = custDF.select(custDF.name, custDF.acctbal, custDF.nationkey)
projectedDF.show()
You can use the filter()
function to filter the rows returned. In our case, we will use it to return the customer records with the highest account balance values.
show()
function.filteredDF = projectedDF.filter(projectedDF.acctbal > 9900.0)
filteredDF.show(100)
Later, we are going to join the customer
records to the nation
table to get the name of the country, not just a key value for it. In this step, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.
nationDF = session.table("tpch.tiny.nation") \
.drop("regionkey", "comment") \
.rename("name", "nation_name") \
.rename("nationkey", "n_nationkey")
nationDF.show()
rename()
function simply changes the column name.Now we can use the join()
function to join two DataFrames using their nationkey
values.
joinedDF = filteredDF.join(nationDF, filteredDF.nationkey == nationDF.n_nationkey)
joinedDF.show()
join()
function did not remove unwanted columns. We have all columns from both DataFrames.We can use the drop()
function again to clean up the unwanted columns in our join.
projectedJoinDF = joinedDF.drop("nationkey").drop("n_nationkey")
projectedJoinDF.show()
Let's use the sort()
method to sort the projectedJoinDF
by account balance.
orderedDF = projectedJoinDF.sort(col("acctbal"), ascending=False)
orderedDF.show()
We've created multiple DataFrame objects in this tutorial for educational purposes. However, in practice most DataFrame API programmers chain many methods together. For example, the following code would produce the same result as the steps in the previous section.
nationDF = session.table("tpch.tiny.nation") \
.drop("regionkey", "comment") \
.rename("name", "nation_name") \
.rename("nationkey", "n_nationkey")
apiSQL = session.table("tpch.tiny.customer") \
.select("name", "acctbal", "nationkey") \
.filter(col("acctbal") > 9900.0) \
.join(nationDF, col("nationkey") == nationDF.n_nationkey) \
.drop("nationkey").drop("n_nationkey") \
.sort(col("acctbal"), ascending=False)
apiSQL.show()
There is a lot more going on with the PyStarburst implementation including the lazy execution model that the DataFrame API is known for. In a nutshell, this simply means that the program waits until it absolutely needs to run some code on the Starburst Galaxy engine.
If we replace all of the code we have written so far with the code above, only a single SQL statement will be sent to Starburst Galaxy — and again, you can find it in the Query insights page.
main.py
file, replacing the previous code that was written in the last section of this tutorial. Do not delete the initial import and connection code.Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've learned a bit about how to use PyStarburst, we encourage you to explore the API documentation and write your own code.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!