Saving pystarburst dataframe as parquet

Hi there,
Would love to understand if there is a simple way to save a PyStarburst DataFrame as a Parquet file, and also how best to push this into an AWS S3 bucket. I see the code in the API documentation, but would like to see a full working example from the community.

Also, can we save a PyStarburst DataFrame as a Parquet file in our local working space in a Jupyter notebook, without having to pass it through an S3 bucket?

Well, the very simplest way to turn a PyStarburst DF into Parquet file(s) would be with the save_as_table() function described at Dataframe write functions — PyStarburst.

To test this out, I first spun up Jupyter from the GitHub - starburstdata/pystarburst-examples project, then opened the tpch.ipynb notebook.

Screenshot 2025-03-17 at 11.41.42 AM

I used my Starburst Galaxy account details & credentials, but any Starburst cluster will do, and I ran all the cells up to, and including, this one.

Screenshot 2025-03-17 at 11.43.57 AM

Then I created a new cell with the code below, writing into a catalog.schema I already had set up.

# save the existing DF into a new table (note: the "type": "hive" property isn't needed on SEP)
tli.write.save_as_table("mycloud.messinround.asparquet",
                        mode="overwrite",
                        table_properties={"format": "parquet", "type": "hive"})

I verified it was created in the Starburst UI (I ran some queries there, too).

Screenshot 2025-03-17 at 11.46.29 AM

I then ran the following in Jupyter just to make sure it was accessible from the API.

session.table("mycloud.messinround.asparquet").collect()

All of that did create a Parquet file for me in S3, as you can see.

Screenshot 2025-03-17 at 11.29.50 AM

It was so small it actually just created one “part file”, but if the DF were much larger it would likely have been spread across multiple files, since that’s just how distributed writes work. And that’s a good thing, too, since it allows the files to be written in parallel.

Programmatically, you could probably do a few things now to get this into a local Parquet file. One approach (sketched just below this list) would be to:

  1. Convert the PyStarburst DF into a Pandas DF using to_pandas() - Dataframe — PyStarburst
  2. Convert the Pandas DF to a file with something like pandas.DataFrame.to_parquet — pandas 2.2.3 documentation (or maybe fastparquet · PyPI)
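
Something like this rough sketch, for example (assuming the same tli DF from the notebook, that pyarrow or fastparquet is installed alongside Pandas, and that the local filename is just an example):

# pull the PyStarburst DF down into the notebook's memory as a Pandas DF
local_df = tli.to_pandas()

# write it out as a single Parquet file in the Jupyter working directory
local_df.to_parquet("lineitem_local.parquet", index=False)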

Of course, this is going to be a poor fit if the dataset is too big, since to_pandas() pulls the whole result into the notebook’s memory. Again, if that’s the case, just create a new table and let Starburst/Trino write the Parquet files for you and then pull them from your bucket.
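
If you do go that route and then want the part files locally, here is one rough way to pull them down with boto3 (assuming boto3 and AWS credentials are already set up in the notebook; the bucket name, prefix, and local folder are placeholders for wherever your catalog actually writes):

import os
import boto3

# placeholders: point these at the bucket/prefix backing the table's storage location
bucket = "my-bucket"
prefix = "messinround/asparquet/"
local_dir = "asparquet_local"
os.makedirs(local_dir, exist_ok=True)

s3 = boto3.client("s3")
# download every part file Trino wrote under the table's location
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    key = obj["Key"]
    if not key.endswith("/"):
        s3.download_file(bucket, key, os.path.join(local_dir, os.path.basename(key)))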

Hope that helps!


Hi @lester,
I do apologise for the late reply, and thank you for this.

I was wondering if you have any working examples for dataframe.copy_into_location(), which I believe can also be used to write a Parquet file to an S3 bucket.

If I understand you correctly, there is no way to convert a PyStarburst DataFrame into a Parquet file and store it in my Jupyter workspace without converting it to a Pandas DataFrame first…

I believe the copy_into_location() function you are talking about belongs solely to Snowflake’s Snowpark API and, in fact, it does much the same thing: it writes to an object store. In their case, you save to a Snowflake ‘location’, which is essentially a bucket.

But you are right… you cannot directly save a PyStarburst DataFrame object as Parquet. My earlier suggestion (just do a CTAS and pull the files from S3 afterwards) is still my best recommendation.

If you want to try and hurt your brain, I guess you could use collect() and then loop through that collection of Row objects while building the Parquet files yourself. Of course, writing column-oriented file formats is some pretty low-level stuff. You could look at using PyArrow or FastParquet, like those called out in All About Parquet Part 08 - Reading and Writing Parquet Files in Python - DEV Community, but I think you’ll still need to convert the PyStarburst DF to a Pandas DF BEFORE you try to save it locally as a Parquet file.
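
For completeness, here is roughly what that lower-level PyArrow path might look like, still going through Pandas as noted above (a sketch only, again assuming the tli DF from earlier; the filename is just an example):

import pyarrow as pa
import pyarrow.parquet as pq

# still materialize the data via Pandas first, then hand it to Arrow
arrow_table = pa.Table.from_pandas(tli.to_pandas())

# write a single local Parquet file
pq.write_table(arrow_table, "lineitem_arrow.parquet")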

So, in a nutshell, this is going to be difficult with any distributed compute engine as the DFs being used are NOT actually holding data – they are holding logical footprints of how to do work once an action-oriented function is called.
