Saving pystarburst dataframe as parquet

Hi there,
Would love to understand if there is a simple way to save a PyStarburst DataFrame as a Parquet file, and also how best to push this into an AWS S3 bucket. I see the code in the API documentation, but would like to see a full working example from the community.

Also, can we save a PyStarburst DataFrame as a Parquet file in our local working space in a Jupyter notebook, without having to pass it through an S3 bucket?

Well, the very simplest way to turn a PyStarburst DF into Parquet file(s) would be with the save_as_table() function described at Dataframe write functions — PyStarburst.

To test this out, I first spun up Jupyter from the GitHub - starburstdata/pystarburst-examples project, then opened the tpch.ipynb notebook.

Screenshot 2025-03-17 at 11.41.42 AM

I used my Starburst Galaxy account details & credentials, but any Starburst cluster will do, and I ran all the cells up to, and including, this one.

Screenshot 2025-03-17 at 11.43.57 AM

Then I created a new cell with the code below, writing into a catalog.schema I already had set up.

# save the existing DF into a new table (note: the "type": "hive" property isn't needed on SEP)
tli.write.save_as_table("mycloud.messinround.asparquet",
                        mode="overwrite",
                        table_properties={"format": "parquet", "type": "hive"})

I verified it was created in the Starburst UI (I ran some queries there, too).

Screenshot 2025-03-17 at 11.46.29 AM

I then ran the following in Jupyter just to make sure it was accessible from the API.

session.table("mycloud.messinround.asparquet").collect()

All of that did create a Parquet file for me in S3, as you can see.

Screenshot 2025-03-17 at 11.29.50 AM

It was so small it actually just created one “part file”, but if the DF were much larger it would likely have been spread across multiple files, since that’s just how distributed writes work. And that’s a good thing, too, since it allows the files to be written in parallel.

Programmatically, you could probably do a few things now to get this into a local Parquet file. One approach (sketched just below this list) would be to:

  1. Convert the PyStarburst DF into a Pandas DF using to_pandas() - Dataframe — PyStarburst
  2. Convert the Pandas DF to a file with something like pandas.DataFrame.to_parquet — pandas 2.2.3 documentation (or maybe fastparquet · PyPI)
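
Something like this rough sketch, for example (assuming the same tli DF from the notebook, that pyarrow or fastparquet is installed alongside Pandas, and that the local filename is just an example):

# pull the PyStarburst DF down into the notebook's memory as a Pandas DF
local_df = tli.to_pandas()

# write it out as a single Parquet file in the Jupyter working directory
local_df.to_parquet("lineitem_local.parquet", index=False)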

Of course, this is going to be a poor fit if the dataset is too big, since to_pandas() pulls the whole result into the notebook’s memory. Again, if that’s the case, just create a new table and let Starburst/Trino write the Parquet files for you and then pull them from your bucket.
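
If you do go that route and then want the part files locally, here is one rough way to pull them down with boto3 (assuming boto3 and AWS credentials are already set up in the notebook; the bucket name, prefix, and local folder are placeholders for wherever your catalog actually writes):

import os
import boto3

# placeholders: point these at the bucket/prefix backing the table's storage location
bucket = "my-bucket"
prefix = "messinround/asparquet/"
local_dir = "asparquet_local"
os.makedirs(local_dir, exist_ok=True)

s3 = boto3.client("s3")
# download every part file Trino wrote under the table's location
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    key = obj["Key"]
    if not key.endswith("/"):
        s3.download_file(bucket, key, os.path.join(local_dir, os.path.basename(key)))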

Hope that helps!


Hi @lester,
I do apologise for the late reply, and thank you for this.

I was wondering if you have any working examples for dataframe.copy_into_location(), which I believe can also be used to write a Parquet file to an S3 bucket.

If I understand you correctly, there is no way to convert a PyStarburst DataFrame into a Parquet file and store it in my Jupyter workspace without converting it to a Pandas DataFrame first…

I believe the copy_into_location() function you are talking about belongs solely to Snowflake’s Snowpark API and, in fact, it does much the same thing: it writes to an object store. In their case, you save to a Snowflake ‘location’, which is essentially a bucket.

But you are right… you cannot directly save a PyStarburst DataFrame object as Parquet. My earlier suggestion (just do a CTAS and pull the files from S3 afterwards) is still my best recommendation.

If you want to try and hurt your brain, I guess you could use collect() and then loop through that collection of Row objects while building the Parquet files yourself. Of course, writing column-oriented file formats is some pretty low-level stuff. You could look at using PyArrow or FastParquet, like those called out in All About Parquet Part 08 - Reading and Writing Parquet Files in Python - DEV Community, but I think you’ll still need to convert the PyStarburst DF to a Pandas DF BEFORE you try to save it locally as a Parquet file.
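
For completeness, here is roughly what that lower-level PyArrow path might look like, still going through Pandas as noted above (a sketch only, again assuming the tli DF from earlier; the filename is just an example):

import pyarrow as pa
import pyarrow.parquet as pq

# still materialize the data via Pandas first, then hand it to Arrow
arrow_table = pa.Table.from_pandas(tli.to_pandas())

# write a single local Parquet file
pq.write_table(arrow_table, "lineitem_arrow.parquet")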

So, in a nutshell, this is going to be difficult with any distributed compute engine as the DFs being used are NOT actually holding data – they are holding logical footprints of how to do work once an action-oriented function is called.
