These libraries enable you to leverage the flexibility of Python while capitalizing on the scale and performance of the leading MPP SQL query engine. They also allow for more seamless integration into development practices such as version control, CI/CD, and unit testing.
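For example, once transformation logic lives in an ordinary Python function, it can be unit-tested in CI like any other code. A minimal sketch, using only the standard library (the function and data here are illustrative, not part of any Starburst API):

```python
from datetime import date

def orders_per_day(orders: list[dict]) -> dict[date, int]:
    """Count orders per day -- the kind of logic a DataFrame pipeline expresses."""
    counts: dict[date, int] = {}
    for order in orders:
        counts[order["day"]] = counts.get(order["day"], 0) + 1
    return counts

def test_orders_per_day():
    orders = [
        {"day": date(2023, 6, 1)},
        {"day": date(2023, 6, 1)},
        {"day": date(2023, 6, 2)},
    ]
    assert orders_per_day(orders) == {date(2023, 6, 1): 2, date(2023, 6, 2): 1}

test_orders_per_day()
```

A test like this runs the same way on a laptop and in a CI pipeline, which is much harder to achieve when the logic is embedded in SQL strings.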
Fragmented tools are slowing down innovation
Starburst was originally known as a fast, interactive SQL analytics accelerator: the company behind the leading MPP query engine, Trino. As our popularity grew, we built more functionality on top of the core engine, adding robust data governance, observability, and sharing capabilities, to name a few, and creating a holistic data lake and analytics platform.
Then, with the introduction of enhanced fault-tolerant execution (FTE) mode in Starburst Galaxy earlier this year, we expanded the engine's capabilities into the data transformation space. Fault-tolerant execution allows a cluster to retry a query, or individual parts of its processing, in the event of a failure without restarting the query from scratch. This is especially useful for the long-running queries typical of data transformation, such as batch processing and extract, transform, load (ETL) workloads.
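The idea behind fault-tolerant execution can be pictured with a toy scheduler that retries only the failed unit of work rather than restarting the whole job. This is a conceptual sketch, not Starburst's implementation:

```python
def run_with_task_retries(tasks, max_attempts=3):
    """Run each task; on failure, retry only that task, not the whole set."""
    results = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(task())
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up only after exhausting retries
    return results

# A task that fails once, then succeeds -- completed tasks are never re-run.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("worker lost")
    return "partition-ok"

print(run_with_task_retries([lambda: "ok", flaky]))  # ['ok', 'partition-ok']
```

The payoff is the same as in the toy: for a multi-hour ETL query, a single lost worker costs one retried task instead of the whole run.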
However, while SQL is the default tool in data analytics, not every data engineer, data scientist, or, increasingly, ML engineer wants to write SQL, nor does SQL naturally handle every data problem, especially complex transformation workloads.
This meant that, despite the new FTE processing mode, data engineers using Trino and Starburst still had to unload data through Starburst drivers and process it on the client side, or reach for external frameworks like PySpark, whenever SQL couldn't handle the complexity of their transformation workloads. This fragmented approach, and the need to maintain two separate engines for analytical and transformation workloads, adds cost, time, and complexity to the stack.
As the easy and open data lake analytics platform, we knew we had to remove that complexity for our users.
Unify your transformation and analytics workloads at the data source with DataFrame support in Starburst Galaxy
To that end, we are announcing support for two Python DataFrame libraries in Starburst Galaxy, PyStarburst and Ibis, that address these critical pain points and make it easier for organizations to run both fast, interactive queries and complex data pipelines with Galaxy.
The introduction of these libraries means that users no longer need to stand up two separate engines for their analytical and transformation workloads to get the performance and flexibility they need. Instead, they can leverage Starburst, powered by Trino, as their single source of compute, reducing cost, complexity, and time.
With these new capabilities, data teams can:
- Build complex transformation pipelines in Starburst that are easy for teams of engineers to maintain
- Combine, curate, normalize, and aggregate data without copying it from its source system, enabling a high-performance data lake analytics platform with less data engineering complexity
- Migrate existing production-grade Spark and Snowpark pipelines to Starburst with minimal code refactoring
- Build applications that process data in Starburst with software engineering best practices without moving data to the system where your application code runs
PyStarburst is our new library that brings Python DataFrames to Starburst. It is designed to make building complex data pipelines easy and to let Python developers migrate their existing PySpark and Snowpark workloads to Starburst with minimal effort.
With PyStarburst, developers can build queries using DataFrames right in their code, without having to create and pass along SQL strings. Our simple aggregation sample, for example, demonstrates transforming a non-standard date/time format and using the result in a filter. You can see the full sample here.
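To give a flavor of that logic, here is the same kind of transformation, parsing a non-standard date/time format and filtering on the parsed value, written in plain Python with the standard library. This is an illustrative sketch with made-up data; in PyStarburst the equivalent steps would be DataFrame operations that execute inside the engine:

```python
from datetime import datetime

rows = [
    {"order_id": 1, "placed_at": "01/Jun/2023 14:02:11"},
    {"order_id": 2, "placed_at": "15/Jul/2023 09:45:00"},
]

def parse_placed_at(raw: str) -> datetime:
    # Non-standard "day/Mon/year hour:min:sec" format -> datetime.
    return datetime.strptime(raw, "%d/%b/%Y %H:%M:%S")

# Transform the column, then filter on the parsed value.
june_orders = [
    row for row in rows
    if parse_placed_at(row["placed_at"]).month == 6
]
print([row["order_id"] for row in june_orders])  # [1]
```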
Under the hood, PyStarburst converts these operations into SQL that runs right inside Starburst, using the same high-performance, scalable engine you already know. All of this fits natively into the Galaxy ecosystem, including data products, state-of-the-art data governance such as attribute-based access control (ABAC), data observability, and much more.
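The translation step can be pictured with a toy builder that records DataFrame-style calls and emits a SQL string. This is purely illustrative of the pattern, not PyStarburst's actual compiler:

```python
class ToyFrame:
    """Records filter/select calls and compiles them into a SQL string."""

    def __init__(self, table: str):
        self.table = table
        self.predicates: list[str] = []
        self.columns: list[str] = ["*"]

    def filter(self, predicate: str) -> "ToyFrame":
        self.predicates.append(predicate)
        return self  # return self so calls can be chained

    def select(self, *columns: str) -> "ToyFrame":
        self.columns = list(columns)
        return self

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

df = ToyFrame("orders").filter("total > 100").select("order_id", "total")
print(df.to_sql())  # SELECT order_id, total FROM orders WHERE total > 100
```

Because the engine only ever sees SQL, the compiled query benefits from the same optimizer, governance, and scaling as a hand-written one.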
At Starburst, everything is built with architectural flexibility in mind, and we are interoperable with nearly any data environment. We wanted to extend that commitment to the Python ecosystem by supporting Ibis, a uniform Python DataFrame API that works across many execution backends.
Together, Ibis and Starburst Galaxy empower users to write portable Python code that executes on Starburst’s high-performance data lake analytics engine, operating on data from more than 50 supported sources.
Ibis decouples the execution engine from the Python DataFrame API, compiling down to SQL code for many backends – now including Starburst Galaxy. This gives the performance of Starburst Galaxy on disparate data sources with a portable API. With Ibis, you can develop your app locally and then easily switch to Starburst when you’re ready to scale.
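That decoupling looks like this in miniature: express the query once against an interface, then swap the backend that executes it. A conceptual sketch with stub backends (Ibis's real connection API differs):

```python
from typing import Protocol

class Backend(Protocol):
    def execute(self, sql: str) -> list[tuple]: ...

class LocalBackend:
    """Stand-in for a local engine used during development."""
    def execute(self, sql: str) -> list[tuple]:
        return [("local", sql)]

class GalaxyBackend:
    """Stand-in for Starburst Galaxy used at scale; connection details omitted."""
    def execute(self, sql: str) -> list[tuple]:
        return [("galaxy", sql)]

def daily_totals(backend: Backend) -> list[tuple]:
    # The query logic is written once; only the backend changes.
    return backend.execute("SELECT day, sum(total) FROM orders GROUP BY day")

print(daily_totals(LocalBackend())[0][0])   # local
print(daily_totals(GalaxyBackend())[0][0])  # galaxy
```

With Ibis, the "interface" is its DataFrame API and the "backends" are real engines, so the switch from local development to Starburst Galaxy is a one-line change to how you connect.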
Check out the full tutorial on getting started with Ibis and Starburst here.
You’ve seen some exciting things you can now do with Python in Starburst Galaxy. These features are in public preview and available via Partner Connect in Starburst Galaxy, so you can get started right away by creating your free account.
As part of this launch, we are also extending our partnership with Voltron Data, the company behind popular open-source Python projects like Ibis and Arrow. We are excited to partner together to reinforce Starburst’s commitment to the open-source community and help engineers build on a truly open data architecture.
Stay tuned in the coming months for more work on Voltron Data-backed projects like Apache Arrow, as well as other features to enable data engineers and scientists.