
Starburst & Spark

Complementary processing engines
  • Lester Martin

    Developer Advocate

    Starburst


The best answer for Apache Spark OR Starburst? It’s Spark AND Starburst!

These two leading processing engines are complementary, not mutually exclusive, technologies. The combination of both offers a multitude of options for all aspects of data workloads, solutions, and applications. In my previous blog post, What is Spark?, I outlined how Spark and Starburst can be used together to tackle the domains of data ingestion, data transformations, and data products. 

This article expands on this topic. I will show how the combination of Spark and Starburst can be used for Machine Learning (ML) and Artificial Intelligence (AI) workflows.

What does Spark bring to the table?

Let’s start at the beginning. Why use Apache Spark at all? The answer is simple: Spark offers a different, complementary foundation for the unified data lakehouse architecture.

Python, not SQL

As discussed in my last article, Spark takes a different approach from Starburst to comparable workloads. Specifically, Spark workloads center on developing in Python, in contrast with the SQL-based approach that Starburst and Trino leverage. This is huge for AI efforts, as the majority of tools in this space are exposed via a Python API.

Dedicated workload-oriented frameworks

Spark has mature and sophisticated frameworks in the real-time streaming and machine learning domains. These frameworks increase Spark’s usefulness, especially when they address specific use cases that matter to your organization.

For example, Spark’s Structured Streaming offers a complex event processing engine that goes beyond Starburst’s standard data ingest tools. Specifically, it allows for continuously updating aggregations over sliding event-time windows, and it supports a robust set of join operations, including real-time joins across multiple streaming inputs. Users can benefit from whichever approach makes the most sense for them.

Additionally, Spark’s machine learning library (MLlib) is a de facto industry standard for running machine learning algorithms at scale in a distributed manner. It aims to make practical machine learning scalable and easy. At a high level, it provides tools such as ML algorithms, featurization, pipelines, persistence, and utilities.

Addressing GenAI with Starburst and Spark

These capabilities have significant implications for AI workflows. As your enterprise continues to grow, it becomes increasingly essential to assemble an AI data strategy. Any strategy necessarily depends on your overall tech stack and will be especially influenced by how you ingest, transform, store, and retrieve your data.

How you do this depends on the type of data that you’re using. Let’s look at a few examples. 

Structured data

Structured data is usually represented as multiple records in a single batch-oriented file. It also arises from single records in a messaging platform such as Apache Kafka. This data is represented via a schema that breaks down each record into a well-known set of attributes, as you see in the following example presented in the CSV file format.

Image depicting structured data before it has been broken into a schema.

Once ingested, this data is easily referenceable as it is now represented in a classic table.

Image depicting structured data after it has been ingested.
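A quick Python sketch of that idea: the CSV header supplies the schema, and each row becomes a record with a well-known set of attributes. The file contents below are invented for the example.

```python
import csv
import io

# Invented CSV payload standing in for a batch file or a Kafka message body.
raw = """order_id,customer,amount
1001,acme,250.00
1002,globex,75.50
"""

# DictReader applies the header row as a schema over each record.
records = list(csv.DictReader(io.StringIO(raw)))
for rec in records:
    print(rec["order_id"], rec["customer"], rec["amount"])
```

Once the schema is applied, each record maps cleanly onto a row in a classic table, which is why SQL handles this kind of input so naturally.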

Fortunately, as GenAI drives the adoption of open data architecture, the traditional data lake activities remain as valid now as they have always been. Starburst’s SQL-based approach continues to fulfill the requirements for this type of input data. 

Unstructured data 

Unstructured documents are often represented in application-specific files created by document editors, spreadsheet programs, and presentation builders. Examples include HTML and PDF documents. The following image shows an excerpt from an SEC 10-K filing document.

Image depicting unstructured data before ingestion.

Preparing unstructured data requires transformation activities, such as parsing and chunking, that are not well-suited for SQL. Additionally, the chunks are used to create embeddings that need to be persisted in a vector store. Understanding What Matters for LLM Ingestion and Preprocessing provides additional details of these requirements. 
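For intuition, here is a minimal sketch of fixed-size chunking with overlap, the kind of step that is awkward in SQL but natural in Python. The chunk sizes are arbitrary, and real pipelines typically chunk on semantic boundaries (sections, paragraphs) rather than raw character counts.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so that
    context spanning a chunk boundary appears in both neighbors."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "Item 1A. Risk Factors. " * 40  # stand-in for parsed 10-K text
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))
```

Each chunk then becomes the unit that gets embedded and persisted, so the chunking choice directly shapes retrieval quality later on.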

The following image helps visualize how these embeddings are stored in a vector persistence store. It represents how similar ideas, concepts, and text represented in the chunks are naturally grouped and clustered together.

Image depicting unstructured data after ingestion.
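To see why similar chunks end up grouped together in a vector store, here is a toy cosine-similarity check on hand-made vectors. Real embeddings come from a model and have hundreds of dimensions; these three-dimensional vectors are purely illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: two "revenue" chunks and one "litigation" chunk.
revenue_q1 = [0.9, 0.1, 0.0]
revenue_q2 = [0.8, 0.2, 0.1]
litigation = [0.0, 0.1, 0.9]

print(cosine_similarity(revenue_q1, revenue_q2))  # high
print(cosine_similarity(revenue_q1, litigation))  # low
```

Nearest-neighbor search in a vector store is essentially this comparison applied at scale, which is what makes the clustering in the image above useful for retrieval.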

Transformed data from unstructured documents is not your traditional tabular data. Spark’s Python-based approach enables this type of transformation processing. It also directly integrates with libraries, such as Unstructured.IO, that focus on preparing your documents for GenAI applications.

Furthermore, instead of storing these embeddings in an isolated vector database, Starburst recommends storing your AI data lakeside.

Using the right tool for the job

Ultimately, all of this comes down to using the right tool for the right job, something that should be familiar to anyone working in the data industry. In this specific context, it’s also important to use the right tool for the overall process. 

Are there any general guidelines? Absolutely! As the following image shows, Starburst and Spark both offer compelling solutions for ELT/ETL activities throughout the medallion architecture. Beyond that, Spark provides additional tooling in the ML/AI space, while Starburst shines with interactive querying and data products tooling.

Image depicting the Starburst medallion data architecture with Land, Structure, and Consume zones.

How Spark helps Starburst + AI

With the announcement of our new Starburst AI features, the following high-level steps can be implemented with the combined technology stack of Spark and Starburst to process and leverage unstructured data in an LLM-oriented application.

  1. Extraction and ingestion: Create Spark jobs to use CDC principles to determine when unstructured documents are created or modified.
  2. Transformations: Couple Spark with tools such as Unstructured.IO to parse, clean, and chunk document types such as Word and PDF.
  3. Embeddings: Execute Starburst SQL to generate embeddings from the text chunks and to store them lakeside in Apache Iceberg tables.
  4. GenAI application: Execute AI-aware SQL in Starburst to retrieve additional context and to augment a user request into an LLM.
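The four steps above can be sketched end to end in Python. Every function here is a hypothetical stand-in (toy CDC via file modification times, a fake hash-based embedding, an in-memory list as the "vector store") purely to show how the stages hand off to one another; in a real deployment, Spark runs steps 1–2 and Starburst SQL handles steps 3–4 against Iceberg tables.

```python
import hashlib
import os

def changed_documents(paths: list[str], since: float) -> list[str]:
    """Step 1 (toy CDC): pick files created or modified after a checkpoint."""
    return [p for p in paths if os.path.getmtime(p) > since]

def parse_and_chunk(text: str, size: int = 100) -> list[str]:
    """Step 2: stand-in for parsing/cleaning/chunking (e.g. via Unstructured.IO)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk: str) -> list[float]:
    """Step 3: fake embedding -- a real pipeline would call a model, and
    Starburst would persist the vectors lakeside in Iceberg tables."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

# Stand-in for lakeside vector storage; step 4 would retrieve from here
# to augment a user request before it reaches the LLM.
vector_store: list[tuple[str, list[float]]] = []

def ingest(path: str) -> None:
    """Run one document through chunking and embedding into the store."""
    with open(path) as f:
        for chunk in parse_and_chunk(f.read()):
            vector_store.append((chunk, embed(chunk)))
```

The handoff points are the interesting part: Spark owns the file-level change detection and transformation, while the embedding and retrieval stages operate on the chunked output it produces.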

Using Starburst and Spark together

Starburst Enterprise and Starburst Galaxy users can leverage any implementation based on Apache Spark that they choose. These technologies are better together, especially when addressing ML and GenAI requirements.