Clickstream Analytics with Starburst Galaxy Data Ingestion and AIDA

Getting from Kafka to insights faster with the Starburst AI Data Assistant (AIDA)


At Starburst, we leverage Kafka topics for stream event processing, and one of the ways we use them is to track user events. Like most engineering teams, we want to turn that stream into something useful without signing up to maintain yet another data pipeline. This post is the story of how we built that functionality using Starburst Galaxy’s data ingestion feature and AIDA, Galaxy’s conversational AI interface.

This story also comes with a twist. 

The data we’re analyzing is usage telemetry pertaining to Galaxy’s data ingestion feature itself. So yes, we’re using data ingestion to understand how people use data ingestion. It’s as meta as it sounds, and has been a really fun project to work on!

Understanding how users adopt data ingestion

Setting up Kafka ingestion in Galaxy is a straightforward, guided process. But as the team was building this feature, we had a set of questions the setup wizard couldn’t answer. 

These included: 

Feature adoption

The guiding question here was: how quickly are users discovering and trying data ingestion after it becomes available to them?

Usage friction

For this, we considered the setup experience from the user's perspective. Where in the setup flow do users get stuck or give up entirely?

Common errors

Next, we asked what goes wrong most often and whether there is something we can do in the UX to prevent it.

Time-to-value

Finally, we considered how long it actually takes someone to go from first discovering the feature to having a successfully running live table.

These are the kinds of questions that product and engineering teams are always asking. And if you’ve ever tried to answer them with static dashboards, you know how it goes. By the time you’ve built the dashboard that answers today’s question, the team has already moved on to three new ones.

What we track (and what we don’t)

Our application emits tracking events to a Kafka topic. We used Galaxy data ingestion to stream those events into a Starburst Galaxy live table, which is essentially a managed, continuously updated Iceberg table backed by Amazon S3.

Here’s what the events capture:

Ingestion source CRUD events

These include creating, viewing, updating, and deleting ingestion source configurations.

Live table CRUD events

These events track the same lifecycle operations for live tables themselves.

Verify events

These capture test-connection-style actions, where a user validates their ingestion source or live table configuration before committing to it.

Partitioning and sorting configuration

This metric tracks whether users apply custom partition columns or sort orders to their tables.
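For context, here is what a custom partition spec and sort order look like on an Iceberg table in Trino SQL. The table and column names are illustrative, not our actual schema:

CREATE TABLE events_live (
    time timestamp(6),
    name varchar,
    user_id varchar
)
WITH (
    partitioning = ARRAY['day(time)'],
    sorted_by = ARRAY['user_id']
);

Whether a user supplies properties like these, rather than accepting the defaults, is a useful signal of how deeply they are engaging with the feature.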

A note on user identity

It’s worth noting that one thing we deliberately don’t track is a user’s identity. Every event carries an anonymized user ID, so we can analyze behavior patterns and cohorts without ever knowing which person took which action. Privacy by design. It’s important to us to learn from usage patterns without compromising anyone’s identity.

On the Iceberg side, data first lands in a raw table, which directly represents the Kafka messages. From there, a transform table reshapes things into a query-friendly schema. This step flattens nested fields, casts types, and filters out unnecessary columns. Once that’s in place, the data is ready to query.
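As a sketch, a transform of that shape might look like the following Trino SQL. The table and JSON field names here are hypothetical:

CREATE TABLE usage_events AS
SELECT
    -- cast the Kafka event timestamp (epoch millis) to a proper timestamp
    CAST(from_unixtime(raw.event_timestamp / 1000) AS timestamp(6)) AS time,
    -- flatten the nested JSON payload into top-level columns
    json_extract_scalar(raw.payload, '$.event.name') AS name,
    json_extract_scalar(raw.payload, '$.event.anonymous_user_id') AS user_id,
    json_extract_scalar(raw.payload, '$.event.error_message') AS error_message
FROM usage_events_raw raw
-- filter out rows that aren't well-formed events
WHERE json_extract_scalar(raw.payload, '$.event.name') IS NOT NULL;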

Here’s a quick final check to confirm events are flowing:

SELECT *
FROM usage_events
WHERE name LIKE '%ingestion%'
ORDER BY time DESC
LIMIT 10;

Events show up within minutes of being produced to Kafka. No batch jobs to schedule, no orchestrator to keep an eye on – it just works.

AIDA provides conversational analytics, not dashboards

AIDA is Galaxy’s AI-powered data agent. You point it at a catalog, ask a question in plain English, and it generates SQL, runs it, and hands you the results within the flow of a conversation.

If you’ve ever needed to answer a product question the traditional way, you know the drill:

  1. Write some SQL (or find someone who can)
  2. Run it
  3. Export the results
  4. Maybe build a chart in a BI tool
  5. Share it in Slack
  6. Get a follow-up question that requires a slightly different query, and repeat the whole cycle. 

With AIDA, you skip all of that and just ask the question using natural language.

Example: AIDA in action

Let’s walk through a real example, using prompts we actually ran against our data ingestion telemetry.

Prompt: “What are the top 10 most common errors during ingestion source verification this month?”

[Image: common errors during ingestion source verification]

Result: The image below shows the results. AIDA generated the SQL, ran it, and used the available schema context to return a ranked list of the most common verification errors.

[Image: Starburst AIDA's results when querying common errors during ingestion source verification]
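Behind the scenes, AIDA writes and runs the SQL for you. For a prompt like this one, the generated query would be along these lines. This is a sketch; the real query, and column names such as error_message, depend on the schema AIDA inspects:

SELECT
    error_message,
    count(*) AS occurrences
FROM usage_events
WHERE name LIKE '%ingestion_source_verify%'
  AND error_message IS NOT NULL
  -- restrict to the current calendar month
  AND time >= date_trunc('month', current_date)
GROUP BY error_message
ORDER BY occurrences DESC
LIMIT 10;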

Digging deeper with AIDA 

Next, we asked a follow-up question that built on the previous results and dug deeper. AIDA handles this kind of depth naturally, letting users follow a hypothesis or line of inquiry.

Prompt: “I want to understand friction points when users try to create ingestion sources and live tables.”

[Image: Starburst AIDA prompted about friction points when users create ingestion sources and live tables]

Result: AIDA returns information on the friction points as asked, including relevant details about ingestion sources and live tables.

[Image: Starburst AIDA's results for friction points when users create ingestion sources and live tables]

Changing the direction of questioning 

Sometimes one answer begets an entirely different question. To help with this, AIDA allows users to change direction with their questions. Let’s look at the following example. 

Prompt: “How many users have customized table partitioning or sort order?”

[Image: AIDA prompt asking how many users have customized table partitioning or sort order]

Result: AIDA returns the breakdown directly. Because it has access to all of the context it needs, it can surface results like this far more easily than conventional methods.

[Image: AIDA result showing how many users have customized table partitioning or sort order]

Prompt: “Are there any correlations between the Kafka source errors and the authentication type (such as SASL/PLAIN or SASL/SCRAM)?”
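A correlation question like this one typically turns into a grouped failure-rate query. A hand-written equivalent might look like this; again, column names such as auth_type and error_message are assumptions about the schema:

SELECT
    auth_type,
    count(*) AS verify_attempts,
    -- count only the attempts that recorded an error
    count_if(error_message IS NOT NULL) AS failures,
    round(100.0 * count_if(error_message IS NOT NULL) / count(*), 1) AS failure_pct
FROM usage_events
WHERE name LIKE '%ingestion_source_verify%'
GROUP BY auth_type
ORDER BY failure_pct DESC;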

How AIDA lets you move faster and more dynamically using conversation

Getting each of these answers would have been a real pain in the neck in the old world. I’ll be honest, I don’t write SQL often enough to remember syntax off the top of my head, so the process usually involves a bunch of searching through the Trino SQL reference, followed by trial-and-error until the query works. 

With AIDA, the loop from question to answer is measured in seconds, and I can immediately iterate and narrow things down with follow-up questions like “now break that down by month” or “exclude Starburst internal accounts,” without having to start over. The difference in speed and scale is significant.

How Starburst leverages this power using a meta loop

Now for the meta part. There’s something genuinely satisfying about using your own product to understand itself. We built data ingestion so that users could stream Kafka data into a high-performance Iceberg data lake without managing infrastructure. Then we used that exact same feature to stream our own usage telemetry into Iceberg. And then we used AIDA to ask questions about the data. These questions directly informed the next round of improvements to the ingestion feature itself.

How the meta loop works in practice

Our meta loop surfaced things we wouldn’t have thought to include in dashboards. 

Here’s a good example. 

  1. While poking around in AIDA, we noticed that a surprising number of users needed 10+ verification attempts to successfully create a live table. 
  2. A few follow-up questions helped us dig in, and it turned out the verify step wasn’t giving clear enough feedback on Kafka connection failures.
  3. Users couldn’t tell whether it was a network connectivity issue or incorrect credentials, so they just kept retrying. 
  4. That insight led directly to a UX improvement, and the whole thing started with a casual question in a chat window, not a pre-planned dashboard panel.

Each of these insights is immensely valuable, and we’re already working to implement iterations and improvements based on the results. 

AIDA’s next step is visual, and that has big implications for dashboards

There’s something else. AIDA already eliminates the need to write SQL or configure dashboards for exploratory analysis. Today, results come back as text. 

That’s about to change.

AIDA visualization support

Visualization support is in active development. The idea here is that you’ll soon be able to ask a question and get a chart back, not just numbers and text. When a conversational AI can both query and visualize, it collapses the following traditional BI tasks into a single conversational thread:

  • Write a query
  • Build a chart
  • Tweak filters
  • Share a dashboard
  • Field requests for changes

That’s a pretty significant shift.

What AIDA means for dashboards

AIDA is already disrupting dashboard workloads, and that trend is only going to continue. Now, this doesn’t mean dashboards disappear overnight. Recurring KPIs and shared team views still benefit from a pinned dashboard that everyone can glance at. 

But the balance is shifting. 

The exploratory, iterative, “I just have a quick question” kind of work (which is most of what engineers and PMs actually do day to day) is moving to conversational interfaces. BI tools become the static reporting layer. The AI interface becomes where the actual thinking happens.

How we used AIDA to improve Starburst itself

In summary, we ingested a stream of usage-tracking events into Kafka, pointed Galaxy data ingestion at it, and had a queryable Iceberg table in minutes. No infrastructure to manage, no jobs to monitor. Then we used AIDA to ask questions about that data in plain English and got answers immediately – answers that directly shaped the product we were building.

If you’re sitting on event streams and find yourself spending more time building pipelines and dashboards than analyzing your data, this is worth a look. The Galaxy data ingestion docs are a good place to start.

Excited about AIDA? 

We are too. We have a lot more coming, so stay tuned. 

We’re working on a follow-up post covering best practices for using conversational AI agents for data exploration. It will include insights into prompt strategies, common pitfalls, building trust in the answers, and how to get the most out of tools like AIDA.

 
