
Starburst Galaxy is designed to be the easiest and best way to deploy Trino in the cloud, and how we fulfill that mission is constantly evolving and expanding.

This blog post unpacks what’s new in Starburst Galaxy over the last quarter and how we are continuing to improve and enhance its feature offerings. In particular, it covers the constellation of features that make up the Managed Iceberg Pipelines feature set. Collectively, these features allow users to:

  1. Ingest data from multiple sources, including Kafka streams and Amazon S3
  2. Partition and maintain Iceberg tables more intelligently
  3. Observe and control workloads via Galaxy ingestion metrics

Together, these features handle both data ingestion and data maintenance, ensuring that Starburst Galaxy isn’t simply the easiest way to use Trino, but the easiest way to ingest data as well.

Managed Iceberg Pipelines continues to progress

Starburst Galaxy is designed to be the very best implementation of Trino available on the market, and that extends to data ingestion: getting data into your lakehouse should be just as easy as querying it.

Our Starburst Managed Ingest feature delivers a superior, streamlined experience for loading data directly into your Iceberg data lakehouse. Starburst Managed Ingest makes data loading, optimization, and preparation effortless, enabling you to move from raw data to high-performance insights in your Iceberg tables faster than ever before.

What does this mean in practice? We’re looking at two forms of data ingestion:

  • Streaming ingest
  • File ingest (via Amazon S3) 

Streaming ingest

This release strengthens Galaxy’s native streaming ingestion engine, enabling continuous ingestion of high-throughput event streams directly into Iceberg tables. The image below shows the operational overhead of a typical ingestion pipeline.

[Image: the complex toolchain needed to run a typical streaming ingestion workflow, contrasted with the simpler approach available when using Starburst.]

What kind of streaming can be ingested?

The system works by processing records as they arrive from Kafka, bypassing the latency and operational overhead of batch-oriented pipelines. Incoming data is parsed, validated, and written to Iceberg using append-optimized paths that preserve schema and snapshot integrity. 
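Conceptually, the consume-validate-append loop described above can be sketched in a few lines of Python. This is an illustrative simulation, not Galaxy's actual implementation: `validate`, `append_batch`, and the in-memory `table_snapshots` are hypothetical stand-ins for the managed service's internals.

```python
import json

def validate(record: dict, required: set) -> bool:
    """Reject records missing required fields before they reach the table."""
    return required.issubset(record)

def append_batch(table: list, batch: list) -> None:
    """Stand-in for an Iceberg append: one atomic commit per batch of records."""
    if batch:
        table.append(list(batch))  # each commit becomes one new snapshot

# Simulated stream of Kafka messages (JSON payloads).
messages = [
    '{"user": "a", "event": "click", "ts": 1}',
    '{"user": "b", "ts": 2}',                    # invalid: missing "event"
    '{"user": "c", "event": "view", "ts": 3}',
]

table_snapshots: list = []
batch = []
for raw in messages:
    record = json.loads(raw)
    if validate(record, {"user", "event", "ts"}):
        batch.append(record)
append_batch(table_snapshots, batch)

print(len(table_snapshots))     # 1 (one append commit)
print(len(table_snapshots[0]))  # 2 (two valid records landed)
```

The append-per-batch shape mirrors the append-optimized write path described above: each batch lands as one snapshot, which is what preserves snapshot integrity under sustained ingest.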

What are the benefits to users?

These enhancements improve end-to-end freshness for downstream analytics and AI workloads, supporting use cases such as regulatory audit trails, operational telemetry, and real-time customer features. The streaming engine is designed for high concurrency and sustained ingest rates while ensuring fault tolerance, backpressure handling, and predictable recovery semantics.


Want to know more about Starburst Galaxy Kafka streaming ingest? Check out this webinar.

File ingest 

File ingestion does for Amazon S3 what streaming ingestion does for Kafka. As with streaming ingest, Starburst Galaxy removes much of the operational friction traditionally associated with processing raw files into an Iceberg table.

[Image: the data architecture used by Starburst streaming and file ingestion. Streaming data from Kafka is ingested alongside file data from Amazon S3 into an Iceberg raw table, then automatically transformed into an Iceberg live table using Starburst.]

How does Starburst Galaxy file ingest work?

Galaxy can now ingest datasets directly from Amazon S3 prefixes and hydrate them into Iceberg tables without external ETL pipelines, AWS Glue workloads, or orchestration frameworks. Engineering teams no longer need to stitch together ad-hoc scripts to watch buckets, parse JSON, manage schema drift, or rebuild tables manually. Galaxy automatically detects new files, validates and structures the data, updates Iceberg metadata, and handles compaction, ensuring that tables remain fast and query-ready.
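The bucket-watching logic that Galaxy automates amounts to a diff between the current prefix listing and the set of already-processed keys. A hedged sketch (the function and variable names here are illustrative, not a Galaxy API):

```python
def new_files(listing: list, processed: set, prefix: str) -> list:
    """Return keys under the prefix that have not yet been ingested."""
    return sorted(k for k in listing if k.startswith(prefix) and k not in processed)

# State from the previous scan, plus the current S3 listing (simulated).
processed = {"raw/events/2024-01-01.json"}
listing = [
    "raw/events/2024-01-01.json",
    "raw/events/2024-01-02.json",
    "other/ignore.txt",
]

pending = new_files(listing, processed, "raw/events/")
print(pending)  # ['raw/events/2024-01-02.json']

# Only mark keys processed after a successful Iceberg commit,
# so a failed write is retried on the next scan.
processed.update(pending)
```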

These features can also be accessed using the Starburst Galaxy APIs.

What are the benefits to users?

This dramatically simplifies common workflows like log ingestion, clickstream capture, and scheduled vendor exports, turning basic Amazon S3 file drops into reliable, production-grade data pipelines. The result is faster access to cloud data, less pipeline maintenance, and a more scalable foundation for downstream analytics and AI.


Data maintenance

Iceberg performance depends on continuous upkeep. As tables grow, they accumulate metadata files, snapshots, and small data files created by streaming writes, merge-on-read operations, and schema evolution. Left unmanaged, these artifacts inflate planning time, increase storage costs, and eventually degrade query performance. Iceberg solves many of the problems that Hive had, but its design still requires routine maintenance to keep the lakehouse running efficiently.

At its core, Iceberg maintenance focuses on four operations:

Compaction

Iceberg’s write path generates many small files. This increases the number of files the engine must open and scan during planning. Compaction rewrites these into fewer, larger files, improving scan efficiency and reducing metadata load. It is the single most impactful operation for keeping large Iceberg tables fast.
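A rough back-of-the-envelope calculation shows why compaction matters. Assuming (purely for illustration) a 100 GiB table written as 4 MiB streaming files versus 128 MiB compacted files:

```python
def plan_file_count(total_bytes: int, file_bytes: int) -> int:
    """Files the engine must open and scan during planning."""
    return -(-total_bytes // file_bytes)  # ceiling division

GiB = 1024 ** 3
MiB = 1024 ** 2

total = 100 * GiB
before = plan_file_count(total, 4 * MiB)    # many small streaming writes
after = plan_file_count(total, 128 * MiB)   # compacted to a larger target size
print(before, after)  # 25600 800
```

The same data, but 32x fewer files to open, list, and track in manifests, which is where the planning-time and metadata savings come from.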

Snapshot expiration

Every write creates a new snapshot. Snapshots allow time travel and rollback, but they accumulate quickly. Expiring old snapshots reduces metadata bloat and frees storage tied to unneeded versions.
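A simplified model of snapshot expiration, assuming a time-based cutoff with a minimum number of retained snapshots (Iceberg's actual retention policy is richer than this sketch):

```python
from datetime import datetime, timedelta

def expire_snapshots(snapshots, older_than, keep_last=1):
    """Drop snapshots older than the cutoff, always retaining the newest few."""
    ordered = sorted(snapshots, key=lambda s: s["ts"], reverse=True)
    return [s for i, s in enumerate(ordered)
            if i < keep_last or s["ts"] >= older_than]

now = datetime(2024, 6, 1)
# Ten snapshots, one per day, newest first (id 0 is today's).
snaps = [{"id": i, "ts": now - timedelta(days=i)} for i in range(10)]

kept = expire_snapshots(snaps, older_than=now - timedelta(days=7))
print(sorted(s["id"] for s in kept))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Snapshots 8 and 9 fall outside the seven-day window and are expired, which is what frees the storage tied to their unreferenced data files.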

Orphan file cleanup

If a write fails or a process terminates unexpectedly, Iceberg may leave behind unreferenced files. These orphan files do not appear in any manifest but still occupy object storage. Cleaning them up keeps storage predictable and prevents long-term drift between table state and underlying files.
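Orphan detection is, at its core, a set difference between what object storage contains and what the table's manifests reference. An illustrative sketch with made-up file names:

```python
def find_orphans(storage_listing: set, referenced: set) -> set:
    """Files present in object storage but absent from every manifest."""
    return storage_listing - referenced

referenced = {"data/a.parquet", "data/b.parquet"}       # from table manifests
listing = {"data/a.parquet", "data/b.parquet",
           "data/tmp-failed-write.parquet"}              # from an S3 listing

print(find_orphans(listing, referenced))  # {'data/tmp-failed-write.parquet'}
```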

Profiling and statistics refresh

Query planners rely on table statistics to make decisions about pruning, partition selection, and join order. As data changes, these metrics become stale. Refreshing them improves planning accuracy and reduces unnecessary reads.
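The payoff of fresh statistics is file-level pruning. A minimal sketch of the min/max range check that accurate statistics enable (the stats values are illustrative):

```python
def can_skip(file_stats: dict, pred_min: int, pred_max: int) -> bool:
    """Planner skips a file when its min/max range cannot match the filter."""
    return file_stats["max"] < pred_min or file_stats["min"] > pred_max

stats = {"min": 100, "max": 200}   # refreshed column stats for one data file

print(can_skip(stats, 300, 400))   # True: file pruned entirely
print(can_skip(stats, 150, 160))   # False: file must be read
```

When statistics go stale, the ranges no longer reflect the data, and the planner either reads files it could have skipped or misestimates join sizes.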

How Starburst Galaxy operationalizes Iceberg maintenance

The Starburst Managed Ingest service goes beyond simple data loading by including a fully automated, live table maintenance service for your Iceberg tables. 

Once data is ingested, Galaxy continuously and autonomously performs all critical maintenance tasks, including intelligent compaction, orphan file removal, and snapshot expiration. This seamless, built-in automation eliminates the need to script, schedule, or manually track complex maintenance jobs, allowing your data engineering team to focus entirely on delivering analytic value. All of this ensures that your Iceberg tables remain performant, governed, and compliant.

User-defined partitioning

Partitioning is one of the most important levers for Iceberg performance, and Starburst Galaxy now gives data engineers direct control through user-defined partitioning. 

When configuring a Managed Iceberg Pipeline, you can explicitly define the partitioning scheme for your resulting Iceberg tables. This involves specifying the transformation functions (like day(), hour(), or bucket()) applied to your source columns. The goal is to select partitioning keys that precisely align with your most common query patterns, including date transformations, region identifiers, or hash-distributed IDs. This ensures that the Iceberg table layout is optimized for high-performance query filtering and data skipping.
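The transforms named above have simple definitions: `day()` and `hour()` count whole units since the Unix epoch, and `bucket()` hashes a value modulo the bucket count. The sketch below follows those definitions for the time transforms but substitutes CRC32 for the Murmur3 hash the Iceberg spec uses, so the bucket values are illustrative only:

```python
from datetime import datetime, timezone
import zlib

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day(ts: datetime) -> int:
    """day() transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

def hour(ts: datetime) -> int:
    """hour() transform: whole hours since the Unix epoch."""
    return int((ts - EPOCH).total_seconds() // 3600)

def bucket(n: int, value: str) -> int:
    """Shape of the bucket() transform: hash the value, mod n.
    (Iceberg specifies Murmur3; CRC32 here is a stand-in for illustration.)"""
    return zlib.crc32(value.encode()) % n

ts = datetime(2024, 6, 1, 15, 30, tzinfo=timezone.utc)
print(day(ts))               # 19875
print(hour(ts))              # 477015
print(bucket(16, "user-42")) # a stable value in [0, 16)
```

Because every row with the same partition value lands in the same file group, picking transforms that match your filters (e.g., `day(ts)` for daily dashboards) is what makes data skipping effective.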

By letting engineers choose the right partition strategy up front, Galaxy removes guesswork and keeps pipelines aligned with real workload behavior.

Operational observability and control enhancements

Starburst Galaxy doesn’t just ingest your data. It also makes that process visible to you by providing operational observability features. In this release, these features have been further augmented and extended to include the following enhancements. 

Metrics dashboards improvements

Starburst Galaxy now includes enhanced Metrics Dashboards that provide real-time and historical visibility into ingestion performance. Engineers can monitor throughput, latency, and error rates across Kafka and Amazon S3 pipelines directly within the product. 

These upgraded dashboards give teams clearer insight into data freshness and pipeline health, making it easier than ever to troubleshoot issues and ensure ingestion workloads are running reliably at scale.

Ingest Observability via OpenTelemetry

Starburst Galaxy now provides enhanced ingest observability through expanded OpenTelemetry support, giving teams unified metrics, traces, and event data across all ingestion services. This release makes it easier to plug Galaxy telemetry directly into existing monitoring platforms, including Amazon CloudWatch, for end-to-end visibility and alerting. Engineers can now diagnose ingestion issues faster, validate data freshness, and maintain reliable pipelines without relying on separate or custom instrumentation.

Reset Replay

This release provides enhancements to our Reset Replay feature. Reset Replay gives data engineers a practical way to resolve issues in streaming or incremental pipelines without resorting to manual backfills or ad hoc repair scripts.

What does Reset Replay do? 

The feature combines Iceberg’s snapshot capabilities with Galaxy’s managed ingestion engine. If data is parsed incorrectly, if upstream schemas change unexpectedly, or if transformation logic needs to be updated, engineers can reset a table to a previous snapshot and replay all upstream data through the updated logic. The system reprocesses the data exactly once, preserving correctness while avoiding duplication or data loss.
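The reset-and-replay idea can be illustrated with a toy pipeline: roll the table back, then re-run every retained upstream record through the corrected logic. The `replay` helper below is a hypothetical simplification of what Galaxy manages for you, not its API:

```python
import json

def replay(source_records, parse, table):
    """Rebuild a table by replaying all upstream records through new logic."""
    table.clear()  # reset: roll the table back before reprocessing
    for raw in source_records:
        table.append(parse(raw))

# Upstream records retained in Kafka / raw storage (simulated).
source = ['{"amount": "10"}', '{"amount": "25"}']

parse_v1 = lambda raw: json.loads(raw)["amount"]       # bug: leaves strings
parse_v2 = lambda raw: int(json.loads(raw)["amount"])  # corrected logic

table: list = []
replay(source, parse_v1, table)
print(table)  # ['10', '25']  -- incorrectly typed
replay(source, parse_v2, table)
print(table)  # [10, 25]      -- fixed after reset + replay
```

Because the replay starts from a known snapshot and consumes the full retained source exactly once, the rebuilt table is deterministic: no duplicates, no gaps.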

How would you use Reset Replay? 

This workflow is especially useful for JSON parsing, schema drift, and other situations where upstream changes cause breakage. Instead of maintaining side tables or writing complex merge processes, teams can apply the fix and let Galaxy reprocess historical data at high throughput. Because Iceberg tracks exactly what data contributed to each snapshot, replaying the table is deterministic and consistent. The result is a simpler, more reliable recovery path that removes a significant amount of operational overhead from data engineering teams.


Providing value for our customers

Using Managed Iceberg Pipelines, Starburst Galaxy isn’t just providing hypothetical value. It’s driving immediate results for our customers by simplifying their data ingestion pipeline. 

For example, Prodege, a leading consumer insights and rewards platform, is using Managed Iceberg Pipelines live tables to enhance data management. They operate a complex analytics environment that relies on fresh, reliable data from many operational systems. 

How Prodege used Starburst Galaxy to achieve 800% performance improvements

During a recent proof of value (POV), they experienced 800% performance improvements when replacing specific Snowflake workloads with Starburst.

To achieve these results, Prodege’s engineering team used Starburst Galaxy to build a unified, governed layer for streaming ingestion and Iceberg table management. Using Managed Iceberg Pipelines, they reduced the effort required to maintain data quality, particularly using the live tables feature to automate certain data maintenance tasks. 

What do they say?

As Prodege says, “Starburst Galaxy live tables are our number one used and also favorite feature in Starburst Galaxy. It just makes it so much easier for us, so I’m sure we’re gonna be using those features as soon as they roll out.” 

Why this matters

With Galaxy’s live tables, schema flexibility, and upcoming delete support, Prodege can manage high-volume Kafka workloads more efficiently while accelerating downstream analytics and modeling.

Coming soon to Starburst Galaxy

Starburst Galaxy is constantly adapting and growing. In the coming weeks, we will add support for the following features.

CSV Support for file ingest

Starburst Galaxy will soon expand its file ingest capabilities with native support for CSV files stored in Amazon S3. Engineers will be able to ingest CSV datasets directly into Iceberg tables without writing custom parsing logic or standing up separate ETL jobs. 

After the release, Galaxy will automatically detect column types, validate the structure, and hydrate the data into governed, query-ready tables. This makes it significantly easier to onboard legacy datasets, vendor exports, and spreadsheet-driven partner feeds, all while preserving a consistent ingestion workflow across formats.
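Column-type detection of the kind described above can be sketched as sampling each column and picking the narrowest type that parses every value. This is an illustrative approximation, not Galaxy's inference logic:

```python
import csv
import io

def infer_type(values) -> str:
    """Pick the narrowest SQL-ish type that fits every sampled value."""
    for cast, name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "varchar"

sample = "id,price,region\n1,9.99,us-east\n2,12.50,eu-west\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'bigint', 'price': 'double', 'region': 'varchar'}
```

Real inference also has to handle nulls, date formats, and quoting dialects, which is exactly the parsing logic Galaxy spares you from writing.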

Avro Support for streaming ingest

Starburst Galaxy will soon add native support for Avro-encoded Kafka streams, enabling engineering teams to ingest production event data with full schema evolution, compression, and schema-registry validation. 

After the release, this enhancement lets organizations connect enterprise-grade Kafka pipelines directly to Iceberg tables without custom serializers or downstream cleanup. As Avro messages evolve, Galaxy validates compatibility at ingest, ensuring that pipelines remain stable, governed, and continuously hydrated with analytics-ready data.

Schema registry for safer streaming pipelines

Starburst Galaxy is making streaming pipelines safer for users. Streaming pipelines often break when upstream producers evolve their schemas. 

Starburst Galaxy will prevent that failure mode through built-in schema registry integration. Galaxy will validate Avro messages against Confluent-compatible schema registries during ingest and enforce compatibility rules before data ever lands in your Iceberg tables.

This creates self-healing pipelines that continue to operate even as schemas change over time, eliminating silent ingestion errors and reducing the need for manual fixes. By shifting schema enforcement into the ingestion layer, Galaxy ensures that downstream tables remain consistent, queryable, and compliant as event streams evolve.
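The core compatibility rule a registry enforces can be illustrated in a few lines: a schema change is backward compatible only if every newly added field carries a default. This is a simplified sketch of the idea behind Confluent's BACKWARD mode, not a full Avro schema-resolution implementation:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """A new reader schema can read old data only if every field it adds
    carries a default value; this is the check a registry runs at ingest."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f].get("default") is not None for f in added)

old = {"user": {"type": "string"}, "ts": {"type": "long"}}

# Adding a field WITH a default: old records still resolve cleanly.
ok_change = {**old, "region": {"type": "string", "default": "unknown"}}
# Adding a field WITHOUT a default: old records cannot be read; reject it.
bad_change = {**old, "region": {"type": "string"}}

print(backward_compatible(old, ok_change))   # True
print(backward_compatible(old, bad_change))  # False
```

Rejecting the incompatible change at the ingestion layer, rather than discovering it downstream, is what keeps the pipeline running as producers evolve.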

Starburst Galaxy data ingestion makes Trino available to everyone

Collectively, these updates continue to improve Starburst Galaxy, making it easier than ever to leverage Trino by providing a data ingestion framework that supports users. 

If you are interested in learning more about Starburst Galaxy, Managed Iceberg Pipelines, and our data ingestion capabilities, please sign up for our upcoming webinar, Starburst Galaxy Managed Ingest: Next-Gen Data Ingestion for Iceberg.

 

Start for Free with Starburst Galaxy

Try our free trial today and see how you can improve your data performance.
Start Free