
Data engineers spend a lot of time fixing issues that arise in ingest pipelines. Whether it’s schema drift, upstream teams shipping new fields, or broken parsing logic, the result is the same.

Suddenly, you’re staring at a table that is broken. 

Unfortunately, the usual solutions to this problem are all painful. 

  • Accept partial data loss.
  • Run one-off backfills.
  • Stitch together fragile workflows around your streaming platform.

How Starburst fixes the data ingestion problem

At Starburst, we’ve been exploring a different approach, one built on top of Iceberg and the incremental materialized view model. It’s called Rewind and Backfill, and its goal is simple: give data engineers a safe, predictable way to do over incremental pipelines.

This post explains what Rewind and Backfill is, the kinds of problems it solves, and the architectural ingredients that make it possible.

What Is Rewind and Backfill?

At its core, Rewind and Backfill combines two capabilities. Let’s look at both of them individually.

1) Rewind

The first is Rewind. It time-travels the underlying Iceberg table back to a previous snapshot: a point in time before the bad data or incorrect logic was applied.

2) Backfill

The second is Backfill. It recomputes the incremental materialized view that defines the table, from that point forward, using updated computation or parsing logic.

How Rewind and Backfill works

How do these two halves work together without causing a version clash? 

Crucially, the backfill process generates a separate timeline while preserving the original data. This architectural choice significantly simplifies iterative error correction. 

The process is straightforward: 

  1. First, you update the underlying data parsing and column logic (for instance, by adding the new field definition or correcting a data type). 
  2. Next, you choose a point in time before the misaligned or new data began arriving. This doesn’t need to be exact, so you can simply choose a point that you’re sure is before the change occurred. 
  3. Finally, you save the changes, which triggers a backfill for the data pipeline. 

During the backfill, all existing columns are recomputed exactly as they were before, while the new field is now correctly parsed and populated from the source data. This unified, guided operation replaces a complex, error-prone workflow, ensuring data integrity and consistency.
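The three steps above can be sketched as a tiny in-memory simulation. Everything here is illustrative, not the Starburst API: the parse functions, record shapes, and rewind point are assumptions made for the example.

```python
# A minimal sketch of the rewind-and-backfill flow. The parse functions,
# records, and rewind point below are illustrative, not the Starburst API.

def parse_v1(record):
    # Original logic: stores temp as a raw string and drops unknown fields.
    return {"temp": record["temp"]}

def parse_v2(record):
    # Step 1 -- corrected logic: parses temp as a double and keeps "unit".
    return {"temp": float(record["temp"]), "unit": record.get("unit")}

source = [                              # raw records, in arrival order
    {"temp": "21.0"},
    {"temp": "22.5", "unit": "F"},      # the new field appears here
    {"temp": "23.1", "unit": "F"},
]

# Table state produced by the original logic (the "broken" table).
old_table = [parse_v1(r) for r in source]

# Step 2 -- pick any point at or before the change (offset 1 here).
rewind_point = 1

# Step 3 -- backfill: rows before the rewind point are preserved; rows
# after it are deterministically recomputed with the corrected logic.
new_timeline = old_table[:rewind_point] + [
    parse_v2(r) for r in source[rewind_point:]
]
```

Because the recomputation is a pure function of the source records, re-running it with yet another parse function would simply produce another timeline.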

The benefits of Rewind and Backfill

The result is that if the initial Rewind and Backfill operation doesn’t yield the desired results, you can easily modify the schema and repeat the process. Each iteration of Rewind and Backfill creates a distinct, new timeline.

This means that if an error occurs or a change is needed, you avoid complex, manual data backfills or custom scripting. Instead, you can return the pipeline to a state just before the problem, re-execute the processing with the corrected logic, and resume seamlessly. This ensures data consistency without creating duplicate records or losing any data.

Example: Parsing JSON data from Kafka into Iceberg using Parquet

In these scenarios, it’s best to start with a concrete example. Consider a common pattern: parsing JSON data from Kafka (or files) into Iceberg tables using Parquet.

Let’s explore an example of how Starburst Rewind and Backfill helps.

Breaking down the problem

Consider the challenge of ingesting JSON data from a Kafka stream or a set of files into an Iceberg table within a data lake environment. This process requires precise mapping of incoming data fields to the target Iceberg table columns. 

Several questions inevitably arise. 

  1. Is a source field a simple string or a more complex structure? 
  2. Should a numerical field, like temperature, be stored with decimal precision as a double, or as a whole number using a long?
  3. Furthermore, what is the protocol when an upstream team introduces a completely new field to the data source without prior notice?

Despite best practices and robust processes, the mappings between source and target data are prone to drifting and breaking over time. 

Traditionally, organizations faced a difficult choice when a new field or a type of misalignment occurred. The first option was to accept the loss of early data for the new field, sacrificing completeness. The second, more complex, path was to construct a complicated and often brittle backfill process, which carried the significant risk of duplicate ingestion or data misalignment with the existing records. Both traditional methods were inefficient and prone to error.

What to consider first

You have to decide how each JSON field maps into Iceberg columns:

  • Is “origin” a simple string or a timestamp?
  • Is “temperature” a double or a long?
  • Which fields are required vs. nullable?
  • How do you handle nested structures and arrays?

Even on a good day, these decisions are easy to get slightly wrong. And they don’t stay correct forever. Upstream producers evolve their payloads, add new fields, or change semantics. A product team might start publishing a new field; downstream analysts want to query it immediately.
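One way to keep these decisions explicit is to declare the mapping up front and coerce each record against it. The sketch below assumes a hypothetical `TARGET_SCHEMA` mapping and `coerce()` helper; neither is a Starburst or Iceberg API.

```python
import json

# Illustrative source-to-column mapping: (python type, required?).
TARGET_SCHEMA = {
    "origin": (str, True),         # kept as a simple string
    "temperature": (float, True),  # stored as a double, not a long
    "tags": (list, False),         # nullable array field
}

def coerce(payload):
    """Parse one JSON record against the target schema.

    Returns (row, unknown_fields) so that schema drift is surfaced
    rather than silently dropped.
    """
    raw = json.loads(payload)
    row = {}
    for col, (py_type, required) in TARGET_SCHEMA.items():
        if col not in raw:
            if required:
                raise ValueError(f"missing required field: {col}")
            row[col] = None
            continue
        row[col] = py_type(raw[col])  # explicit, deliberate coercion
    unknown = set(raw) - set(TARGET_SCHEMA)
    return row, unknown
```

If an upstream team ships a new field, it shows up in `unknown` instead of disappearing, which is exactly the signal you need before deciding to rewind.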

Why this scenario often goes wrong

In an ideal world, there is a formal change process and plenty of coordination. But in reality, schemas drift without warning, causing new fields to appear before downstream parsing logic is updated. Data engineers are pulled into ad hoc backfills and repair jobs.

How this causes downstream problems

When you finally fix the parser or add the new column, you often face an unpleasant choice: Either accept that existing data is incomplete or mis-typed, or engineer a complex backfill process that reprocesses historical data carefully enough not to double-count or corrupt the table.

This is exactly the gap that Rewind and Backfill aims to fill.

Under the Hood: Why Starburst Can Promise a Do‑Over

This section covers the core components of the Starburst ingest architecture that enable Rewind and Backfill. These capabilities are not bolt-ons; they follow directly from the architecture’s fundamental design.

Exactly-once processing

Our ingest system is designed to track metadata related to the upstream data sources that contribute to each state of the destination table. For example, for Kafka ingest, we track the partition and offset information corresponding to each destination table state. 

During regular operation, this metadata is committed atomically, ensuring it is securely linked with the corresponding Iceberg table snapshot, which backs our exactly-once delivery guarantee. A significant benefit of this process is the ability to use this information during backfilling operations, thereby preventing the redundant or double processing of data.
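The idea of pairing source offsets with table snapshots can be sketched as follows. The `commit` and `replay_range` functions and the in-memory `snapshots` list are hypothetical stand-ins for metadata that, in the real system, is committed atomically with the Iceberg snapshot.

```python
# Illustrative sketch: each table snapshot records the Kafka offsets it
# covers, so a backfill knows exactly where to resume without double
# processing. These names are not the Starburst API.

snapshots = []  # each entry: (snapshot_id, {partition: next_offset})

def commit(snapshot_id, offsets):
    # In the real system this pairing is written atomically alongside
    # the Iceberg snapshot; here we just append both together.
    snapshots.append((snapshot_id, dict(offsets)))

commit("snap-1", {0: 100, 1: 80})
commit("snap-2", {0: 250, 1: 190})

def replay_range(rewind_to):
    """Return the per-partition offsets to resume from when rewinding
    the table to the given snapshot."""
    for snap_id, offsets in snapshots:
        if snap_id == rewind_to:
            return offsets
    raise KeyError(rewind_to)
```

Rewinding to `snap-1` yields the exact offsets at which reprocessing should restart, so records before that point are never re-ingested.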

Fully managed control plane

The ingest system ensures data consistency and reproducibility by maintaining end-to-end control over the table’s state. Since the system solely manages the table, it reliably holds the authoritative snapshot, providing a single source of truth for the data. 

This closed management system is essential because it prevents external systems from making unauthorized mutations, thereby ensuring the table’s state is always predictable and reproducible.

Incremental materialized views

Tables are defined as incremental materialized views, so the system knows the mapping between the checkpoints and states of the upstream table and the downstream tables. Because of this mapping, backfill is a natural, deterministic recomputation: you can update the parsing logic and re-run exactly over the right data slices.
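The determinism argument can be made concrete with a toy model. The `materialize` function and checkpoint bookkeeping below are assumptions for illustration; they are not the actual view engine.

```python
# Sketch: an incremental materialized view as a pure function of the
# source plus checkpoint bookkeeping. Names are illustrative only.

def materialize(records, transform):
    # A view state is fully determined by its inputs and its transform.
    return [transform(r) for r in records]

source = [1, 2, 3, 4]
checkpoints = {0: 0, 1: 2, 2: 4}   # checkpoint id -> records consumed

# Table state at checkpoint 2 under the original (buggy) logic:
old = materialize(source[:checkpoints[2]], lambda x: x * 2)

# Rewind to checkpoint 1, then recompute only the affected slice with
# corrected logic -- a targeted, deterministic backfill:
fixed = old[:checkpoints[1]] + materialize(
    source[checkpoints[1]:], lambda x: x * 3
)
```

Because each checkpoint pins down exactly which source records a state covers, the recomputed slice lines up with the preserved prefix with no gaps or overlaps.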

Efficient multitenant architecture

A key benefit of this system is its ability to automatically adjust compute resources to perfectly match the size and shape of your current workload. Starburst has been tested to handle speeds up to 100 GB/s for streaming ingest from Kafka to Iceberg. 

This ensures that even if you rewind and need to reprocess many days of data, the system can catch up quickly. Once the reprocessing is complete, the compute resources automatically scale back down to the appropriate size, optimizing efficiency and cost.

Starburst leverages the power of Iceberg

Rewind and Backfill delivers a rare capability in streaming data systems: the ability to fix the past confidently and quickly. It’s enabled by the Icehouse architecture’s exactly-once foundation, tight metadata coupling with Iceberg snapshots, incremental materialized views, and elastic multitenant compute.

Together, these make the platform both powerful and forgiving, so you spend less time wrangling backfills and more time delivering value. Take Starburst for a spin with a free trial.
