Data Architecture · 6 min read · 10 March 2026

Idempotency in Data Pipelines: The Property That Separates Reliable Systems from Fragile Ones

Most pipeline failures aren't caused by bad code. They're caused by code that was written to run once and then asked to run again.

A scheduler retries a failed task. An engineer manually reruns a broken DAG after fixing a bug. A cloud provider interrupts a job mid-execution and restarts it. These aren't edge cases — they're Tuesday. And if your pipeline wasn't designed to handle them, your data warehouse is quietly filling up with duplicates, your aggregations are wrong, and nobody knows it yet.

Idempotency is the property that makes reruns safe. A pipeline is idempotent if running it multiple times with the same input produces the same output as running it once. It sounds simple. It isn't.

Why Most Pipelines Aren't Idempotent by Default

The default behaviour of most data movement operations is append. You run an insert, rows go in. You run it again, those rows go in again. This is fine when you control exactly when and how often a pipeline runs. In practice, you never fully control that.

The problem compounds at scale. A pipeline that appends tens of thousands of rows on every run, triggered by an unreliable upstream signal, can silently double or triple a dataset before anyone notices. By the time someone does notice — usually because a dashboard shows twice last month's revenue — the damage is already in production, downstream systems have consumed the bad data, and untangling it is a forensic exercise.

The root cause is almost always the same: the pipeline was designed around the happy path. It works correctly when everything goes right. It breaks in unpredictable ways when anything goes wrong.

The Four Patterns That Actually Work

1. Upsert Instead of Insert

The most direct fix is to replace blind inserts with upserts. Instead of appending every incoming row, you define a unique key and let the database decide: update the row if it exists, insert it if it doesn't. Most modern warehouses — Snowflake, BigQuery, Redshift, Databricks — support this natively via a MERGE statement.

The outcome is defined by the data, not by how many times the operation runs. Run it once or ten times — the result is the same. The catch is that you need a reliable unique key. If your source data doesn't have one, or if it's composite and inconsistently populated, that's a data quality problem that needs solving before idempotency is even possible.
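A minimal sketch of the pattern, using SQLite's `INSERT ... ON CONFLICT` as a stand-in for a warehouse `MERGE` (table and column names are illustrative):

```python
import sqlite3

# In-memory database standing in for a warehouse table; Snowflake, BigQuery,
# Redshift and Databricks express the same idea with a MERGE statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

def load(rows):
    # Upsert keyed on order_id: insert new rows, update existing ones.
    conn.executemany(
        """INSERT INTO orders (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        rows,
    )
    conn.commit()

batch = [("A-1", 10.0), ("A-2", 25.0)]
load(batch)
load(batch)  # a scheduler retry: same input, same result

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2, no matter how many times load() ran
```

The state of the table is a function of the input, not of the run count — which is exactly the property the retry-happy scheduler needs.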

2. Partitioned Overwrites

For large-scale batch pipelines where row-level merges are too expensive, partitioned overwrites are the right tool. Rather than appending to a table, you overwrite a specific partition — typically a date partition — on every run. If the job fails halfway through and is retried, the retry simply overwrites that partition from scratch. No duplicates. No partial writes.

The idempotency comes from the overwrite semantics, not from deduplication logic. You also want a small lookback window — reprocessing the last few days rather than just today — to handle late-arriving data, which is a real problem in event-driven architectures where events can arrive hours or days after they occurred.
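The mechanics can be sketched with a toy partition store (the structure is illustrative; engines like Spark express this as `INSERT OVERWRITE` on a date-partitioned table):

```python
from collections import defaultdict
from datetime import date, timedelta

# Toy warehouse: partition key (a date) -> list of rows.
warehouse = defaultdict(list)

def load_partition(partition_date, rows):
    # Overwrite semantics: replace the partition wholesale, never append.
    warehouse[partition_date] = list(rows)

def run_pipeline(run_date, source, lookback_days=3):
    # Reprocess a small lookback window to pick up late-arriving events.
    for offset in range(lookback_days):
        d = run_date - timedelta(days=offset)
        load_partition(d, source.get(d, []))

source = {date(2026, 3, 9): ["evt-1", "evt-2"], date(2026, 3, 10): ["evt-3"]}
run_pipeline(date(2026, 3, 10), source)
run_pipeline(date(2026, 3, 10), source)  # a retry changes nothing

total = sum(len(rows) for rows in warehouse.values())
print(total)  # 3 — no duplicates, no partial writes
```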

3. Watermarks and Sequence Numbers

Not all pipelines operate on time-partitioned data. When you're consuming from a changelog, an event stream, or a transactional source, watermarks give you a reliable mechanism for picking up exactly where you left off without reprocessing what you've already handled.

The critical detail: the watermark update must be atomic with the data write — either both succeed or neither does. If you update the watermark after the data write and the process dies between the two operations, you've committed data without advancing the watermark, and your next run will reprocess it. Wrap them in a transaction, or explicitly design the data write to be safe if it runs twice.
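A transactional sketch of that atomicity, again using SQLite (table names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (seq INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE watermark (id INTEGER PRIMARY KEY CHECK (id = 1), last_seq INTEGER);
    INSERT INTO watermark VALUES (1, 0);
""")

def consume(source):
    # Read the watermark, fetch only newer records, and commit the data
    # write and the watermark advance as a single transaction.
    last = conn.execute("SELECT last_seq FROM watermark").fetchone()[0]
    new = [(seq, payload) for seq, payload in source if seq > last]
    if not new:
        return
    with conn:  # both statements commit together, or neither does
        conn.executemany("INSERT INTO events VALUES (?, ?)", new)
        conn.execute("UPDATE watermark SET last_seq = ?", (new[-1][0],))

stream = [(1, "a"), (2, "b"), (3, "c")]
consume(stream)
consume(stream)  # rerun: the watermark filters out what was already handled
```

If the process dies inside the `with conn:` block, SQLite rolls back both statements, so the next run sees a consistent watermark and simply reprocesses the batch.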

4. Idempotency Keys for External APIs

Pipelines that call external APIs — pushing data to a CRM, triggering webhooks, initiating financial transactions — are the hardest to make idempotent because you don't control the other side. The standard pattern is to generate a stable idempotency key derived from the content of the record itself, rather than from something transient like a timestamp or a UUID generated at runtime.

Many APIs support idempotency keys natively — Stripe is the canonical example, but many modern SaaS platforms have followed. For those that don't, you need to maintain your own log of sent records and check it before each call. It's more work, but the alternative is silent duplicates in systems you don't own.
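A sketch of the content-derived key plus a sent-log, with in-memory stand-ins for the durable log and the external API (all names here are hypothetical):

```python
import hashlib
import json

sent_log = set()  # stands in for a durable store of already-sent keys
api_calls = []    # stands in for the external API we don't control

def idempotency_key(record):
    # Derive the key from record content, not from runtime state like
    # timestamps or fresh UUIDs, so a retry produces the same key.
    canonical = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def push(record):
    key = idempotency_key(record)
    if key in sent_log:
        return  # already delivered; the retry is a no-op
    api_calls.append((key, record))  # e.g. POST with an Idempotency-Key header
    sent_log.add(key)

record = {"invoice": "INV-42", "amount": 99.5}
push(record)
push(record)  # retry after a timeout: no duplicate call

print(len(api_calls))  # 1
```

Canonicalising the record (`sort_keys=True`) matters: the same logical record must always hash to the same key, regardless of field order.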

The Testing Gap Nobody Talks About

Most pipeline test suites validate the happy path: given input A, does the pipeline produce output B? Almost nobody tests what happens when you run the pipeline twice with the same input A.

A second-run test is one of the most valuable tests you can add to a data pipeline. Run the pipeline, record the row count and a checksum of the output, run it again with identical input, and assert that nothing changed. It takes minutes to write and will catch idempotency regressions before they reach production — including the ones introduced by well-meaning engineers who didn't realise their change broke the guarantee.
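The whole test fits in a few lines. A sketch against a toy upsert pipeline (the pipeline and table names are illustrative; the snapshot-compare structure is the point):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE out (k TEXT PRIMARY KEY, v TEXT)")

def pipeline(rows):
    # The pipeline under test: here, an idempotent upsert.
    conn.executemany(
        "INSERT INTO out VALUES (?, ?) ON CONFLICT(k) DO UPDATE SET v = excluded.v",
        rows,
    )
    conn.commit()

def snapshot():
    # Row count plus an order-independent checksum of the output table.
    rows = conn.execute("SELECT k, v FROM out ORDER BY k").fetchall()
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()
    return len(rows), digest

data = [("a", "1"), ("b", "2")]
pipeline(data)
first = snapshot()
pipeline(data)              # second run, identical input
assert snapshot() == first  # the second-run test: nothing changed
```

Swap the `pipeline` body for a blind `INSERT` and the assertion fails on the primary-key violation — which is the regression this test exists to catch.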

Where This Goes Wrong in Practice

The patterns above are well-understood. The implementation failures we see most often aren't about not knowing them — they're about edge cases that weren't thought through.

  • Composite keys with nulls. A unique key across three fields breaks down the moment any of them is nullable. Most databases do not consider two null values equal, so you get duplicate rows even with a properly structured upsert. Nulls in key fields need to be treated as a data quality error, not a default.
  • Late-arriving deletes. Partitioned overwrites handle late-arriving inserts and updates cleanly. They handle deletes poorly. If a record was deleted at source after you already wrote it to the warehouse, your overwrite won't remove it. You need a soft-delete pattern or a CDC-aware strategy.
  • Watermarks without transactions. Advancing the watermark in a separate step from the data write, without wrapping both in a transaction, creates a window where a process failure leaves you in an inconsistent state. This is one of the most common sources of subtle data duplication we see in production systems.
  • Non-deterministic transformations. If your pipeline uses the current timestamp or a random function in the actual output data — not just for logging — two runs of the same input will produce different outputs. This isn't just an idempotency problem. It makes the pipeline fundamentally untestable.

The Operational Payoff

Idempotent pipelines change what on-call looks like. When a job fails in a non-idempotent system, the on-call engineer has to understand the failure, assess the data state, potentially clean up partial writes, and then decide if it's safe to rerun. That's thirty minutes of careful work at 2am, with real risk of making things worse.

In an idempotent system, the answer to almost every pipeline failure is: rerun it. The pipeline itself handles the recovery. The on-call engineer acknowledges the alert and goes back to sleep.

Over the lifetime of a data platform, the operational cost of non-idempotent pipelines — in engineer time, in data quality incidents, in the slow erosion of trust in the data — is substantial. Building idempotency in from the start is almost always cheaper than retrofitting it into a system that's already in production and already wrong.

If you're looking at a data estate where reruns require a runbook, that's the problem to fix first.

Written by ATHING

We design and build data infrastructure, automation pipelines, and AI systems for organisations that need them to work.

Talk to Us