Data Governance & Quality · 6 min read · 2 December 2024

Data Contracts: The Interface Between Teams That Nobody Wrote Down

The pipeline breaks on a Tuesday morning. The investigation takes a few hours. The root cause is a column rename — user_id to account_id — that an upstream application team made as part of a routine refactor three weeks ago. They had no idea anyone was consuming that column from the database. Nobody told them. Nobody had to.

This is one of the most common and most avoidable data incidents in organisations that have more than one team producing data. The application team did nothing wrong by their own standards. They refactored their schema, updated their application code, tested their service, and shipped. The downstream data pipeline wasn't their problem to think about. And in most organisations, that's exactly the issue — it genuinely isn't.

Data contracts are how you change that. Not with blame, but with a formal structure that makes the dependency visible and gives both sides an agreed basis for coordinating changes.

What a Data Contract Actually Is

A data contract is a formal agreement between a data producer and a data consumer that specifies what data will be provided, in what format, with what guarantees. It covers schema — field names, types, nullability — but also semantic expectations: what value ranges are valid, what business logic the fields encode, what "null" means in context versus what an empty string means.

A complete contract also includes SLA commitments: how fresh the data will be, what completeness guarantees exist, and what the producer is committing to in terms of availability. And it includes a breaking change policy — how much notice consumers will receive before a schema change that breaks existing consumers, and what the process is for requesting a change.
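Semantic expectations are the part most often left implicit, but they can be encoded as executable checks. A minimal sketch in Python, with illustrative field names and rules (a hypothetical `status` enum and a `signup_source` field where null means "unknown" and an empty string is invalid — none of this comes from a real contract):

```python
# Semantic checks for a hypothetical user record.
# Field names and rules are illustrative, not from a real contract.
VALID_STATUSES = {"active", "suspended", "deleted"}

def check_record(record: dict) -> list[str]:
    """Return a list of semantic violations for one record."""
    violations = []
    if record.get("status") not in VALID_STATUSES:
        violations.append(f"status {record.get('status')!r} outside valid range")
    # The contract distinguishes null (source unknown) from empty string (invalid).
    if record.get("signup_source") == "":
        violations.append("signup_source is empty string; use null for unknown")
    return violations
```

A record like `{"status": "active", "signup_source": None}` passes cleanly; one with an out-of-range status and an empty-string source fails on both counts.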

That last part is where most contracts fail when they're designed only as technical documents. The schema definition is the easy part. Getting a team to commit to a two-week notice period before renaming a column requires them to accept an obligation they didn't have before.

Why Data Producers Don't Think in These Terms

Application engineers are not, by training or incentive, thinking about downstream data consumers. Their primary obligation is to the product — the service they own, the users it serves, the sprint they're in the middle of. The database schema is an implementation detail. Changing it is a routine engineering decision. The idea that this decision has external stakeholders is genuinely foreign to many engineering teams.

This isn't a competence problem. It's a systems problem. Nobody designed a process that made data consumers visible to producers. Nobody put the analytics pipeline on the application team's list of things to check before shipping. In the absence of that structure, producers change things, consumers break, and the data team spends its week doing archaeology.

Establishing a data contract changes the system. It creates a named interface between the two teams. It makes the dependency explicit. And it gives the producer something concrete to check — not "is anyone using this?" (unanswerable) but "does this change violate the contract?" (answerable).
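The "does this change violate the contract?" question can be made mechanical. A sketch, assuming the contracted schema is available as a simple column-name-to-type mapping (the representation here is illustrative, not a standard format):

```python
def breaking_changes(contract: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Compare a proposed schema against the contracted one.

    Removals and type changes break consumers; additions are
    backwards-compatible and are not flagged.
    """
    problems = []
    for column, col_type in contract.items():
        if column not in proposed:
            problems.append(f"column '{column}' removed or renamed")
        elif proposed[column] != col_type:
            problems.append(
                f"column '{column}' changed type {col_type} -> {proposed[column]}"
            )
    return problems

# The rename from the incident above shows up immediately:
contract = {"user_id": "varchar", "created_at": "timestamp"}
proposed = {"account_id": "varchar", "created_at": "timestamp"}
print(breaking_changes(contract, proposed))
# ["column 'user_id' removed or renamed"]
```

A check like this can run in the producer's CI, turning an unanswerable question into a failed build.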

What Good Contracts Contain

The minimum viable data contract has four components:

  • Schema definition with versioning. Field names, data types, nullability constraints, and a version number. When the schema changes, the version increments. Consumers know what version they're built against.
  • SLA commitments. How fresh will the data be? What completeness guarantee exists — is a partially delivered dataset acceptable, and under what conditions? What happens when the SLA is missed?
  • Breaking change policy. How much notice will consumers receive before a breaking schema change? What constitutes a breaking change versus a backwards-compatible addition? Is there a deprecation path for old fields?
  • Ownership contacts. Who owns the contract on the producer side? Who are the registered consumers? When something breaks, who calls whom?
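The four components above fit in a small, versionable structure. A sketch in Python (the field choices are assumptions for illustration, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    # 1. Schema definition with versioning
    version: str              # e.g. "2.1.0"; bump the major version on breaking changes
    schema: dict[str, str]    # column name -> data type
    # 2. SLA commitments
    freshness_hours: int      # data is no older than this at read time
    min_completeness: float   # fraction of expected rows that must arrive
    # 3. Breaking change policy
    notice_period_days: int   # minimum notice before a breaking change ships
    # 4. Ownership contacts
    producer_owner: str
    consumers: list[str] = field(default_factory=list)
```

Even if the contract ultimately lives as YAML in a repository, thinking of it as a typed record like this keeps the four components from drifting apart.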

The Tooling Landscape

There are several tools that support contract enforcement. dbt has a native contract feature that lets you assert the schema of a model and fail the build if it drifts. Great Expectations and Soda provide data quality assertions that can be run against incoming data, effectively enforcing contract terms at the ingestion point. Schema registries — Confluent's registry for Kafka-based systems is the most common — enforce schema compatibility at the message level, blocking producers from publishing schemas that would break existing consumers.
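For a concrete sense of the dbt feature, an enforced model contract looks roughly like this in the model's YAML (model and column names are illustrative; consult dbt's documentation for the full option set):

```yaml
# models/schema.yml -- illustrative example
models:
  - name: dim_users
    config:
      contract:
        enforced: true   # the build fails if the model's output drifts from this spec
    columns:
      - name: user_id
        data_type: integer
        constraints:
          - type: not_null
      - name: created_at
        data_type: timestamp
```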

The honest observation here is that the tool matters less than the process. You can enforce a data contract with nothing more sophisticated than a shared document and a code review checklist. What you can't substitute is someone owning the contract — a named person or team who is responsible for maintaining it, reviewing it when either side wants to make a change, and escalating when it's violated.

The Organisational Problem You Can't Tech Your Way Out Of

Data contracts only work if there are real consequences for breaking them. That sounds obvious, but it means engineering leadership — not just the data team — has to treat downstream data consumers as legitimate stakeholders in schema decisions. In most organisations, they don't. Analytics and data pipelines are viewed as a separate system, someone else's concern, insulated from the application engineering process.

Getting traction on contracts requires making the cost of the current situation visible. An incident log of every pipeline break caused by an undisclosed schema change, with the hours lost to investigation and remediation, is more persuasive than any architectural argument. Once engineering leadership understands that these incidents are recurring and preventable, the conversation about contracts becomes much easier.

It also helps to frame contracts not as a restriction on application teams but as a service to them. Without a contract, any schema change carries unknown risk — is someone depending on this field? A contract answers that question definitively. Teams that have adopted contracts often report that they're more willing to clean up their schemas precisely because the contract makes the impact transparent.

What Contracts Actually Deliver

Data contracts don't prevent all schema changes. Producers will still need to evolve their schemas, and consumers will still need to adapt. That's not a failure of the contract model — it's how systems evolve.

What contracts deliver is visibility, planning, and communication. The upstream team can still rename a column. But they do it with notice, with a migration path, and with the downstream team's acknowledgement. The pipeline still changes. But it changes in a scheduled maintenance window, not in a 2am incident.

That shift — from uncoordinated change that breaks things to coordinated change that's absorbed — is what mature data organisations look like from the outside. The infrastructure is the same. The process around it is not.

Written by ATHING

We design and build data infrastructure, automation pipelines, and AI systems for organisations that need them to work.
