Data Governance & Quality · 5 min read · 27 February 2024

Data Quality Checks Belong in the Pipeline, Not the Dashboard

Most data quality issues are discovered by a business user who notices something wrong in a report. By the time that happens, the bad data has been in production for hours or days. Downstream systems have consumed it. Decisions may have been made on it. Going back and fixing the source data doesn't undo any of that.

This is the timing problem at the centre of most data quality conversations, and it's almost never framed correctly. Organisations invest in data quality tooling — they add monitoring dashboards, they assign data stewards, they build review processes — and then wonder why data quality incidents keep happening. The answer is usually that they're catching problems at the end of the chain, where the cost of remediation is highest, rather than at the beginning, where bad data can be stopped before it enters the system.

Where Checks Can Live, and Why the Order Matters

There are three places in a typical data stack where quality checks can live. They are not equivalent. Each catches problems at a different stage, with a different cost of remediation.

The Dashboard Layer

Checks at the dashboard or report layer catch problems after the data is visible to stakeholders. The bad data is already in production, already in the warehouse, already available to every system that reads from it. Fixing the root cause doesn't fix what's already been served. The remediation cost is at its highest here — not just in engineering time, but in the organisational cost of correcting decisions that were made on bad numbers and rebuilding trust in data that was publicly wrong.

This is where most organisations discover their quality issues because it's where the data is most visible. That visibility is useful for detection. It's a terrible property in a quality gate.

The Transformation Layer

Checks in the transformation layer — dbt tests, SQL assertions, model-level validation — catch problems after data has landed in the warehouse but before it's surfaced in reports. This is meaningfully better. Bad data doesn't reach stakeholders. But it's already in storage, it's already been ingested, and it's already available to any system reading directly from the raw tables. Remediating here still means backfilling the affected tables with corrected data and managing the window during which the raw layer was wrong.
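The underlying pattern of a transformation-layer check is a query-and-assert against data that has already landed. As an illustrative sketch (not dbt itself, which expresses the same idea declaratively), here is the pattern in Python, using an in-memory SQLite database to stand in for the warehouse; the table and column names are hypothetical:

```python
import sqlite3

def assert_no_nulls(conn, table, column):
    """Transformation-layer check: fail if a landed column contains nulls."""
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    if count:
        raise AssertionError(f"{count} null value(s) in {table}.{column}")

# Stand-in for the warehouse: data has already landed before the check runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('o1', 10.0)")
assert_no_nulls(conn, "orders", "amount")  # passes on clean data
```

Note what the check cannot do: the bad rows are already in storage when it fires, which is exactly the remediation cost described above.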

The Ingestion Layer

Checks at the ingestion layer — at the point of entry, before data is written to the warehouse — catch problems where the cost of remediation is lowest. Bad data never enters the system. The pipeline halts, an alert fires, the issue is investigated and resolved at the source. Nothing downstream is affected because nothing downstream has seen the bad data yet.

This is where quality checks should be. Not exclusively — transformation-layer checks have real value as a second line of defence — but the ingestion layer is where you prevent damage, not just detect it.

What Pipeline-Layer Checks Look Like in Practice

Ingestion-layer quality checks are not exotic. They are assertions about the data that must be true for the pipeline to continue. Schema validation: does the incoming data match the expected schema, with the expected types and nullability constraints? Null checks: are the fields that must be populated actually populated? Referential integrity: do the foreign keys in the incoming data resolve against the reference tables they're supposed to? Value range assertions: is this field within the expected range, or has something produced an order-of-magnitude error that will corrupt every aggregate it touches? Row count comparisons: does the volume of incoming data fall within the expected range for this time window, or is the source delivering an anomalous payload?

Any of these failing should halt the pipeline and raise an alert. Not log a warning. Not write a flag to a monitoring table. Stop, and alert. The pipeline does not continue until the quality issue is resolved.
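The checks above can be sketched as a single gate that runs before anything is written to the warehouse. This is a minimal illustration, not a framework: the names (`QualityCheckError`, `run_quality_gates`, the example schema fields) are hypothetical, and the thresholds are placeholders you would calibrate for your own sources.

```python
class QualityCheckError(Exception):
    """Raised to halt the pipeline before bad data is written."""

# Schema contract: field name -> (expected type, nullable)
EXPECTED_SCHEMA = {
    "order_id": (str, False),
    "customer_id": (str, False),
    "amount": (float, False),
    "coupon_code": (str, True),
}

def run_quality_gates(batch, known_customer_ids, expected_rows):
    # Row count comparison: anomalous volume suggests a bad source payload.
    low, high = expected_rows
    if not (low <= len(batch) <= high):
        raise QualityCheckError(f"row count {len(batch)} outside [{low}, {high}]")

    for i, row in enumerate(batch):
        # Schema validation: expected fields, types, and nullability.
        for field, (ftype, nullable) in EXPECTED_SCHEMA.items():
            if field not in row:
                raise QualityCheckError(f"row {i}: missing field {field!r}")
            value = row[field]
            if value is None:
                if not nullable:
                    raise QualityCheckError(f"row {i}: null in non-nullable {field!r}")
            elif not isinstance(value, ftype):
                raise QualityCheckError(
                    f"row {i}: {field!r} has type {type(value).__name__}"
                )
        # Referential integrity: the foreign key must resolve.
        if row["customer_id"] not in known_customer_ids:
            raise QualityCheckError(f"row {i}: unknown customer {row['customer_id']!r}")
        # Value range assertion: catch order-of-magnitude errors.
        if not (0 < row["amount"] < 1_000_000):
            raise QualityCheckError(f"row {i}: amount {row['amount']} out of range")
    # Only a batch that passes every gate reaches this point;
    # the caller may now write it to the warehouse.
```

The design choice that matters is that the gate raises rather than logging: an exception stops the run and surfaces in alerting, which is the fail-fast behaviour the next section argues for.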

The Fail-Fast Principle

A pipeline that ingests bad data and continues running does more damage than a pipeline that stops. This seems obvious, but there is real resistance to it in practice. Stopping the pipeline is visible. It creates an alert, it triggers an on-call response, it interrupts the schedule. Continuing with bad data is invisible — until a stakeholder notices something wrong in a dashboard and the investigation begins.

Fail fast is not a concession to perfectionism. It is the recognition that invisible bad data in production is worse than a visible pipeline failure. A stopped pipeline is a solvable, bounded problem. Bad data in production is an unbounded problem — you don't know how far it's propagated or what decisions it has influenced until you trace it.

The Objection About Alert Volume

The consistent pushback against pipeline-layer quality checks is that they create noise — more failures, more alerts, more on-call burden. This is true. It is also the correct outcome. Adding quality gates at ingestion will surface data quality problems that were previously passing silently into the warehouse. Those problems were always there. The checks make them visible at a point where they're fixable rather than at a point where they're already in production.

The engineering response to alert noise is to tune thresholds, not to remove the checks. If a row count assertion is triggering on normal variation because the threshold was set too tightly, the fix is to calibrate the threshold against historical patterns. If a null check is firing on a field that is genuinely nullable in some upstream scenarios, the fix is to update the schema contract. The checks stay. They get better.
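Calibrating against historical patterns can be as simple as deriving a band from past run volumes. A minimal sketch, assuming daily row counts are available from pipeline run history; the function name and the 3-sigma band are illustrative choices, not a prescription:

```python
import statistics

def calibrate_row_count_band(historical_counts, sigmas=3.0, floor=0):
    """Derive an (expected_low, expected_high) band from historical volumes."""
    mean = statistics.mean(historical_counts)
    stdev = statistics.stdev(historical_counts)
    low = max(floor, int(mean - sigmas * stdev))
    high = int(mean + sigmas * stdev)
    return low, high

# Recent daily row counts (hypothetical): normal variation stays in-band,
# so the check fires only on genuine anomalies.
band = calibrate_row_count_band([100, 110, 90, 105, 95])
```

Recalibrating periodically as source volume drifts keeps the check honest: the threshold tracks reality instead of being loosened by hand every time it fires.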

The alternative — removing checks to reduce noise — produces a system that is quiet and wrong. The CFO dashboard shows the wrong number. The board report was built on a bad dataset. The Monday morning meeting is derailed by an hour-long argument about which figure is correct. That is a much more expensive form of noise than a pipeline alert at 3am.

Quality Is an Engineering Problem

Data quality is consistently treated as a governance problem. Assign ownership. Build a data catalogue. Define stewardship processes. Review reports for anomalies. These activities have value, but they address quality after the fact. They do not prevent bad data from entering the system — they create structures for detecting and responding to it once it has.

The engineering framing is different: build systems that cannot accept bad data. Define what "good" means at the point of entry, and reject anything that doesn't meet the definition. When the definition is wrong or incomplete, fix it — but don't remove the gate while the conversation is happening.
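One way to make "fix the definition, not the gate" concrete is to keep the definition of "good" as plain data that the gate reads, so correcting it is a contract edit rather than a code bypass. A hypothetical sketch (the `CONTRACT` structure and `reject_if_invalid` name are illustrative):

```python
# The contract is data: when a field turns out to be genuinely nullable
# upstream, the fix is to edit CONTRACT, never to bypass the gate.
CONTRACT = {
    "required": {"order_id", "amount"},
    "nullable": {"coupon_code"},
}

def reject_if_invalid(record):
    """Point-of-entry gate: accept only records that meet the contract."""
    missing = CONTRACT["required"] - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    for field, value in record.items():
        if value is None and field not in CONTRACT["nullable"]:
            raise ValueError(f"unexpected null in {field!r}")
    return record
```

Keeping the contract separate from the enforcement code also gives the "conversation about the definition" a concrete artifact to review, while the gate stays in place throughout.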

Organisations that treat data quality as an engineering constraint — part of the pipeline design, not an afterthought — have materially fewer quality incidents than those that treat it as a monitoring and governance problem. The data doesn't improve because people are watching it more carefully. It improves because the system stops accepting bad data.

Written by ATHING

We design and build data infrastructure, automation pipelines, and AI systems for organisations that need them to work.
