AI & Machine Learning·7 min read·18 February 2025

Why Most ML Models Never Reach Production

The industry statistic that keeps circulating — somewhere between 80 and 90 percent of ML models never make it to production — is usually treated as a model quality problem. It isn't. The models are often fine. The infrastructure around them is what fails.

Most organisations that have invested in data science have a graveyard of notebooks. Models that performed well on the test set, got presented in a review meeting, and then quietly stopped being anyone's responsibility. The cause is almost never the algorithm. It's the gap between the environment where the model was built and the environment where it needs to operate.

Training-Serving Skew

This is the single most common cause of production ML failures, and it's almost always invisible until it isn't.

When a data scientist builds a model, they compute features from historical data — aggregations, transformations, derived signals. The model learns on those features and performs well. Then someone tries to deploy it. At inference time, those same features have to be computed in real time, from a different system, often by a different team, using different code. The logic is supposed to be equivalent. It rarely is, exactly.

A feature that was computed as a trailing 30-day average in a batch job gets reimplemented as a query against a production database with slightly different timestamp handling. A categorical encoding that was fitted on training data doesn't account for new categories that appeared in production. A null-handling assumption differs between the Python data science environment and the Java service that ended up owning inference.
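The timestamp-handling example above can be made concrete. The sketch below is illustrative, not from any real system: the same "trailing 30-day average" is implemented twice, and the online version truncates the reference time to midnight before computing the window start — a plausible, subtle divergence that changes which events fall inside the window.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical event log for one customer: (timestamp, amount).
events = [
    (datetime(2025, 1, 11, 6, 0, tzinfo=timezone.utc), 100.0),
    (datetime(2025, 1, 25, tzinfo=timezone.utc), 200.0),
    (datetime(2025, 2, 9, tzinfo=timezone.utc), 300.0),
]

def trailing_avg_batch(events, as_of):
    # Batch feature job: window is the exact 30 days before `as_of`.
    start = as_of - timedelta(days=30)
    vals = [v for t, v in events if start <= t < as_of]
    return sum(vals) / len(vals) if vals else 0.0

def trailing_avg_online(events, as_of):
    # Online reimplementation: truncates `as_of` to midnight before
    # subtracting -- a small difference that shifts the window boundary.
    midnight = as_of.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight - timedelta(days=30)
    vals = [v for t, v in events if start <= t < as_of]
    return sum(vals) / len(vals) if vals else 0.0

as_of = datetime(2025, 2, 10, 12, 0, tzinfo=timezone.utc)
print(trailing_avg_batch(events, as_of))   # 250.0 -- Jan 11 06:00 excluded
print(trailing_avg_online(events, as_of))  # 200.0 -- Jan 11 06:00 included
```

Both functions are "correct" in isolation; they simply disagree, and the model only ever sees one of them at a time.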

The model degrades. Slowly at first, then noticeably. Nobody immediately suspects the features because the model was validated against the test set and it looked fine there. Tracking down training-serving skew when it's already in production is one of the more time-consuming debugging exercises in applied ML.

No Deployment Infrastructure

Data science teams are not typically staffed or incentivised to build production services. A model that exists as a serialised file in an S3 bucket is not deployed. It is stored. Those are different things.

Actual deployment requires decisions that the model training process doesn't force you to make: What are the latency requirements? Synchronous or asynchronous inference? How does the calling system authenticate? What happens when the model endpoint is unavailable — does the calling service fail, fall back to a rule, or queue the request? What does a rollback look like if the new model version behaves badly?
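One of those decisions — what the calling service does when the endpoint is unavailable — can be sketched as a thin client wrapper. Everything here is illustrative: `ModelClient` stands in for a real inference endpoint, and the fallback value is an assumed conservative default, not anything the model produced.

```python
class ModelClient:
    """Stand-in for a real inference endpoint (names are illustrative)."""
    def __init__(self, healthy=True):
        self.healthy = healthy

    def predict(self, features, timeout=None):
        if not self.healthy:
            raise TimeoutError("endpoint unavailable")
        return 0.87  # pretend model score

def predict_with_fallback(features, client, fallback=0.0):
    # When the endpoint times out or is down, return a rule-based
    # default instead of propagating the failure to the caller.
    try:
        return client.predict(features, timeout=0.2)
    except TimeoutError:
        return fallback
```

The point is not the three lines of `try/except` — it is that someone had to decide the fallback is `0.0` and that the caller can tolerate it, and that decision belongs in the deployment design, not in an incident review.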

Most ML projects don't answer these questions until someone tries to actually put the model in front of production traffic, at which point the answers become urgent and the scope of work expands significantly. Projects stall here. The model is "done" from the data science perspective, but nothing is in production, and the path to get there requires engineering work that nobody scoped.

No Ownership

Models degrade. The data they were trained on reflects a world that changes — customer behaviour shifts, product catalogues are updated, market conditions evolve. A model that was accurate at launch drifts over time unless it is actively maintained.

In most organisations, there is a clear ownership gap. Data scientists build models. Engineers own production services. The model lives in the space between those two responsibilities. When the model starts underperforming six months after launch, the data scientist who built it has moved on to the next project. The engineer who maintains the service doesn't have the context or the tooling to retrain it. Nobody is monitoring prediction distributions. Nobody has a process for triggering a retrain when performance degrades beyond a threshold.

This isn't a people problem — it's a structural one. It reflects the absence of a defined operational model for ML, rather than any individual failing to do their job.

Evaluation That Doesn't Match Production

A model evaluated on a held-out test set has been evaluated on a snapshot of historical data. That snapshot shared the same distribution as the training data, was cleaned with the same pipeline, and contained none of the edge cases that production traffic generates every day.

Distribution shift is real and common. The kinds of inputs a model encounters in production — user queries with typos, transactions from new geographies, sensor readings from hardware variants that weren't in the training corpus — are not well represented in a held-out set constructed from the same data the model was trained on. Offline metrics look strong. Online metrics, when someone finally measures them, tell a different story.
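A minimal way to quantify that gap for a single feature is a two-sample Kolmogorov–Smirnov statistic between the training snapshot and a window of production values. The implementation below is a simple stdlib sketch (thresholds are a judgment call, not a standard):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of two samples. Values near 0 mean the
    distributions look similar; larger values signal shift."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```

Run against a training-time sample of a feature and a recent production sample, a statistic that was near zero at launch and is now large is the offline/online gap made measurable.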

The evaluation gap also hides latency problems. A model that takes 400 milliseconds to produce a prediction looks fine in a notebook. In a user-facing application with a 200-millisecond budget, it is not deployable without significant optimisation. These constraints are not discovered until someone tries to integrate the model into a real system.
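The latency question, at least, is cheap to answer before integration. A rough notebook-side check — illustrative, and no substitute for load testing the real serving path — is to time repeated predictions and compare a high percentile against the budget:

```python
import time

def latency_percentile_ms(fn, n=200, pct=99):
    """Time `fn` n times and return roughly the pct-th percentile in
    milliseconds. A coarse sanity check against a serving budget."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[min(n - 1, int(n * pct / 100))]

# Hypothetical usage against a 200 ms budget:
# p99 = latency_percentile_ms(lambda: model.predict(features))
# assert p99 <= 200, "over the serving budget before optimisation"
```

It measures single-threaded, warm-cache latency with no network hop, so it gives a floor, not a guarantee — but a model that misses the budget here will certainly miss it in production.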

What Actually Works

MLOps has become a buzzword, but the underlying engineering discipline is real and matters. The organisations that consistently ship ML models to production treat model deployment as a software engineering problem, not a data science deliverable.

Feature stores are the most impactful structural change most ML teams can make. A feature store maintains a single definition of each feature that serves both the offline training context and the online serving context. Skew caused by divergent reimplementations becomes structurally impossible when the same computation path serves both purposes. The offline store provides point-in-time correct feature snapshots for training; the online store serves the same features at low latency for inference. Both are derived from the same definitions, maintained in one place.
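The core idea — one definition, two contexts — is small enough to sketch. The class below is a toy, not any particular feature store's API; the names are invented for illustration:

```python
from datetime import datetime, timedelta

class FeatureDefinition:
    """Illustrative single source of truth for one feature: the same
    compute function serves offline (training) and online (serving)."""
    def __init__(self, name, window_days, compute):
        self.name = name
        self.window_days = window_days
        self.compute = compute  # fn(values_in_window) -> feature value

    def value(self, events, as_of):
        # Point-in-time correct: only events inside the window and
        # strictly before `as_of` are visible, in both contexts.
        start = as_of - timedelta(days=self.window_days)
        return self.compute([v for t, v in events if start <= t < as_of])

avg_spend_30d = FeatureDefinition(
    "avg_spend_30d", 30,
    lambda vals: sum(vals) / len(vals) if vals else 0.0,
)
```

The training pipeline calls `avg_spend_30d.value(...)` over historical snapshots; the serving path calls the same method over live events. There is no second implementation to drift.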

CI/CD for model deployment — treating model versions like software releases, with automated validation gates, staged rollouts, and rollback capability — eliminates the manual fragility that causes deployments to stall. A model that passes automated evaluation criteria gets promoted through environments without requiring someone to manually shepherd it through the process.
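A validation gate of that kind reduces to a small, testable function. The metric names and thresholds below are placeholders — each team sets its own — but the shape is representative: an absolute bar plus a no-regression check against the current production model.

```python
def promotion_gate(candidate, baseline, min_auc=0.80, max_regression=0.01):
    """Hypothetical automated gate: promote a candidate model version only
    if it clears an absolute threshold and does not regress the baseline.

    `candidate` and `baseline` are metric dicts, e.g. {"auc": 0.85}.
    Returns (promote: bool, reason: str)."""
    if candidate["auc"] < min_auc:
        return False, "below absolute AUC threshold"
    if candidate["auc"] < baseline["auc"] - max_regression:
        return False, "regresses the current production model"
    return True, "promote"
```

Wired into a pipeline, this is the "automated validation gate" in practice: the candidate either passes and moves to the next environment, or fails with a recorded reason, with no human shepherding either way.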

Monitoring that tracks prediction distributions, not just infrastructure health, is what catches degradation before it becomes a business problem. Knowing that a server is healthy tells you nothing about whether the predictions it is serving are still accurate. You need to track the distribution of model outputs over time and compare against a baseline, then alert when the distributions diverge beyond a threshold.
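One common way to compare an output distribution against its baseline is the Population Stability Index. The sketch below is a minimal stdlib version; the usual rule of thumb (below 0.1 stable, above 0.2 investigate) is a convention, not a guarantee, and the binning choices here are illustrative.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample of model
    outputs and a recent production sample. Larger values mean the
    prediction distribution has drifted further from the baseline."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def bin_fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(bins - 1, int((x - lo) / width))] += 1
        # Floor at a small epsilon so empty bins don't blow up the log.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fracs(expected), bin_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computed daily over a sliding window of predictions and compared to a launch-time baseline, this is the divergence alert the paragraph describes: it fires on distribution change, which precedes measurable accuracy loss when labels arrive late or not at all.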

The Honest Assessment

Most organisations do not need more models. They need their existing model investments to actually function in production — reliably, consistently, with clear ownership and defined operational procedures for what happens when something goes wrong or the world changes.

That is an engineering problem. Solving it requires the same disciplines that make any production software reliable: versioning, testing, deployment automation, monitoring, and clear ownership. Data science expertise is necessary but not sufficient. Until organisations treat ML deployment as an engineering function with engineering standards, the graveyard of shipped notebooks keeps growing.

Written by ATHING

We design and build data infrastructure, automation pipelines, and AI systems for organisations that need them to work.
