Data Architecture·7 min read·14 January 2025

Lakehouse vs Data Warehouse: When the Classic Architecture Isn't Enough

The data warehouse isn't dead. But for a growing number of organisations, it was never the right answer in the first place — and the cost of that mismatch only becomes visible once you're already committed.

The warehouse earned its position. Structured data, fast SQL, mature tooling, decades of investment in query optimisation — for teams running financial reporting, sales analytics, or anything that lives neatly in rows and columns, the warehouse is a solved problem. Snowflake, BigQuery, and Redshift are genuinely excellent at what they do. The issue is the assumptions baked into what they do.

A warehouse wants structured data. It wants schema defined up front. It wants transformation to happen before storage — the ETL model. That was a reasonable constraint when the dominant analytical workload was a SQL query from a BI tool. It stops being reasonable when you have machine learning engineers consuming raw event logs, data scientists running feature extraction on unstructured text, or operational systems generating JSON payloads with a schema that evolves every sprint.

Where the Warehouse Breaks Down

The friction shows up in specific places. First, semi-structured data. Log files, API responses, sensor streams, clickstream events — almost none of this arrives as clean tabular data. Getting it into a warehouse requires a transformation step that flattens it, normalises it, and discards the things you didn't know you'd need. That last part is where things go wrong. You transform the data to fit the schema you need today. Six months later, someone needs a field you didn't keep.
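The failure mode is easy to sketch. The event shape, field names, and flattening function below are all invented for illustration, but the pattern is the one described above: schema-on-write keeps only the columns you defined today, and silently drops the rest.

```python
# Illustrative sketch: schema-on-write flattening discards fields you
# didn't anticipate needing. Event shape and field names are invented.

def flatten_for_warehouse(event: dict) -> dict:
    """Keep only the columns today's schema defines; everything else is lost."""
    return {
        "user_id": event["user"]["id"],
        "event_type": event["type"],
        "ts": event["ts"],
    }

raw_event = {
    "user": {"id": 42, "plan": "pro"},       # "plan" is not in today's schema
    "type": "click",
    "ts": "2025-01-14T09:30:00Z",
    "context": {"experiment": "variant-b"},  # neither is "context"
}

row = flatten_for_warehouse(raw_event)

# Six months later, someone asks for the experiment assignment — it's gone:
assert "context" not in row and "plan" not in row
```

If the raw event had landed in object storage first, the flattened row could be regenerated later with a wider schema; once only the transformed row exists, the dropped fields are unrecoverable.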

Second, ML workloads. A data scientist training a model doesn't want to query a warehouse. They want raw data — often the pre-transformation version — accessible in bulk, in formats that Python can read directly. Exporting data from a warehouse to feed an ML pipeline is an operational overhead that compounds at scale. It also means you're maintaining two copies of the same data in different shapes.

Third, cost at volume. Warehouses either bundle compute and storage pricing or charge for storage at a premium over plain object storage. When your data volumes get large, the economics start to look uncomfortable. Object storage on S3 or GCS is cheap. Warehouse storage is not.
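The gap is easy to put numbers on. The per-terabyte rates below are illustrative placeholders, not quotes from any vendor's current price list; the point is the shape of the comparison, not the exact figures.

```python
# Back-of-envelope storage comparison. Both rates are hypothetical
# placeholders — check your vendor's actual pricing before drawing conclusions.

S3_STANDARD_PER_TB_MONTH = 23.0   # ~$0.023/GB-month, a typical object-storage rate
WAREHOUSE_PER_TB_MONTH = 40.0     # hypothetical managed-warehouse storage rate

def monthly_storage_cost(tb: float, rate_per_tb: float) -> float:
    return tb * rate_per_tb

volume_tb = 500  # half a petabyte of retained event data
object_cost = monthly_storage_cost(volume_tb, S3_STANDARD_PER_TB_MONTH)
warehouse_cost = monthly_storage_cost(volume_tb, WAREHOUSE_PER_TB_MONTH)

print(f"object storage: ${object_cost:,.0f}/month")
print(f"warehouse:      ${warehouse_cost:,.0f}/month")
```

At small volumes the difference is noise; at hundreds of terabytes, retained indefinitely, it becomes a recurring line item.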

What a Lakehouse Actually Is

The lakehouse pattern starts with the observation that object storage is cheap and infinitely scalable, but raw object storage has no structure, no transactions, and no schema enforcement. You can dump anything into S3, but you can't run a reliable MERGE statement against it.

Table formats — Delta Lake, Apache Iceberg, Apache Hudi — bridge that gap. They sit on top of object storage and add the things you need for serious data work: ACID transactions, schema enforcement and evolution, time-travel queries, partition management, and efficient metadata handling. The combination gives you warehouse-quality semantics on data-lake economics.
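The core mechanism behind these formats can be sketched in a few lines. This is a toy, in-memory model of the transaction-log idea — every name and structure here is invented for illustration; real formats like Delta Lake persist the log as files in object storage alongside the data, with far more machinery for concurrency and metadata.

```python
# Toy sketch of a transaction log over immutable data files: commits are
# appended atomically, and any past version can be reconstructed by
# replaying the log up to that point (the basis of time-travel queries).

class ToyTableLog:
    def __init__(self):
        self.commits = []  # each commit records the data files it added

    def commit(self, added_files):
        # Appending one commit as a unit is what gives readers a consistent
        # snapshot: they see the whole commit or none of it.
        self.commits.append({"version": len(self.commits), "add": list(added_files)})

    def snapshot(self, version=None):
        """Resolve the set of live data files as of a version (time travel)."""
        if version is None:
            version = len(self.commits) - 1
        files = []
        for c in self.commits[: version + 1]:
            files.extend(c["add"])
        return files

log = ToyTableLog()
log.commit(["part-000.parquet"])
log.commit(["part-001.parquet", "part-002.parquet"])

assert log.snapshot(version=0) == ["part-000.parquet"]  # the table as of v0
assert len(log.snapshot()) == 3                         # the latest snapshot
```

Deletes and updates work the same way in the real formats: a commit records removed files as well as added ones, so the data files stay immutable while the log defines what the table currently contains.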

In practice, a lakehouse architecture typically looks like raw data landing in object storage, processed data stored as Delta or Iceberg tables in the same storage tier, and a compute layer — Spark, Trino, Databricks, or a warehouse connector — sitting on top. SQL analysts query the processed tables. ML engineers read raw or lightly processed data directly. Both work against the same underlying storage.

The Trade-Offs You Don't Hear About

The lakehouse has more moving parts. A managed warehouse abstracts away storage, compute, metadata management, and query planning behind a single product and a single support contract. A lakehouse requires you to make choices: which table format, which compute engine, how to handle metadata, how to manage file sizes and compaction. Those choices require engineering maturity to make well and operational investment to maintain.

Cold-start latency is a real issue. Warehouses are optimised for query performance. Object storage, even with a table format on top, introduces latency for small queries that hit a lot of metadata. If your primary workload is ad-hoc SQL from business analysts who expect sub-second response times, the lakehouse is a harder sell.

Tooling maturity is improving fast but is still uneven. Delta Lake and Iceberg are both production-ready, but the ecosystem around them — connectors, monitoring, access control, governance tooling — is less complete than what you get with a mature warehouse product. You close those gaps with engineering effort.

When to Choose Each

The warehouse is still the right answer if your data is structured, your team is primarily SQL-focused, your volumes are manageable, and your workloads are predictable analytical queries. The complexity ceiling is lower and the operational overhead is smaller. Don't introduce a lakehouse because it sounds impressive.

The lakehouse becomes the right answer when several of the following are true:

  • ML and analytics share the same data. If data scientists and SQL analysts both need access to the same data but in different forms, a lakehouse eliminates the need for separate storage tiers and synchronisation pipelines between them.
  • Schemas change frequently. Event-driven systems, microservices architectures, and rapidly evolving products generate data with schemas that don't stay stable. A warehouse forces schema changes to be managed as migrations. A lakehouse with schema evolution support handles them as part of normal operation.
  • Volume makes warehouse storage expensive. When data volumes reach the scale where warehouse storage costs become a line item in budget reviews, the economics of object storage start to look compelling.
  • You need to store raw data alongside processed data. The ability to time-travel back to the raw state, reprocess historical data with new logic, or audit exactly what was in a dataset at a given point in time — these are native capabilities in a lakehouse that require significant additional infrastructure in a pure warehouse model.
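The schema-evolution point above is worth making concrete. This is a hedged sketch of the behaviour table formats provide, not any real format's API: new columns are merged into the table schema on write, and older rows read back with nulls for columns they predate. All names and shapes here are invented.

```python
# Toy model of schema evolution: widen the schema on write, project old
# rows onto the current schema on read. No migration step required.

def evolve_schema(table_schema: dict, incoming_row: dict) -> dict:
    """Widen the schema with any fields the incoming row introduces."""
    evolved = dict(table_schema)
    for field, value in incoming_row.items():
        evolved.setdefault(field, type(value).__name__)
    return evolved

def read_row(row: dict, schema: dict) -> dict:
    """Project a stored row onto the current schema, nulling missing fields."""
    return {field: row.get(field) for field in schema}

schema = {"user_id": "int", "event_type": "str"}
old_row = {"user_id": 1, "event_type": "click"}

# A new sprint adds an "experiment" field — no migration, just a wider schema:
new_row = {"user_id": 2, "event_type": "click", "experiment": "variant-b"}
schema = evolve_schema(schema, new_row)

assert read_row(old_row, schema)["experiment"] is None
assert read_row(new_row, schema)["experiment"] == "variant-b"
```

In a warehouse, the same change is an `ALTER TABLE` migration coordinated with every pipeline that writes to the table; with schema evolution it is part of normal operation.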

The Honest Assessment

Most teams don't need a lakehouse yet. The majority of analytics problems that companies actually have — reporting, dashboards, basic cohort analysis, funnel metrics — are solved better by a well-structured warehouse and a competent dbt project than by a distributed lakehouse with multiple compute engines.

The teams that genuinely need a lakehouse, though, really need it. They're the ones paying significant money to move data between a warehouse and an object store on a nightly basis. They're the ones maintaining duplicate datasets in different formats for different consumer teams. They're the ones who can't easily reprocess historical data because transformation logic was applied before storage and the raw data is gone.

Retrofitting a lakehouse onto a warehouse-first architecture is painful. It requires data migration, pipeline rewrites, tooling changes, and retraining the team. The right time to make the choice is before you've built everything around the wrong one — which means understanding your actual workload requirements before you commit, not six months into production.

Written by ATHING

We design and build data infrastructure, automation pipelines, and AI systems for organisations that need them to work.

Talk to Us