
Case Study 05

Building a Data Lakehouse from Scratch on AWS: A Case Study for Complex Organizations

Multinational organization · ~500 employees · $32M revenue · No prior data infrastructure

Results at a Glance

  • 40% under budget: data platform delivered
  • 130+ pipelines: running in Airflow across all sources
  • 72 hours: from development to production for new pipelines

The Challenge: A Complex Organization with No Data Infrastructure

The client was a global organization with roughly 500 employees, $32M in revenue, and no data infrastructure to speak of. Data lived in disconnected systems across the business: a product database, a CRM, a support platform, marketing analytics, and email tooling. Nobody could bring it together, and there was no engineering foundation to build on. The brief was to build something from scratch that could handle the complexity, keep costs low, and be maintainable by a small team, without cutting corners on quality.

How We Built a Cost-Effective Data Lakehouse on AWS

The stack was built around Apache Spark for processing, Apache Airflow for scheduling, and Apache Iceberg for storage, deployed on AWS with Terraform for infrastructure as code and Docker for local development. From the start, the architecture was designed with DataOps principles: versioned environments, reusable base classes that form the foundation of every pipeline, and a structure that makes onboarding new contributors straightforward.
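
To make the shape of the stack concrete, here is a minimal sketch of wiring Spark to an Iceberg catalog on AWS. The catalog name, bucket, and table are illustrative, not the client's actual configuration, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Illustrative session config: an Iceberg catalog ("lake") backed by
# AWS Glue, with table data stored as plain objects on S3.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lakehouse/warehouse")
    .getOrCreate()
)

# Iceberg tables behave like ordinary SQL tables, but add ACID commits,
# snapshots, and schema evolution on top of cheap S3 storage.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.crm")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.crm.accounts (
        account_id STRING,
        name       STRING,
        updated_at TIMESTAMP
    ) USING iceberg
""")
```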

That last point, making onboarding straightforward, mattered. The goal was not to build something only one person could maintain. Every pipeline follows the same patterns, built on top of shared classes that handle the heavy lifting; a simplified sketch of that pattern follows. Two additional engineers were onboarded after the platform went live and were contributing code from day one.
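
The actual base classes are the client's code, so this is only an assumed shape: each source pipeline implements extraction and transformation, while the shared parent handles loading into Iceberg and keeps reruns safe.

```python
from abc import ABC, abstractmethod

from pyspark.sql import DataFrame, SparkSession


class BasePipeline(ABC):
    """Simplified illustration of a shared pipeline base class."""

    # Fully qualified Iceberg table, e.g. "lake.crm.accounts" (illustrative).
    target_table: str

    def __init__(self, spark: SparkSession):
        self.spark = spark

    @abstractmethod
    def extract(self) -> DataFrame:
        """Pull raw data from the source system."""

    def transform(self, df: DataFrame) -> DataFrame:
        """Pass-through by default; subclasses override when needed."""
        return df

    def load(self, df: DataFrame) -> None:
        # createOrReplace() makes reruns idempotent: a failed or repeated
        # run simply produces a new Iceberg snapshot.
        df.writeTo(self.target_table).createOrReplace()

    def run(self) -> None:
        self.load(self.transform(self.extract()))
```

With a base like this, a new source pipeline is little more than an `extract()` implementation, which is what makes onboarding new contributors straightforward.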

We built 130+ Airflow pipelines pulling data from five sources: PostgreSQL for the product, Salesforce for CRM, Freshdesk for customer support, Google Analytics for marketing, and SparkPost for email. Each source had its own complexity and its own data quality challenges. We also built a data dictionary alongside the pipelines so that the team's seven data users, including non-technical ones, could understand what the data meant and work with it confidently.
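
As an illustration of how one of those pipelines might be scheduled, here is a sketch of a single Airflow DAG; the DAG id, schedule, and task are assumptions for the example, not the client's actual DAGs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_salesforce_accounts() -> None:
    # In the real platform this would instantiate a pipeline built on the
    # shared base class sketched above; stubbed out here.
    ...


# One DAG of many, all following the same shared pattern.
with DAG(
    dag_id="salesforce_accounts",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ argument; older versions use schedule_interval
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
):
    PythonOperator(
        task_id="sync_accounts",
        python_callable=run_salesforce_accounts,
    )
```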

New tables and pipelines go from development to production within 72 hours, a process that was established early and has held up as the platform has grown.

The entire platform was delivered 40% under the original budget estimate, without sacrificing scalability or reliability. The infrastructure costs $12K a year to run.

Results

  • Full data lakehouse built from scratch, 40% under budget
  • 130+ Airflow pipelines running reliably across PostgreSQL, Salesforce, Freshdesk, Google Analytics, and SparkPost
  • $12K annual infrastructure cost for a platform serving a 500-person global organization
  • 72-hour deployment cycle from development to production
  • DataOps workflows, versioned environments, and reusable pipeline classes make the platform easy to maintain and extend
  • Data dictionary in place so non-technical users can work with complex data independently
  • Two engineers onboarded after launch and contributing code seamlessly
  • Metabase adopted for self-serve analytics across the organization

The Fractional Data Engineer team has been an outstanding partner to Granularity. They dove straight into our existing data pipeline and infrastructure with minimal guidance needed — getting up to speed quickly and improving on what we had with a level of technical excellence that was immediately evident. What truly set them apart was how far they went beyond the original scope, anticipating needs we hadn't even voiced yet. They delivered faster than expected without ever sacrificing quality, and their communication throughout was clear and consistent — we always knew exactly where things stood. The Fractional Data Engineer team doesn't just do the work; they elevate it. I would recommend them without reservation.

Denis Drost

CRO at Granularity

What's this costing your company?

Run our 2-minute calculator and find out where your company stands.

Calculate Your Data Costs →
Tech: Apache Spark · Apache Airflow · Apache Iceberg · AWS · Terraform · Docker · PostgreSQL · Salesforce · Freshdesk · Google Analytics · SparkPost · Metabase · Python · SQL

Working with a similar challenge?

Book a free 1-hour audit call and we'll tell you exactly what we'd build and why.

Book Your Free Audit Call →