Case Study 05
Building a Data Lakehouse from Scratch on AWS: A Case Study for Complex Organizations
Multinational organization · ~500 employees · $32M revenue · No prior data infrastructure
Results at a Glance
- 40% under budget: data platform delivered
- 130+ pipelines: running in Airflow across all sources
- 72 hours: from development to production for new pipelines
The Challenge: A Complex Organization with No Data Infrastructure
The client was a global organization with hundreds of employees, millions in revenue, and no data infrastructure to speak of. Data lived in disconnected systems across the business: a product database, a CRM, a support platform, marketing analytics, and email tooling. Nobody could bring it together, and there was no engineering foundation to build on. The brief was to build something from scratch that could handle the complexity, keep costs low, and be maintainable by a small team without cutting corners on quality.
How We Built a Cost-Effective Data Lakehouse on AWS
The stack was built around Apache Spark for processing, Apache Airflow for scheduling, and Apache Iceberg for storage, deployed on AWS with Terraform for infrastructure as code and Docker for local development. From the start, the architecture was designed with DataOps principles: versioned environments, reusable base classes that form the foundation of every pipeline, and a structure that makes onboarding new contributors straightforward.
That last point mattered. The goal was not to build something only one person could maintain. Every pipeline follows the same patterns, built on top of shared classes that handle the heavy lifting. Two additional engineers were onboarded after the platform went live, and both were contributing code from day one.
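To make the shared-class idea concrete, here is a minimal sketch of the pattern: a base class owns validation and loading once, and each source pipeline only implements extraction. All names here (`BasePipeline`, `SalesforcePipeline`, the sample rows) are illustrative assumptions, not the client's actual code.

```python
from abc import ABC, abstractmethod


class BasePipeline(ABC):
    """Shared skeleton: subclasses implement extract (and optionally
    transform); validation and loading live in one place."""

    def __init__(self, source: str, table: str):
        self.source = source
        self.table = table

    @abstractmethod
    def extract(self) -> list[dict]:
        """Pull raw rows from the source system."""

    def transform(self, rows: list[dict]) -> list[dict]:
        # Default: pass-through. Subclasses override when a source
        # needs cleanup or reshaping.
        return rows

    def validate(self, rows: list[dict]) -> list[dict]:
        # One shared data-quality rule applied to every pipeline:
        # drop rows with no primary key.
        return [r for r in rows if r.get("id") is not None]

    def load(self, rows: list[dict]) -> int:
        # In production this would write to an Iceberg table on S3;
        # here we just report how many rows would land.
        return len(rows)

    def run(self) -> int:
        return self.load(self.validate(self.transform(self.extract())))


class SalesforcePipeline(BasePipeline):
    def extract(self) -> list[dict]:
        # Stand-in for a real Salesforce API call.
        return [{"id": 1, "name": "Acme"}, {"id": None, "name": "orphan"}]


pipeline = SalesforcePipeline(source="salesforce", table="accounts")
print(pipeline.run())  # one valid row loaded; the orphan row is dropped
```

With this shape, adding a new source is mostly a matter of writing one `extract` method, which is what keeps onboarding and the dev-to-prod cycle fast.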
We built 130+ Airflow pipelines pulling data from five sources: PostgreSQL for the product, Salesforce for CRM, Freshdesk for customer support, Google Analytics for marketing, and SparkPost for email. Each source had its own complexity and its own data quality challenges. A data dictionary was built alongside the pipelines so that the 7 data users on the team, including non-technical ones, could understand what the data meant and work with it confidently.
New tables and pipelines go from development to production within 72 hours, a process that was established early and has held up as the platform grew.
The entire platform was delivered at 40% under the original budget estimate, without sacrificing scalability or reliability. The infrastructure costs $12K a year to run.
Results
- Full data lakehouse built from scratch, 40% under budget
- 130+ Airflow pipelines running reliably across PostgreSQL, Salesforce, Freshdesk, Google Analytics, and SparkPost
- $12K annual infrastructure cost for a platform serving a 500-person global organization
- 72-hour deployment cycle from development to production
- DataOps workflows, versioned environments, and reusable pipeline classes make the platform easy to maintain and extend
- Data dictionary in place so non-technical users can work with complex data independently
- Two engineers onboarded after launch and contributing code seamlessly
- Metabase adopted for self-serve analytics across the organization
“The Fractional Data Engineer team has been an outstanding partner to Granularity. They dove straight into our existing data pipeline and infrastructure with minimal guidance needed — getting up to speed quickly and improving on what we had with a level of technical excellence that was immediately evident. What truly set them apart was how far they went beyond the original scope, anticipating needs we hadn't even voiced yet. They delivered faster than expected without ever sacrificing quality, and their communication throughout was clear and consistent — we always knew exactly where things stood. The Fractional Data Engineer team doesn't just do the work; they elevate it. I would recommend them without reservation.”
What's this costing your company?
Run our 2-minute calculator and find out where your company stands.
Working with a similar challenge?
Book a free 1-hour audit call and we'll tell you exactly what we'd build and why.
Book Your Free Audit Call →