
Case Study 05

Building a Data Lakehouse from Scratch on AWS: A Case Study for Complex Organizations

Multinational organization · ~500 employees · $32M revenue · No prior data infrastructure

Results at a Glance

  • 40% under budget: data platform delivered
  • 130+ pipelines: running in Airflow across all sources
  • 72 hours: from development to production for new pipelines

The Challenge: A Complex Organization with No Data Infrastructure

The client was a global organization with roughly 500 employees, $32M in revenue, and no data infrastructure to speak of. Data lived in disconnected systems across the business: a product database, a CRM, a support platform, marketing analytics, and email tooling. Nobody could bring it together, and there was no engineering foundation to build on. The brief was to build something from scratch that could handle the complexity, keep costs low, and be maintainable by a small team, without cutting corners on quality.

How We Built a Cost-Effective Data Lakehouse on AWS

The stack was built around Apache Spark for processing, Apache Airflow for scheduling, and Apache Iceberg for storage, deployed on AWS with Terraform for infrastructure as code and Docker for local development. From the start, the architecture was designed with DataOps principles: versioned environments, reusable base classes that form the foundation of every pipeline, and a structure that makes onboarding new contributors straightforward.
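
To make the shape of the stack concrete, here is a minimal sketch of wiring Spark to an Iceberg catalog on AWS. The catalog name, bucket, and table are illustrative, not the client's actual configuration, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Illustrative session config: an Iceberg catalog ("lake") backed by
# AWS Glue, with table data stored as plain objects on S3.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lakehouse/warehouse")
    .getOrCreate()
)

# Iceberg tables behave like ordinary SQL tables, but add ACID commits,
# snapshots, and schema evolution on top of cheap S3 storage.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.crm")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.crm.accounts (
        account_id STRING,
        name       STRING,
        updated_at TIMESTAMP
    ) USING iceberg
""")
```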

That last point, making onboarding straightforward, mattered. The goal was not to build something only one person could maintain. Every pipeline follows the same patterns, built on top of shared classes that handle the heavy lifting; a simplified sketch of that pattern follows. Two additional engineers were onboarded after the platform went live and were contributing code from day one.
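
The actual base classes are the client's code, so this is only an assumed shape: each source pipeline implements extraction and transformation, while the shared parent handles loading into Iceberg and keeps reruns safe.

```python
from abc import ABC, abstractmethod

from pyspark.sql import DataFrame, SparkSession


class BasePipeline(ABC):
    """Simplified illustration of a shared pipeline base class."""

    # Fully qualified Iceberg table, e.g. "lake.crm.accounts" (illustrative).
    target_table: str

    def __init__(self, spark: SparkSession):
        self.spark = spark

    @abstractmethod
    def extract(self) -> DataFrame:
        """Pull raw data from the source system."""

    def transform(self, df: DataFrame) -> DataFrame:
        """Pass-through by default; subclasses override when needed."""
        return df

    def load(self, df: DataFrame) -> None:
        # createOrReplace() makes reruns idempotent: a failed or repeated
        # run simply produces a new Iceberg snapshot.
        df.writeTo(self.target_table).createOrReplace()

    def run(self) -> None:
        self.load(self.transform(self.extract()))
```

With a base like this, a new source pipeline is little more than an `extract()` implementation, which is what makes onboarding new contributors straightforward.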

We built 130+ Airflow pipelines pulling data from five sources: PostgreSQL for the product, Salesforce for CRM, Freshdesk for customer support, Google Analytics for marketing, and SparkPost for email. Each source had its own complexity and its own data quality challenges. We also built a data dictionary alongside the pipelines so that the team's seven data users, including non-technical ones, could understand what the data meant and work with it confidently.
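
As an illustration of how one of those pipelines might be scheduled, here is a sketch of a single Airflow DAG; the DAG id, schedule, and task are assumptions for the example, not the client's actual DAGs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_salesforce_accounts() -> None:
    # In the real platform this would instantiate a pipeline built on the
    # shared base class sketched above; stubbed out here.
    ...


# One DAG of many, all following the same shared pattern.
with DAG(
    dag_id="salesforce_accounts",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ argument; older versions use schedule_interval
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
):
    PythonOperator(
        task_id="sync_accounts",
        python_callable=run_salesforce_accounts,
    )
```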

New tables and pipelines go from development to production within 72 hours, a process that was established early and has held up as the platform has grown.

The entire platform was delivered 40% under the original budget estimate, without sacrificing scalability or reliability. The infrastructure costs $12K a year to run.

Results

  • Full data lakehouse built from scratch, 40% under budget
  • 130+ Airflow pipelines running reliably across PostgreSQL, Salesforce, Freshdesk, Google Analytics, and SparkPost
  • $12K annual infrastructure cost for a platform serving a 500-person global organization
  • 72-hour deployment cycle from development to production
  • DataOps workflows, versioned environments, and reusable pipeline classes make the platform easy to maintain and extend
  • Data dictionary in place so non-technical users can work with complex data independently
  • Two engineers onboarded after launch and contributing code seamlessly
  • Metabase adopted for self-serve analytics across the organization

The Fractional Data Engineer team has been an outstanding partner to Granularity. They dove straight into our existing data pipeline and infrastructure with minimal guidance needed — getting up to speed quickly and improving on what we had with a level of technical excellence that was immediately evident. What truly set them apart was how far they went beyond the original scope, anticipating needs we hadn't even voiced yet. They delivered faster than expected without ever sacrificing quality, and their communication throughout was clear and consistent — we always knew exactly where things stood. The Fractional Data Engineer team doesn't just do the work; they elevate it. I would recommend them without reservation.

Denis Drost

CRO at Granularity

What's this costing your company?

Run our 2-minute calculator and find out where your company stands.

Calculate Your Data Costs →
Tech: Apache Spark · Apache Airflow · Apache Iceberg · AWS · Terraform · Docker · PostgreSQL · Salesforce · Freshdesk · Google Analytics · SparkPost · Metabase · Python · SQL

Working with a similar challenge?

Book a free 1-hour audit call and we'll tell you exactly what we'd build and why.

Book Your Free Audit Call →