Case Study 03
Building an AWS Event Tracking Pipeline from Scratch: A SaaS Case Study
Flexible office marketplace SaaS · ~150 employees · Fast-growing, investment-backed · Full Pentaho to AWS migration
Results at a Glance
- 2 weeks to get the pipeline live
- 5,000+ events/day processed through the pipeline
- Pentaho to AWS: legacy infrastructure modernised
The Challenge: Legacy Infrastructure That Could Not Keep Up
The startup had just closed an investment round and was scaling fast. They had two data analysts on the team and were hiring six more. The problem was clear: their existing Pentaho setup was built for a smaller operation. It could handle the database, but not the variety of data sources a growing analytics team would need. And it had no room for anything more advanced. They needed a proper cloud infrastructure that could support more analysts, more sources, and more sophisticated work, without the team constantly running into the limits of the tools underneath them.

One of the first pipelines built on that new foundation was pulling user behaviour data from Segment.io, the event tracking tool they were using to understand how users interacted with their product.
How We Built the Event Tracking Pipeline on AWS
We started by understanding the data. Segment.io was generating a high volume of events every day, but the data was messy. Before anything could be used for product decisions or user engagement analysis, it needed to be properly modelled and cleaned.
The pipeline we built worked like this: AWS Lambda collected the raw event data from Segment.io and delivered it to an S3 landing zone. From there, AWS Glue handled the transformation layer, processing and structuring the data before it was stored in PostgreSQL and made available for the analytics team to query.
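To make that flow concrete, here is a minimal sketch of the collection step, assuming a webhook-style delivery from Segment.io through API Gateway. The bucket name, key layout, and payload fields are illustrative placeholders, not the production code:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

LANDING_BUCKET = "event-landing-zone"  # placeholder name


def handler(event, context):
    """Receive a Segment webhook payload (via API Gateway) and land it in S3.

    Keys are date-partitioned up front so downstream Glue jobs can
    process a single day's data instead of rescanning everything.
    """
    payload = json.loads(event["body"])  # raw Segment event
    now = datetime.now(timezone.utc)
    key = (
        f"raw/segment/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{payload.get('messageId', context.aws_request_id)}.json"
    )
    s3.put_object(Bucket=LANDING_BUCKET, Key=key, Body=json.dumps(payload))
    return {"statusCode": 200}
```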
A key design decision was partitioning the data from the start. With thousands of events coming in daily, processing everything every time would have been wasteful and expensive. Partitioning meant data was only processed once, keeping compute costs under control as volume grew.
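In production this ran as an AWS Glue job; the sketch below shows the equivalent logic in plain PySpark, with placeholder paths and credentials, so the partition-at-a-time idea is visible. Each run reads a single day's partition, deduplicates, and appends the result to PostgreSQL:

```python
import sys

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("segment-daily-transform").getOrCreate()

# Process exactly one day's partition (passed in by the scheduler) --
# data that was already transformed is never re-read or re-billed.
run_date = sys.argv[1]  # e.g. "2024/06/14" -> year/month/day
year, month, day = run_date.split("/")
day_path = f"s3://event-landing-zone/raw/segment/year={year}/month={month}/day={day}/"

raw = spark.read.json(day_path)

events = (
    raw.select(
        F.col("messageId").alias("message_id"),
        F.col("userId").alias("user_id"),
        F.col("event").alias("event_name"),
        F.to_timestamp("timestamp").alias("event_ts"),
    )
    .dropDuplicates(["message_id"])  # Segment delivers at-least-once
)

# Append the cleaned day into the warehouse table the analysts query.
events.write.jdbc(
    url="jdbc:postgresql://warehouse-host:5432/analytics",  # placeholder
    table="events_clean",
    mode="append",
    properties={"user": "etl_user", "password": "..."},
)
```

Because each day lands under its own S3 prefix, the job's input stays proportional to one day of events rather than the full history, which is what keeps compute costs flat as volume grows.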
The Segment.io data was also heavily polluted, so a significant part of the work was data modelling: understanding what each event actually represented, cleaning up the noise, and building a structure that was useful to work with.
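The event names below are invented for illustration, but they show the shape of that modelling work: collapsing inconsistent names for the same user action into one canonical event and filtering out noise before anything reaches the warehouse.

```python
# Hypothetical examples of the kind of normalisation the raw stream needed:
# the same user action often arrives under several inconsistent names,
# and some events are pure noise.

CANONICAL_EVENTS = {
    "Booking Completed": "booking_completed",
    "booking-completed": "booking_completed",
    "Office Viewed": "office_viewed",
    "office_page_view": "office_viewed",
}

NOISE_EVENTS = {"heartbeat", "debug_ping"}


def normalise(event: dict) -> dict | None:
    """Map a raw Segment event to the canonical model, or drop it as noise."""
    name = event.get("event", "")
    if name in NOISE_EVENTS:
        return None
    return {**event, "event": CANONICAL_EVENTS.get(name, name)}
```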
The work did not stop once the data reached the analysts. We worked closely with the team to educate them on what each data point meant and how to interpret it correctly. And when they needed additional data points, like specific user IDs or particular click events that were not being tracked yet, we collaborated with the development team to instrument them properly in Segment.io.
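Instrumenting a new data point in Segment is a small change on the product side. Here is a sketch using Segment's analytics-python library with a made-up event and properties; the product itself may have used a different Segment SDK, but the shape of the call is the same across libraries:

```python
import analytics  # Segment's analytics-python library

analytics.write_key = "SEGMENT_WRITE_KEY"  # placeholder

# Hypothetical new event requested by the analysts: which search filter
# a user applied before viewing an office listing.
analytics.track(
    user_id="user_123",
    event="Search Filter Applied",
    properties={"filter": "capacity", "value": "10+"},
)
```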
Results
- Event tracking pipeline live in under two weeks, processing 5,000+ events per day
- Clean, partitioned data in PostgreSQL ready for the analytics team to query from day one
- Compute costs kept under control by partitioning the data from the start
- Data quality issues in Segment.io identified, modelled around, and resolved
- Analytics team educated on what the data points mean and how to use them correctly
- New tracking requirements scoped and implemented together with the development team
- Infrastructure built to support a growing team, not just the two analysts already in place
What's this costing your company?
Run our 2-minute calculator and find out where your company stands.
Working with a similar challenge?
Book a free 1-hour audit call and we'll tell you exactly what we'd build and why.
Book Your Free Audit Call →