Case Study 03
Building an AWS Event Tracking Pipeline from Scratch: A SaaS Case Study
Flexible office marketplace SaaS · ~150 employees · Fast-growing, investment-backed · Full Pentaho to AWS migration
Results at a Glance
- 2 weeks to get the pipeline live
- 5,000+ events/day processed through the pipeline
- Pentaho to AWS: legacy infrastructure modernised
The Challenge: Legacy Infrastructure That Could Not Keep Up
The startup had just closed an investment round and was scaling fast. They had two data analysts on the team and were hiring six more. The problem was clear: their existing Pentaho setup was built for a smaller operation. It could handle the database, but not the variety of data sources a growing analytics team would need. And it had no room for anything more advanced. They needed a proper cloud infrastructure that could support more analysts, more sources, and more sophisticated work, without the team constantly running into the limits of the tools underneath them.

One of the first pipelines built on that new foundation was pulling user behaviour data from Segment.io, the event tracking tool they were using to understand how users interacted with their product.
How We Built the Event Tracking Pipeline on AWS
We started by understanding the data. Segment.io was generating a high volume of events every day, but the data was messy. Before anything could be used for product decisions or user engagement analysis, it needed to be properly modelled and cleaned.
The pipeline we built worked like this: AWS Lambda collected the raw event data from Segment.io and delivered it to an S3 landing zone. From there, AWS Glue handled the transformation layer, processing and structuring the data before it was stored in PostgreSQL and made available for the analytics team to query.
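To make that flow concrete, here is a minimal sketch of the collection step, assuming a webhook-style delivery from Segment.io through API Gateway. The bucket name, key layout, and payload fields are illustrative placeholders, not the production code:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

LANDING_BUCKET = "event-landing-zone"  # placeholder name


def handler(event, context):
    """Receive a Segment webhook payload (via API Gateway) and land it in S3.

    Keys are date-partitioned up front so downstream Glue jobs can
    process a single day's data instead of rescanning everything.
    """
    payload = json.loads(event["body"])  # raw Segment event
    now = datetime.now(timezone.utc)
    key = (
        f"raw/segment/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{payload.get('messageId', context.aws_request_id)}.json"
    )
    s3.put_object(Bucket=LANDING_BUCKET, Key=key, Body=json.dumps(payload))
    return {"statusCode": 200}
```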
A key design decision was partitioning the data from the start. With thousands of events coming in daily, processing everything every time would have been wasteful and expensive. Partitioning meant data was only processed once, keeping compute costs under control as volume grew.
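In production this ran as an AWS Glue job; the sketch below shows the equivalent logic in plain PySpark, with placeholder paths and credentials, so the partition-at-a-time idea is visible. Each run reads a single day's partition, deduplicates, and appends the result to PostgreSQL:

```python
import sys

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("segment-daily-transform").getOrCreate()

# Process exactly one day's partition (passed in by the scheduler) --
# data that was already transformed is never re-read or re-billed.
run_date = sys.argv[1]  # e.g. "2024/06/14" -> year/month/day
year, month, day = run_date.split("/")
day_path = f"s3://event-landing-zone/raw/segment/year={year}/month={month}/day={day}/"

raw = spark.read.json(day_path)

events = (
    raw.select(
        F.col("messageId").alias("message_id"),
        F.col("userId").alias("user_id"),
        F.col("event").alias("event_name"),
        F.to_timestamp("timestamp").alias("event_ts"),
    )
    .dropDuplicates(["message_id"])  # Segment delivers at-least-once
)

# Append the cleaned day into the warehouse table the analysts query.
events.write.jdbc(
    url="jdbc:postgresql://warehouse-host:5432/analytics",  # placeholder
    table="events_clean",
    mode="append",
    properties={"user": "etl_user", "password": "..."},
)
```

Because each day lands under its own S3 prefix, the job's input stays proportional to one day of events rather than the full history, which is what keeps compute costs flat as volume grows.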
The Segment.io data was also heavily polluted, so a significant part of the work was data modelling: understanding what each event actually represented, cleaning up the noise, and building a structure that was useful to work with.
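The event names below are invented for illustration, but they show the shape of that modelling work: collapsing inconsistent names for the same user action into one canonical event and filtering out noise before anything reaches the warehouse.

```python
# Hypothetical examples of the kind of normalisation the raw stream needed:
# the same user action often arrives under several inconsistent names,
# and some events are pure noise.

CANONICAL_EVENTS = {
    "Booking Completed": "booking_completed",
    "booking-completed": "booking_completed",
    "Office Viewed": "office_viewed",
    "office_page_view": "office_viewed",
}

NOISE_EVENTS = {"heartbeat", "debug_ping"}


def normalise(event: dict) -> dict | None:
    """Map a raw Segment event to the canonical model, or drop it as noise."""
    name = event.get("event", "")
    if name in NOISE_EVENTS:
        return None
    return {**event, "event": CANONICAL_EVENTS.get(name, name)}
```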
The work did not stop once the data reached the analysts. We worked closely with the team to educate them on what each data point meant and how to interpret it correctly. And when they needed additional data points, like specific user IDs or particular click events that were not being tracked yet, we collaborated with the development team to instrument them properly in Segment.io.
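Instrumenting a new data point in Segment is a small change on the product side. Here is a sketch using Segment's analytics-python library with a made-up event and properties; the product itself may have used a different Segment SDK, but the shape of the call is the same across libraries:

```python
import analytics  # Segment's analytics-python library

analytics.write_key = "SEGMENT_WRITE_KEY"  # placeholder

# Hypothetical new event requested by the analysts: which search filter
# a user applied before viewing an office listing.
analytics.track(
    user_id="user_123",
    event="Search Filter Applied",
    properties={"filter": "capacity", "value": "10+"},
)
```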
Results
- Event tracking pipeline live in under two weeks, processing 5,000+ events per day
- Clean, partitioned data in PostgreSQL ready for the analytics team to query from day one
- Compute costs kept under control by partitioning the data from the start
- Data quality issues in Segment.io identified, modelled around, and resolved
- Analytics team educated on what the data points mean and how to use them correctly
- New tracking requirements scoped and implemented together with the development team
- Infrastructure built to support a growing team, not just the two analysts already in place
What's this costing your company?
Run our 2-minute calculator and find out where your company stands.
Working with a similar challenge?
Book a free 1-hour audit call and we'll tell you exactly what we'd build and why.
Book Your Free Audit Call →