May 9, 2026

How to Set Up Apache Airflow for a Small Data Team

airflow · orchestration · data engineering · pipelines

Apache Airflow is powerful. It's also the thing that scares a lot of small data teams into either avoiding it entirely or setting it up wrong and spending months firefighting.

This is for teams of 1 to 3 data people who need real pipeline orchestration and want to get Airflow running without it becoming a project in itself.

Do You Actually Need Airflow?

Not every team needs Airflow. If you have only a handful of pipelines (fewer than five or so), a tight engineering budget, and no complex dependencies between jobs, then cron jobs, dbt Cloud's built-in scheduler, or Prefect's free tier might be enough. Simpler tools have lower maintenance overhead, and that matters when you don't have dedicated infrastructure engineers.

You're ready for Airflow when you have multiple pipelines with dependencies (pipeline B should only run after pipeline A succeeds), when you need visibility into what ran and whether it succeeded or failed, when you want to retry failed tasks automatically with alerting, or when your team is growing and needs a shared orchestration layer rather than everyone running scripts locally.

If any of those are true, Airflow is worth the setup cost.

Choosing Where to Run It

The first architectural decision: where does Airflow live?

Managed Airflow (recommended for small teams). Astronomer (Cloud or Hybrid), Google Cloud Composer if you're on GCP, and Amazon MWAA if you're on AWS are all managed Airflow services. You get Airflow without managing the underlying infrastructure. For a small team, this is usually the right call. The time you save not managing Airflow's scheduler, workers, and database is worth the cost premium. Astronomer's free tier is a good starting point for teams evaluating the platform.

Self-hosted on Kubernetes. If you're running on AWS or GCP and want more control, you can deploy Airflow yourself via Helm charts on Kubernetes. This is more work to set up and more work to maintain, but gives you full control over the environment. For teams with a DevOps person or infrastructure knowledge, this is viable. For a single data engineer, it's usually too much overhead.

Docker Compose locally, then EC2 or GCE for production. The simplest self-hosted path: run Docker Compose locally to develop and test DAGs, then deploy to a small EC2 or GCE instance for production. This works for early-stage setups with light workloads, but it doesn't scale well and requires more maintenance than a managed service. Not recommended long-term.

Core Concepts Worth Understanding Before You Start

Airflow has a learning curve. These are the five concepts that matter most before you write your first DAG.

DAG (Directed Acyclic Graph): A workflow. In Airflow, everything you want to orchestrate is a DAG: a set of tasks with defined dependencies and a schedule.
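To make "a set of tasks with defined dependencies" concrete, here is a minimal sketch that uses only Python's standard library, not Airflow. The task names are hypothetical. It shows the two properties that matter: dependencies determine the run order, and a cycle makes the graph invalid.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "report": {"transform"},
}

def run_order(deps):
    """Return one valid execution order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)  # ['extract', 'transform', 'report']

# A graph where two tasks wait on each other is not acyclic, so it has no valid order.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)  # True
```

Airflow does the same kind of ordering for you; the point here is only that "acyclic" is what guarantees every task has a well-defined place in the run.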

Operator: A single task type. BashOperator runs a bash command. PythonOperator runs Python. BigQueryInsertJobOperator (from the Google provider package) runs a query job in BigQuery. Most of what you need exists as a pre-built operator.

Task: An instance of an operator. A DAG is made of tasks; each task is one unit of work.

Schedule: When the DAG runs. You set this as a cron expression (0 6 * * * means every day at 06:00, in UTC unless you configure a different timezone) or use Airflow's presets like @daily.
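If cron syntax is new to you, the sketch below labels the five fields of an expression. It is plain Python for illustration, not an Airflow API: describe_cron is a hypothetical helper, and the preset table lists only three of Airflow's cron presets.

```python
# The five positions in a standard cron expression, left to right.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

# A few of Airflow's presets and the cron expressions they stand for.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
}

def describe_cron(expression):
    """Map each field of a 5-part cron expression (or a known preset) to its name."""
    expression = PRESETS.get(expression, expression)
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))

six_am = describe_cron("0 6 * * *")
print(six_am["hour"], six_am["minute"])  # 6 0

# "@daily" means midnight, not "once at some point during the day".
print(describe_cron("@daily") == describe_cron("0 0 * * *"))  # True
```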

XCom: How tasks pass data to each other. Use sparingly. Heavy XCom usage is a code smell. If you're passing large datasets between tasks, use an intermediate storage layer like S3, GCS, or a database table instead.
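One way to keep XCom light is to pass a reference to the data rather than the data itself. The sketch below shows that pattern with two plain Python functions of the kind you would hand to PythonOperator, whose return value Airflow pushes to XCom. The function names, the sample rows, and the use of a local temp directory as a stand-in for S3 or GCS are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def extract_to_storage(output_dir):
    """Write the dataset to storage and return only its location."""
    rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]  # placeholder data
    path = Path(output_dir) / "extract.json"
    path.write_text(json.dumps(rows))
    return str(path)  # a short string: this is all that would travel through XCom

def transform_from_storage(path):
    """Load the dataset from the location the upstream task handed over."""
    rows = json.loads(Path(path).read_text())
    return sum(row["amount"] for row in rows)

# Local stand-in for the two tasks running in sequence.
with tempfile.TemporaryDirectory() as tmp:
    reference = extract_to_storage(tmp)
    total = transform_from_storage(reference)

print(total)  # 35
```

In a real deployment the path would point at shared storage, since a local disk is not visible to tasks that run on different workers.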

A Minimal Working DAG

Here's a simple DAG that runs a Python function, then a bash command, in sequence:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime

def extract_data():
    # your extraction logic here
    print("Extracting data...")

with DAG(
    dag_id="my_first_pipeline",
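    start_date=datetime(2024, 1, 1),
    # "schedule" replaces the deprecated "schedule_interval" as of Airflow 2.4
    schedule="@daily",
    # don't backfill a run for every day between start_date and today
    catchup=False,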
) as dag:

    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_data,
    )

    transform = BashOperator(
        task_id="transform",
        bash_command="dbt run --select my_model",
    )

    extract >> transform  # extract runs first, transform runs after

This is the pattern you'll repeat. Add more tasks, chain dependencies with >>, adjust the schedule.

Project Structure That Scales

Small teams often start with all DAGs in a single folder. That works for 5 DAGs. It doesn't work for 30. Start with structure early:

dags/
  ingestion/
    stripe_sync.py
    hubspot_sync.py
  transformation/
    dbt_daily_run.py
  reporting/
    send_weekly_report.py
plugins/
  custom_operators/
requirements.txt

Group DAGs by function. If you use dbt, keep your dbt invocations in their own DAGs, separate from your ingestion logic.

Monitoring and Alerting

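Airflow's UI gives you visibility into what ran and what failed. Set up email alerting for failures from the start. You don't want to discover a broken pipeline from a Slack message asking why a dashboard is wrong. Note that Airflow only sends email if an SMTP server is configured for it, so check that before you rely on the alerts.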

In your DAG's default arguments:

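from datetime import timedelta

# Applied to every task in the DAG; pass it as DAG(..., default_args=default_args)
default_args = {
    "email": ["your-team@yourcompany.com"],
    "email_on_failure": True,   # alert once the task has finally failed
    "email_on_retry": False,    # stay quiet while retries are still pending
    "retries": 2,               # up to 3 attempts in total
    "retry_delay": timedelta(minutes=5),
}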

Retries plus email on failure is the minimum viable monitoring setup.
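It's worth knowing how long this setup can stay silent. With email_on_retry off, the failure email only goes out after the last attempt fails. The sketch below is a rough upper bound, assuming every attempt runs for the full task runtime before failing and ignoring scheduler latency; the 10-minute runtime and the function name are hypothetical.

```python
from datetime import timedelta

def worst_case_time_to_alert(task_runtime, retries, retry_delay):
    """Upper bound on the time between a task's first start and its failure email."""
    attempts = retries + 1  # the first try plus each retry
    return attempts * task_runtime + retries * retry_delay

delay = worst_case_time_to_alert(
    task_runtime=timedelta(minutes=10),  # hypothetical task duration
    retries=2,
    retry_delay=timedelta(minutes=5),
)
print(delay)  # 0:40:00
```

If that delay is too long for a pipeline that feeds a morning dashboard, shorten retry_delay or schedule the DAG earlier.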

The Common Mistakes We See

Using Airflow as a compute engine. Airflow is an orchestrator, not a place to run heavy data processing. If you're doing large transformations, push that work to dbt, Spark, BigQuery, or another compute layer. Airflow just calls it.

Too many dependencies between DAGs. Airflow's cross-DAG dependencies (using TriggerDagRunOperator or sensors) can get complicated fast. If you need complex dependency chains, reconsider your pipeline design first.

Not using task groups. If a DAG has more than 8 to 10 tasks, group them for readability. TaskGroup in Airflow 2.x is the right way to do this.

Putting secrets in DAG code. Use Airflow Connections and Variables, or a secrets backend like AWS Secrets Manager or HashiCorp Vault. Never hardcode credentials.
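For a self-hosted setup, one lightweight option is to define Connections through environment variables: Airflow reads a connection with ID warehouse from a variable named AIRFLOW_CONN_WAREHOUSE that holds a URI. The sketch below builds such a URI at deploy time, outside any DAG file. The helper name, the host, the user, and the WAREHOUSE_PASSWORD variable are all hypothetical, and the fallback password exists only so the example runs.

```python
import os
from urllib.parse import quote, unquote, urlsplit

def build_conn_uri(conn_type, login, password, host, port, schema):
    """Build a connection URI, percent-encoding characters that would break parsing."""
    return (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(schema, safe='')}"
    )

# The real password comes from the deployment environment, never from the repo.
# The fallback value is a placeholder for this demo only.
password = os.environ.get("WAREHOUSE_PASSWORD", "example-p@ss/word")

uri = build_conn_uri("postgres", "etl_user", password, "db.internal", 5432, "analytics")
os.environ["AIRFLOW_CONN_WAREHOUSE"] = uri

# The encoded URI still parses cleanly, and the password round-trips.
parsed = urlsplit(uri)
print(parsed.hostname, parsed.port)            # db.internal 5432
print(unquote(parsed.password) == password)    # True
```

DAG code then refers only to the connection ID, so the credential never appears in the repository.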

When You've Outgrown This Setup

The setup above (managed Airflow with a clear project structure and basic monitoring) will carry a small data team through a lot of growth. We've used this pattern to run 130+ pipelines for a multinational organization.

When you start hitting limits (complex dependency graphs, mixed compute environments, dozens of engineers writing DAGs), you'll want to invest more in infrastructure. That's a good problem to have. It means your data platform has become essential to the business.

Don't Know Where to Start?

Setting up Airflow is straightforward when you know what you're doing. When you're not sure whether you need Airflow at all, what to run it on, or how to structure your first DAGs, the starting point is understanding your actual pipeline needs.
