Terraform, Terragrunt & Atlantis: A Production-Grade IaC Workflow
Infrastructure as Code at scale comes with a familiar set of problems: duplicated module calls across environments, state file collisions, no PR-based review for infra changes, and sprawling root modules that nobody dares touch. This post walks through how I structure Terraform modules, layer Terragrunt on top to eliminate repetition, and wire Atlantis into GitHub pull requests so that infrastructure changes go through the same review workflow as application code.
The Problem with Vanilla Terraform at Scale
A single Terraform workspace works fine for a small project. As soon as you have multiple environments (dev, staging, prod) and multiple AWS accounts, you hit the same walls:
- DRY violation: you copy backend.tf, provider.tf, and module calls into every environment directory, and the copies slowly drift apart
- State isolation: one mistake in a shared workspace can destroy prod
- No guardrails: anyone with credentials can run terraform apply locally
- No audit trail: you cannot tell from git who applied what and when
Terragrunt solves the first two. Atlantis solves the last two.
Part 1: Terraform Module Structure
A clean module structure separates reusable modules from environment-specific root modules.
infrastructure/
├── modules/                  # reusable building blocks
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── ecs-service/
└── live/                     # environment root modules (consumed by Terragrunt)
    ├── _envcommon/           # shared variable definitions
    ├── dev/
    │   ├── account.hcl
    │   └── us-east-1/
    │       ├── region.hcl
    │       ├── vpc/
    │       ├── eks/
    │       └── rds/
    ├── staging/
    └── prod/
Writing a reusable module
Every module under modules/ follows the same layout:
modules/ecs-service/
├── main.tf
├── variables.tf
├── outputs.tf
└── versions.tf
versions.tf — pin providers so modules do not break on consumer upgrades:
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
variables.tf — explicit types and descriptions, no defaults for values that differ per environment:
variable "cluster_name" {
type = string
description = "Name of the ECS cluster to deploy the service into"
}
variable "service_name" {
type = string
description = "ECS service name, used as a prefix for all resources"
}
variable "container_image" {
type = string
description = "Full ECR image URI including tag"
}
variable "desired_count" {
type = number
description = "Desired number of running tasks"
default = 2
}
variable "cpu" {
type = number
description = "Task CPU units (256, 512, 1024, 2048, 4096)"
default = 256
}
variable "memory" {
type = number
description = "Task memory in MiB"
default = 512
}
variable "environment_variables" {
type = map(string)
description = "Non-secret environment variables injected into the container"
default = {}
}
variable "secrets" {
type = map(string)
description = "Map of env var name to SSM Parameter Store ARN"
default = {}
}
variable "target_group_arn" {
type = string
description = "ALB target group ARN for load balancer attachment"
}
variable "subnets" {
type = list(string)
description = "Subnet IDs the ECS tasks run in"
}
variable "security_groups" {
type = list(string)
description = "Security group IDs attached to the ECS tasks"
}
variable "tags" {
type = map(string)
description = "Tags applied to all resources"
default = {}
}
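The description of cpu already lists the accepted task sizes, so one option is to enforce them with a validation block. A bad value then fails at plan time instead of at the ECS API. The following is a sketch of how the existing cpu variable could be written, not an additional variable; the error message wording is my own.

```hcl
# Variant of the "cpu" variable above with a validation rule.
# The allowed list mirrors the values named in the description; extend it if
# your Fargate platform version supports larger task sizes.
variable "cpu" {
  type        = number
  description = "Task CPU units (256, 512, 1024, 2048, 4096)"
  default     = 256

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "The cpu value must be one of 256, 512, 1024, 2048, or 4096."
  }
}
```

The same pattern could be applied to memory, although the valid memory range depends on the chosen cpu value, which makes the rule harder to express in a single variable.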
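main.tf holds the actual resources, composed from inputs only (the IAM roles aws_iam_role.execution and aws_iam_role.task are referenced here, but their definitions are omitted from this listing):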
resource "aws_ecs_task_definition" "this" {
family = var.service_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = aws_iam_role.execution.arn
task_role_arn = aws_iam_role.task.arn
container_definitions = jsonencode([{
name = var.service_name
image = var.container_image
environment = [
for k, v in var.environment_variables : { name = k, value = v }
]
secrets = [
for k, v in var.secrets : { name = k, valueFrom = v }
]
portMappings = [{
containerPort = 8080
protocol = "tcp"
}]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/${var.service_name}"
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = "ecs"
}
}
}])
tags = var.tags
}
resource "aws_ecs_service" "this" {
name = var.service_name
cluster = var.cluster_name
task_definition = aws_ecs_task_definition.this.arn
desired_count = var.desired_count
launch_type = "FARGATE"
network_configuration {
subnets = var.subnets
security_groups = var.security_groups
assign_public_ip = false
}
load_balancer {
target_group_arn = var.target_group_arn
container_name = var.service_name
container_port = 8080
}
lifecycle {
ignore_changes = [desired_count]
}
tags = var.tags
}
resource "aws_cloudwatch_log_group" "this" {
name = "/ecs/${var.service_name}"
retention_in_days = 30
tags = var.tags
}
data "aws_region" "current" {}
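The task definition refers to aws_iam_role.execution and aws_iam_role.task, which the listing does not define. A minimal sketch of what they might look like follows. The role names, the file name, and the inline policy for reading SSM parameters are assumptions on my part rather than part of the original module.

```hcl
# modules/ecs-service/iam.tf (hypothetical file name)

# Trust policy that lets ECS tasks assume both roles.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by the ECS agent to pull the image and write logs.
resource "aws_iam_role" "execution" {
  name               = "${var.service_name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
  tags               = var.tags
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Lets the execution role resolve the SSM parameters listed in var.secrets.
# Only created when at least one secret is configured.
resource "aws_iam_role_policy" "execution_secrets" {
  count = length(var.secrets) > 0 ? 1 : 0

  name = "${var.service_name}-read-secrets"
  role = aws_iam_role.execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["ssm:GetParameters"]
      Resource = values(var.secrets)
    }]
  })
}

# Task role: the identity the application code itself runs as.
# Starts with no permissions; attach policies per service as needed.
resource "aws_iam_role" "task" {
  name               = "${var.service_name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
  tags               = var.tags
}
```

If the parameters are SecureString values encrypted with a customer-managed KMS key, the execution role would also need kms:Decrypt on that key.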
outputs.tf — expose what downstream modules need:
output "service_name" {
value = aws_ecs_service.this.name
}
output "task_definition_arn" {
value = aws_ecs_task_definition.this.arn
}
output "log_group_name" {
value = aws_cloudwatch_log_group.this.name
}
Module versioning with git tags
Pin modules by git tag, not by path, so environments can upgrade independently:
module "ecs_service" {
source = "git::https://github.com/your-org/terraform-modules.git//modules/ecs-service?ref=v1.4.2"
# ...
}
Part 2: Terragrunt — DRY Environments
Terragrunt wraps Terraform and solves three things: backend configuration inheritance, provider inheritance, and dependency graphs between stacks.
Root terragrunt.hcl
Place this at the top of live/:
# live/terragrunt.hcl
locals {
account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
region_vars = read_terragrunt_config(find_in_parent_folders("region.hcl"))
account_id = local.account_vars.locals.account_id
account_name = local.account_vars.locals.account_name
aws_region = local.region_vars.locals.aws_region
}
# Generate backend config for every child module automatically
generate "backend" {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
terraform {
backend "s3" {
bucket = "my-org-terraform-state-${local.account_id}"
key = "${path_relative_to_include()}/terraform.tfstate"
region = "${local.aws_region}"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
EOF
}
# Generate provider block for every child module
generate "provider" {
path = "provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
provider "aws" {
region = "${local.aws_region}"
assume_role {
role_arn = "arn:aws:iam::${local.account_id}:role/TerraformDeployRole"
}
default_tags {
tags = {
ManagedBy = "terraform"
Environment = "${local.account_name}"
}
}
}
EOF
}
Account and region files
# live/prod/account.hcl
locals {
account_id = "123456789012"
account_name = "prod"
}
# live/prod/us-east-1/region.hcl
locals {
aws_region = "us-east-1"
}
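To make the interpolation concrete: for a stack at live/prod/us-east-1/ecs-service/api, the generate "backend" block in the root file should render a backend.tf roughly like the one below. I worked this out by hand from the prod account and region values above, assuming the root terragrunt.hcl sits directly in live/ so that path_relative_to_include() yields prod/us-east-1/ecs-service/api. Terragrunt also prepends its own signature comment to generated files, which is omitted here.

```hcl
# backend.tf as it would be generated for live/prod/us-east-1/ecs-service/api
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state-123456789012"
    key            = "prod/us-east-1/ecs-service/api/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```

Because the key is derived from the directory path, moving or renaming a stack directory changes its state key, so the state has to be migrated along with it.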
A leaf module’s terragrunt.hcl
# live/prod/us-east-1/ecs-service/api/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
include "envcommon" {
path = "${dirname(find_in_parent_folders())}/_envcommon/ecs-service.hcl"
expose = true
}
# Declare dependency on ECS cluster
dependency "cluster" {
config_path = "../cluster"
mock_outputs = {
cluster_name = "mock-cluster"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
dependency "vpc" {
config_path = "../../vpc"
mock_outputs = {
private_subnet_ids = ["subnet-00000000"]
app_security_group_id = "sg-00000000"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
dependency "alb" {
config_path = "../alb"
mock_outputs = {
api_target_group_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/mock/abc"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
terraform {
source = "git::https://github.com/your-org/terraform-modules.git//modules/ecs-service?ref=v1.4.2"
}
inputs = merge(
include.envcommon.locals.common_inputs,
{
cluster_name = dependency.cluster.outputs.cluster_name
service_name = "api"
container_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest"
desired_count = 3
cpu = 512
memory = 1024
target_group_arn = dependency.alb.outputs.api_target_group_arn
subnets = dependency.vpc.outputs.private_subnet_ids
security_groups = [dependency.vpc.outputs.app_security_group_id]
environment_variables = {
APP_ENV = "production"
LOG_LEVEL = "info"
}
secrets = {
DATABASE_URL = "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api/database_url"
API_KEY = "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api/api_key"
}
}
)
Shared envcommon
# live/_envcommon/ecs-service.hcl
locals {
account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
common_inputs = {
tags = {
Team = "platform"
CostCenter = "engineering"
}
}
}
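The envcommon file can also carry the module location, so that leaf files repeat only the version pin. This is a sketch of that variant; base_source_url is a name chosen here for illustration, and the rest mirrors the file above.

```hcl
# live/_envcommon/ecs-service.hcl (variant; base_source_url is a hypothetical name)
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))

  # Module location without a ref, so each leaf pins its own version.
  base_source_url = "git::https://github.com/your-org/terraform-modules.git//modules/ecs-service"

  common_inputs = {
    tags = {
      Team       = "platform"
      CostCenter = "engineering"
    }
  }
}

# In a leaf terragrunt.hcl, with the envcommon include exposed as before:
#
# terraform {
#   source = "${include.envcommon.locals.base_source_url}?ref=v1.4.2"
# }
```

With this layout, promoting a module version from dev to prod is a one-line change to the ref in the relevant leaf file.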
Running across all environments
# Plan everything under prod
terragrunt run-all plan --terragrunt-working-dir live/prod
# Apply a single stack
terragrunt apply --terragrunt-working-dir live/prod/us-east-1/ecs-service/api
# Destroy in dependency order (reverse)
terragrunt run-all destroy --terragrunt-working-dir live/dev
Terragrunt builds a directed acyclic graph from dependency blocks and applies stacks in the correct order automatically.
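A dependency block both orders stacks and passes outputs between them. When a stack only needs to run after another one and reads none of its outputs, Terragrunt also provides a dependencies block. Here is a small sketch, assuming (hypothetically) that the api stack must wait for the rds stack from the directory layout:

```hcl
# live/prod/us-east-1/ecs-service/api/terragrunt.hcl (fragment)

# Ordering-only dependency: run-all applies ../../rds before this stack and
# destroys it after this stack, but no outputs are read from it.
dependencies {
  paths = ["../../rds"]
}
```

No mock_outputs are needed in this case, since nothing from the other stack is interpolated into inputs.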
Part 3: Atlantis — PR-Driven Infrastructure
Atlantis is a self-hosted Go service that listens to GitHub webhooks and runs terraform plan / terraform apply in response to pull request comments. Infrastructure changes go through the same review process as code.
How it works
Developer opens PR
│
▼
GitHub webhook → Atlantis
│
▼
atlantis plan ← posted as PR comment
│
▼
Reviewer approves PR
│
▼
atlantis apply ← comment on PR
│
▼
Terraform apply runs, output posted back to PR
│
▼
PR merged
Deploying Atlantis on ECS Fargate
# modules/atlantis/main.tf (simplified)
resource "aws_ecs_task_definition" "atlantis" {
family = "atlantis"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = 1024
memory = 2048
execution_role_arn = aws_iam_role.execution.arn
task_role_arn = aws_iam_role.atlantis_task.arn
container_definitions = jsonencode([{
name = "atlantis"
image = "ghcr.io/runatlantis/atlantis:v0.28.0"
environment = [
{ name = "ATLANTIS_REPO_ALLOWLIST", value = "github.com/your-org/*" },
{ name = "ATLANTIS_GH_USER", value = "atlantis-bot" },
{ name = "ATLANTIS_REPO_CONFIG_FILE", value = "/atlantis.yaml" },
{ name = "ATLANTIS_TERRAFORM_VERSION", value = "1.7.0" },
{ name = "ATLANTIS_PORT", value = "4141" },
]
secrets = [
{ name = "ATLANTIS_GH_TOKEN", valueFrom = aws_ssm_parameter.gh_token.arn },
{ name = "ATLANTIS_GH_WEBHOOK_SECRET", valueFrom = aws_ssm_parameter.webhook_secret.arn },
{ name = "ATLANTIS_ATLANTIS_URL", valueFrom = aws_ssm_parameter.atlantis_url.arn },
]
portMappings = [{ containerPort = 4141 }]
mountPoints = [{
sourceVolume = "atlantis-data"
containerPath = "/atlantis"
}]
}])
volume {
name = "atlantis-data"
efs_volume_configuration {
file_system_id = aws_efs_file_system.atlantis.id
root_directory = "/"
}
}
}
# The task role only needs permission to assume the deploy role in each account;
# the broad infrastructure permissions live on TerraformDeployRole itself
resource "aws_iam_role_policy" "atlantis_task" {
name = "atlantis-task-policy"
role = aws_iam_role.atlantis_task.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["sts:AssumeRole"]
Resource = [
"arn:aws:iam::*:role/TerraformDeployRole"
]
}]
})
}
Each AWS account has a TerraformDeployRole that trusts the Atlantis task role to assume it — no static credentials needed.
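The wildcard on the account ID in the policy above lets the task role attempt to assume a TerraformDeployRole in any account. If you prefer an explicit list, the policy could be written along these lines. The deploy_account_ids variable is an assumption introduced for this sketch, and the resource is a variant of the one above rather than an addition to it.

```hcl
# Hypothetical input: the account IDs Atlantis is allowed to deploy into.
variable "deploy_account_ids" {
  type        = list(string)
  description = "AWS account IDs whose TerraformDeployRole Atlantis may assume"
}

resource "aws_iam_role_policy" "atlantis_task" {
  name = "atlantis-task-policy"
  role = aws_iam_role.atlantis_task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["sts:AssumeRole"]
      # One ARN per listed account instead of a wildcard on the account ID.
      Resource = [
        for id in var.deploy_account_ids :
        "arn:aws:iam::${id}:role/TerraformDeployRole"
      ]
    }]
  })
}
```

The trust policy on each TerraformDeployRole still decides whether the assumption succeeds, so this narrows the Atlantis side only.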
atlantis.yaml — repo-level config
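This file lives at the root of the repository that contains live/ and tells Atlantis how to map directories to projects. One caveat: the repos: block belongs to Atlantis's server-side repo config (the file passed via --repo-config), not to the repo-level atlantis.yaml, so in a real deployment those keys go in a separate file on the server. They are shown together here for brevity: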
version: 3
automerge: false
delete_source_branch_on_merge: false
repos:
  - id: github.com/your-org/infrastructure
    apply_requirements:
      - approved
      - mergeable
    workflow: terragrunt
    allowed_overrides:
      - workflow
      - apply_requirements
    allow_custom_workflows: true
workflows:
  terragrunt:
    plan:
      steps:
        - env:
            name: TERRAGRUNT_TFPATH
            command: "which terraform"
        - run: terragrunt plan -no-color -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply -no-color $PLANFILE
projects:
  - name: prod-us-east-1-vpc
    dir: live/prod/us-east-1/vpc
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "*.hcl"
        - "*.tf"
      enabled: true
  - name: prod-us-east-1-eks
    dir: live/prod/us-east-1/eks
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "*.hcl"
        - "*.tf"
      enabled: true
  - name: prod-us-east-1-api
    dir: live/prod/us-east-1/ecs-service/api
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "*.hcl"
        - "*.tf"
      enabled: true
With autoplan enabled, Atlantis runs terragrunt plan automatically when a PR modifies files in that directory. The plan output is posted as a PR comment.
The PR workflow in practice
# Open a PR that changes the ECS service desired_count from 2 to 3
# (the diff below only appears if the module does not list desired_count under ignore_changes)
atlantis plan -p prod-us-east-1-api ← auto or manual trigger
# Atlantis replies:
# Plan: 0 to add, 1 to change, 0 to destroy.
# ~ aws_ecs_service.this
# desired_count: 2 -> 3
# After approval:
atlantis apply -p prod-us-east-1-api
# Atlantis replies:
# Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
All plan and apply output is stored as PR comments — a full audit trail with who triggered it and when.
Locking
Atlantis locks a project's directory and workspace as soon as a plan runs, and holds the lock until the pull request is merged or closed, or until the lock is discarded. This prevents two pull requests from planning and applying conflicting changes to the same project at the same time:
atlantis unlock # discard all plans and locks for this PR
# to release the lock for a single project, use the Atlantis web UI
Part 4: State Management Best Practices
Bootstrap the S3 backend before anything else
# bootstrap/main.tf — apply this once manually before Terragrunt takes over
resource "aws_s3_bucket" "state" {
bucket = "my-org-terraform-state-${var.account_id}"
}
resource "aws_s3_bucket_versioning" "state" {
bucket = aws_s3_bucket.state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
bucket = aws_s3_bucket.state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket_public_access_block" "state" {
bucket = aws_s3_bucket.state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_dynamodb_table" "locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
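The bootstrap configuration references var.account_id without declaring it. A minimal declaration might look like this; the file name and the validation rule are my additions, on the assumption that the value is always a plain 12-digit account ID.

```hcl
# bootstrap/variables.tf (hypothetical file name)
variable "account_id" {
  type        = string
  description = "12-digit AWS account ID, used as the state bucket name suffix"

  validation {
    # Typed as string so that IDs with leading zeros are preserved.
    condition     = can(regex("^[0-9]{12}$", var.account_id))
    error_message = "The account_id value must be exactly 12 digits."
  }
}
```

The resulting bucket name has to match the one interpolated in the root terragrunt.hcl (my-org-terraform-state-<account_id>), otherwise Terragrunt will point at a bucket that does not exist.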
State isolation strategy
| Level | State file path | What it contains |
|---|---|---|
| Network | prod/us-east-1/vpc/terraform.tfstate | VPC, subnets, route tables |
| Cluster | prod/us-east-1/eks/terraform.tfstate | EKS cluster, node groups |
| Service | prod/us-east-1/ecs-service/api/terraform.tfstate | ECS service, task def, IAM |
| Data | prod/us-east-1/rds/terraform.tfstate | RDS instance, parameter groups |
Separate state files mean a botched apply in one service cannot corrupt another.
Part 5: Security Considerations
Never use long-lived credentials
Atlantis assumes an IAM role per account using sts:AssumeRole. No AWS access keys live in environment variables or SSM — the task role is the only identity.
# In each account, create a role Atlantis can assume
resource "aws_iam_role" "terraform_deploy" {
name = "TerraformDeployRole"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
AWS = "arn:aws:iam::ATLANTIS_ACCOUNT_ID:role/atlantis-task"
}
Action = "sts:AssumeRole"
}]
})
}
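The ATLANTIS_ACCOUNT_ID placeholder in the trust policy has to be filled in somehow. One way is to pass it as a variable, sketched below as a variant of the role above. The atlantis_account_id variable is a name I am introducing for illustration, and atlantis-task is assumed to be the actual name of the Atlantis task role.

```hcl
# Hypothetical input: the account where the Atlantis ECS task runs.
variable "atlantis_account_id" {
  type        = string
  description = "AWS account ID that hosts the Atlantis task role"
}

# Trust policy: only the Atlantis task role may assume the deploy role.
data "aws_iam_policy_document" "terraform_deploy_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.atlantis_account_id}:role/atlantis-task"]
    }
  }
}

resource "aws_iam_role" "terraform_deploy" {
  name               = "TerraformDeployRole"
  assume_role_policy = data.aws_iam_policy_document.terraform_deploy_trust.json
}
```

As written, the role grants no permissions by itself. Whatever policy defines what Terraform may manage in the account still needs to be attached to it.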
Enforce apply_requirements in Atlantis
apply_requirements:
- approved # at least one approval required
- mergeable # no merge conflicts, all status checks green
This ensures no one can apply infrastructure changes without a code review.
Webhook secret validation
Atlantis validates the HMAC signature on every incoming GitHub webhook. Store the secret in SSM Parameter Store and inject it as an ECS secret — never in plain text.
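For reference, the aws_ssm_parameter.webhook_secret resource that the Atlantis task definition points at could be defined roughly as follows. Generating the value with random_password and the parameter path are assumptions of this sketch; you may equally create the parameter out of band.

```hcl
# Requires the hashicorp/random provider in addition to hashicorp/aws.
resource "random_password" "webhook_secret" {
  length  = 32
  special = false
}

resource "aws_ssm_parameter" "webhook_secret" {
  name  = "/atlantis/webhook_secret" # hypothetical parameter path
  type  = "SecureString"
  value = random_password.webhook_secret.result
}
```

Two things to keep in mind with this approach: the generated value is also written to the Terraform state, so the state bucket's encryption and access controls matter, and the same value has to be configured as the secret on the GitHub webhook.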
Putting It All Together
The complete workflow for shipping an infrastructure change:
- Branch off main, edit the relevant terragrunt.hcl under live/
- Open a PR: Atlantis detects the changed directory via autoplan and runs terragrunt plan
- Review: teammates review both the code diff and the Terraform plan output in the PR
- Approve the PR
- Comment atlantis apply -p <project>: Atlantis runs the apply and posts output
- Merge the PR: git history now records the change with context
No one ever runs terraform apply from a laptop. Every change is planned, reviewed, applied, and logged through pull requests.
Conclusion
Vanilla Terraform works until you have more than one environment and more than one engineer. Terragrunt eliminates the boilerplate of managing backends, providers, and repeated module calls across environments. Atlantis closes the loop by making pull requests the single point of control for all infrastructure changes — adding the same review discipline to infra that you already have for application code.
The combination gives you: immutable module versions, isolated state per stack, dependency-ordered applies, and a full audit trail in GitHub with zero manual terraform apply runs outside of CI.