Infrastructure as Code: Why Your ML System Needs to Be Disposable

The best production systems are the ones you can destroy without fear.
This sounds counterintuitive. You spend weeks building a machine learning pipeline, configuring servers, tuning parameters, and getting everything to work. Why would you want to destroy it?
Because if you can't rebuild it automatically, you don't really control it.
The Snowflake Server Problem
Picture this: An ML engineer spends three months getting a demand prediction model running perfectly on a cloud server. She installs packages, configures CUDA, adjusts firewall rules, tweaks memory settings, and eventually achieves beautiful, fast inference. The system hums along serving predictions.
Six months later, something crashes. Or the company needs a second instance for a new region. Or that engineer left and nobody documented what she did.
Someone tries to recreate the environment. They click through cloud console menus, install what they think are the right packages, copy configuration files. Three hours later, they have something that's almost right. The model loads, predictions run, but they're mysteriously 30% slower and occasionally time out under load.
What went wrong? Maybe the CUDA version is 11.7 instead of 11.8. Maybe a Python dependency got a minor update that changed behavior. Maybe a security group rule is slightly different. The server is a snowflake—unique, fragile, impossible to reproduce exactly.
Production ML systems can't survive on snowflakes.
Infrastructure as Code: Treating Servers Like Software
Infrastructure as Code (IaC) means defining your entire computing environment in text files that can be executed automatically. Not documentation about what buttons to click. Not a wiki page listing commands someone once ran. Actual executable code that provisions servers, configures networking, and sets up environments deterministically.
For ML systems, this is critical because they have weird infrastructure requirements:
GPUs for inference: When request volume spikes, you need hardware acceleration to keep latency acceptable
Specific networking: Your model needs to receive requests and serve predictions with tight SLA requirements
Memory configurations: Feature engineering can be memory-intensive, especially for real-time systems
Autoscaling: Demand patterns are spiky; you need to go from idle to thousands of requests per second
When this infrastructure is code:
It can be reviewed in pull requests like any other code
It can be versioned and rolled back if something breaks
It can be executed identically across dev, staging, and production
It can be destroyed and recreated without tribal knowledge
No clicking. No "I think I remember what I did last time." No single person who holds all the knowledge about how production is configured.
Enter Terraform
Terraform is the industry standard for infrastructure as code. You write declarative configuration files describing what infrastructure you want, and Terraform figures out how to create it. It works across all major cloud providers (AWS, GCP, Azure) with a consistent syntax.
Here's what a basic Terraform configuration looks like for provisioning an ML inference server:
```hcl
# Which cloud provider
provider "google" {
  project = "ml-demand-predictor"
  region  = "us-central1"
}

# The compute instance itself
resource "google_compute_instance" "ml_server" {
  name         = "demand-predictor-prod"
  machine_type = "n1-standard-4"
  zone         = "us-central1-a"

  # Attach a GPU for fast inference
  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # GPU instances cannot live-migrate, so host maintenance must terminate them
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  # Boot disk with ML-optimized image
  boot_disk {
    initialize_params {
      image = "deeplearning-platform-release/pytorch-latest-gpu"
      size  = 100 # GB
    }
  }

  # Network configuration
  network_interface {
    network = "default"
    access_config {} # External IP
  }

  # Tags for firewall rules
  tags = ["ml-inference"]
}

# Firewall: allow API traffic
resource "google_compute_firewall" "allow_api" {
  name    = "allow-ml-inference"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["8000"] # FastAPI default
  }

  source_ranges = ["0.0.0.0/0"] # Any source address
  target_tags   = ["ml-inference"]
}
```
This looks verbose compared to clicking buttons in a console, but consider what you've gained:
Version Control: This file lives in Git. You can see who changed what and when. You can diff between versions. You can roll back to last week's infrastructure if today's breaks something.
Code Review: Infrastructure changes go through pull requests. A teammate can spot that you're opening port 8000 to the entire internet and suggest restricting it to your VPC.
Documentation: The configuration is the documentation. It's always up to date because it's literally what builds the infrastructure.
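Reproducibility: Run terraform apply and you get this infrastructure every time, with no manual variation. One caveat: an image family such as pytorch-latest-gpu resolves to whichever image is newest when the instance is created, so pin a specific image if you need servers that are identical months apart. And if someone changes a resource by hand, the next terraform plan shows the drift.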
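The restriction a reviewer might ask for in the code-review example could look something like the sketch below. It assumes the API is only called from inside the default VPC, whose auto-mode subnets fall within 10.128.0.0/9. The variable name allowed_source_ranges is made up for illustration, and you would substitute the ranges of your own network.

```hcl
# Hypothetical variable: CIDR ranges allowed to call the inference API.
variable "allowed_source_ranges" {
  description = "CIDR ranges permitted to reach the inference API"
  type        = list(string)

  # Assumption: callers live in the auto-mode default VPC (10.128.0.0/9).
  default = ["10.128.0.0/9"]
}

# Same firewall rule as before, but no longer open to the whole internet.
resource "google_compute_firewall" "allow_api" {
  name    = "allow-ml-inference"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["8000"]
  }

  source_ranges = var.allowed_source_ranges
  target_tags   = ["ml-inference"]
}
```

Keeping the ranges in a variable means dev and production can use different values without editing the rule itself.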
The Terraform Workflow
Terraform has a simple, powerful workflow:
1. Initialize
```bash
terraform init
```
This downloads provider plugins and prepares your working directory. Run it once per working directory, and again whenever you change providers, modules, or the backend.
2. Plan
```bash
terraform plan
```
Terraform shows exactly what it will create, modify, or destroy. This is your safety check. Review it carefully before proceeding.
Example output:
```
Terraform will perform the following actions:

  # google_compute_instance.ml_server will be created
  + resource "google_compute_instance" "ml_server" {
      + name         = "demand-predictor-prod"
      + machine_type = "n1-standard-4"
      + zone         = "us-central1-a"
      ...
    }

Plan: 2 to add, 0 to change, 0 to destroy.
```
3. Apply
```bash
terraform apply
```
Terraform asks for confirmation, then builds everything. Watch resources being created in real time. When it finishes, your infrastructure is live.
4. Destroy (When Needed)
```bash
terraform destroy
```
Tears down everything Terraform created. Useful for development environments you only need temporarily, or for the ultimate reproducibility test.
Why GPU Instances Matter for ML Inference
The configuration above includes an NVIDIA T4 GPU. Why does an inference server need a GPU?
Consider a real-time demand prediction system for ridesharing. On a normal Tuesday afternoon, you might handle 50 prediction requests per minute. Each request needs:
Feature engineering: combine user location, weather, time, recent ride history
Model inference: run the trained neural network
Postprocessing: format the output, add confidence scores
On CPU, this might take 50-100ms per request. Fine for low volume.
But demand is spiky. When it starts raining during rush hour, requests jump to 5,000 per minute, or roughly 83 per second. Or a concert lets out and everyone in a neighborhood requests rides simultaneously. A single worker spending 50ms on each request tops out at about 20 requests per second, so your CPU inference becomes a bottleneck. Requests queue up, users wait, and complaints roll in.
With a GPU:
Batch multiple requests together for parallel processing
Feature engineering happens in optimized tensor operations
Inference drops from 50ms to 5ms per request
You can handle the spike without users noticing
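The T4 is a sweet spot for inference: enough power to handle serious load, not so expensive that it kills your budget. At roughly $0.35/hour for the GPU on GCP (on-demand prices vary by region and change over time), and with the tenfold per-request speedup described above, it's cheaper to run one GPU instance than to run 10 CPU instances to achieve the same throughput.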
Remote State: Making Infrastructure a Team Sport
Here's a subtle but critical piece of Terraform configuration:
```hcl
terraform {
  backend "gcs" {
    bucket = "company-terraform-state"
    prefix = "ml-systems/demand-predictor"
  }
}
```
This stores Terraform's state file—its record of what infrastructure exists—in a cloud storage bucket instead of on your laptop.
Why does this matter?
Terraform needs to track what resources it created so it knows what to update or destroy later. If this state lives on your laptop and a teammate runs Terraform, their copy has no record of what you already built. It will plan to create everything from scratch, which means duplicate resources or failures on name conflicts.
With remote state:
Everyone on the team sees the same infrastructure state
Terraform can lock the state file so two people can't apply changes simultaneously
The state lives in durable storage, and earlier versions can be recovered if object versioning is enabled on the bucket
New team members can immediately see and modify existing infrastructure
For a solo side project, remote state might feel like overkill. But building production-grade habits from day one means you're ready when the project grows.
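The bucket named in the backend block has to come from somewhere. One way to create it is sketched below, assuming it is managed in a small, separate bootstrap configuration (the bucket must exist before terraform init can use it as a backend). The location is a placeholder; versioning is what makes earlier copies of the state recoverable.

```hcl
# Assumption: this lives in a separate bootstrap configuration, because the
# bucket has to exist before `terraform init` can use it as a backend.
resource "google_storage_bucket" "terraform_state" {
  name     = "company-terraform-state" # Bucket name from the backend block
  location = "US"                      # Placeholder; pick your own location

  uniform_bucket_level_access = true

  # Keep earlier versions of the state file so a bad write can be rolled back
  versioning {
    enabled = true
  }

  # Make Terraform refuse to delete the bucket that holds everyone's state
  lifecycle {
    prevent_destroy = true
  }
}
```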
The Iron-Clad Test: Destroy and Rebuild
Here's how you prove your infrastructure is truly code:
```bash
# Burn it down
terraform destroy
# Rebuild from nothing
terraform apply
# Verify it works: run both checks on the server over SSH
SERVER_IP=$(terraform output -raw server_ip)
# nvidia-smi checks the GPU is attached; curl checks the API is running
ssh "user@${SERVER_IP}" 'nvidia-smi && curl -fsS localhost:8000/health'
```
If you can do this without touching the cloud console, without manual configuration steps, without SSH'ing in to fix things, you've won. Your infrastructure is reproducible. More importantly, it's disposable.
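The verification step reads an output named server_ip, which the configuration shown earlier does not declare. A minimal sketch of that output, assuming the instance keeps the resource name ml_server and a single network interface with an external IP:

```hcl
# Expose the instance's external IP so scripts can read it with
# `terraform output -raw server_ip`.
output "server_ip" {
  description = "External IP of the ML inference server"
  value       = google_compute_instance.ml_server.network_interface[0].access_config[0].nat_ip
}
```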
Being disposable is a superpower. It means:
You can test infrastructure changes without fear
You can spin up temporary environments for load testing
You can recover from disasters by rerunning Terraform
You can scale from 1 region to 10 by changing a parameter
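As a rough illustration of the last point, the instance can be driven by a variable so that adding a location is a one-line change. This is a sketch, not a full multi-region design: the zones variable is hypothetical, it replaces the single-instance resource shown earlier, and load balancing across the resulting servers is not covered.

```hcl
# Hypothetical variable: one inference server is created per zone listed here.
variable "zones" {
  description = "Zones to run an inference server in"
  type        = set(string)
  default     = ["us-central1-a"]
}

resource "google_compute_instance" "ml_server" {
  for_each = var.zones

  # Instance names must be unique, so include the zone in the name
  name         = "demand-predictor-${each.key}"
  machine_type = "n1-standard-4"
  zone         = each.key

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # Required for instances with GPUs attached
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  boot_disk {
    initialize_params {
      image = "deeplearning-platform-release/pytorch-latest-gpu"
      size  = 100
    }
  }

  network_interface {
    network = "default"
    access_config {} # External IP
  }

  tags = ["ml-inference"]
}
```

Adding a second zone to the set and running terraform apply would create a second server and leave the first untouched, provided the new zone offers T4 GPUs.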
Why This Matters for ML Systems Specifically
Traditional web applications are relatively stateless. If a server dies, spin up another and route traffic to it. ML systems are different:
Models are large: A PyTorch model checkpoint can be several gigabytes. You need to think about where it's stored and how it's loaded.
Inference is stateful: Feature engineering often requires recent historical data. Your inference server needs access to databases or caches.
Performance is critical: A web application that takes 200ms instead of 100ms is fine. A prediction that takes 5 seconds instead of 50ms means users abandon requests.
Scaling is complex: You can't just add more servers. You need to think about GPU utilization, batch sizes, request routing.
Infrastructure as Code doesn't solve these problems directly, but it makes them manageable. When your infrastructure is code:
You can test different machine types to optimize cost vs. performance
You can provision instances with the right mix of CPU, memory, and GPU
You can automate the deployment of model artifacts and configuration
You can quickly spin up infrastructure in new regions when latency matters
The Hidden Benefit: Confidence
The real win isn't technical—it's psychological. When your infrastructure is code, you develop confidence.
Confidence that if production catches fire, you can rebuild it in minutes. Confidence that you can hand the project to someone else and they'll get an identical environment. Confidence that scaling from 1 server to 20 is a parameter change, not a crisis.
ML systems are complex enough. The model might drift as data distributions change. Feature pipelines might break when upstream data sources evolve. Predictions might degrade in ways that are hard to detect.
You don't need infrastructure to be another source of mystery and fragility. Make it boring. Make it reproducible. Make it code.
Getting Started
If you're building an ML system and still clicking buttons in cloud consoles, here's your homework:
Pick a tool: Terraform is the industry standard, but Pulumi (code-based) or CloudFormation (AWS-specific) work too.
Start small: Define one compute instance. Get it working. Destroy it and rebuild it.
Add networking: Configure firewalls, load balancers, VPCs. Make them code.
Automate provisioning: Write startup scripts that install dependencies and configure the environment.
Set up remote state: Store your Terraform state in a cloud bucket, not your laptop.
Test destruction: Regularly run terraform destroy and terraform apply to verify everything rebuilds correctly.
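For the "automate provisioning" step, one option on GCP is the instance's startup script. The sketch below is illustrative only: the bucket gs://example-ml-artifacts and the /opt/app paths are hypothetical, and it assumes the Deep Learning VM image, which ships with gsutil and installs the NVIDIA driver when the install-nvidia-driver metadata key is set.

```hcl
resource "google_compute_instance" "ml_server" {
  name         = "demand-predictor-prod"
  machine_type = "n1-standard-4"
  zone         = "us-central1-a"

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  boot_disk {
    initialize_params {
      image = "deeplearning-platform-release/pytorch-latest-gpu"
      size  = 100
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }

  # Assumption: Deep Learning VM images read this key to install the driver
  metadata = {
    install-nvidia-driver = "True"
  }

  # Runs on every boot, so keep the script idempotent.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    mkdir -p /opt/app
    # Hypothetical bucket holding a requirements file with pinned versions
    gsutil cp gs://example-ml-artifacts/requirements.txt /opt/app/requirements.txt
    python3 -m pip install -r /opt/app/requirements.txt
  EOT

  tags = ["ml-inference"]
}
```

Pinning exact versions in that requirements file addresses the "minor update that changed behavior" failure from the snowflake story.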
You'll know you've succeeded when destroying your production infrastructure feels calm instead of terrifying. Because you know you can rebuild it, exactly as it was, in under 10 minutes.
That's the power of disposable infrastructure.
