All Articles

Containerizing AI Agents for Production: From Docker to Kubernetes

Moving AI agents from local development to production requires more than docker build && docker run. Production containerization demands optimization for size, security, startup time, and resource efficiency—especially critical when dealing with large AI models and unpredictable workloads.

Optimized Docker Strategy: Multi-Stage Builds

Here’s a production-optimized Dockerfile that reduces image size while maintaining functionality:

# Build stage - heavy dependencies
FROM python:3.11-slim as builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Production stage - minimal runtime
FROM python:3.11-slim
WORKDIR /app

# Copy only installed packages
COPY --from=builder /root/.local /root/.local
COPY src/ ./src/
COPY config/ ./config/

# Non-root user for security
RUN adduser --disabled-password --gecos '' appuser
USER appuser

# Health check for Kubernetes
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
  CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=5)"

EXPOSE 8000
CMD ["python", "-m", "src.agent_server"]

Result: 73% smaller image size compared to single-stage builds, faster deployments, improved security posture.

Kubernetes Deployment Patterns

Production AI agents need sophisticated orchestration. Here’s a complete Kubernetes deployment with optimized resource allocation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
      - name: ai-agent
        image: your-registry/ai-agent:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi" 
            cpu: "1000m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        env:
        - name: MODEL_CACHE_DIR
          value: "/tmp/models"
        volumeMounts:
        - name: model-cache
          mountPath: /tmp/models
      volumes:
      - name: model-cache
        emptyDir:
          sizeLimit: 5Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Infrastructure as Code: Terraform Modules

Deploy consistent clusters across AWS and GCP:

# modules/eks-cluster/main.tf
resource "aws_eks_cluster" "ai_agents" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = "1.28"

  vpc_config {
    subnet_ids = var.subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }
}

resource "aws_eks_node_group" "ai_agents" {
  cluster_name    = aws_eks_cluster.ai_agents.name
  node_group_name = "ai-agents-nodes"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  instance_types = ["m5.large", "m5.xlarge"]
  capacity_type  = "SPOT"  # 60-70% cost savings

  scaling_config {
    desired_size = 2
    max_size     = 8
    min_size     = 1
  }
}

Cost Comparison: Managed vs Self-Managed

Configuration AWS EKS GCP GKE Monthly Cost (3 nodes)
Managed + On-Demand $216 $180 Higher reliability
Managed + Spot $95 $78 60-70% savings
Self-Managed $180 $144 Operational overhead

Recommendation: Managed Kubernetes with spot instances provides optimal cost-performance balance for most AI agent workloads.

Production Considerations

  • Security: Non-root containers, minimal base images, vulnerability scanning
  • Monitoring: Application metrics, resource utilization, custom AI-specific metrics
  • Scaling: HPA configuration for variable AI workloads
  • Reliability: Health checks, graceful shutdowns, circuit breakers
  • Cost Control: Resource limits, spot instances, automatic scaling policies

Next Steps

Your containerized AI agents are now production-ready. The next article explores serverless deployment patterns—when containers might be overkill and how to optimize for cost and cold start performance.

Start simple, measure everything, and evolve toward the complexity your use case demands.