
Kubernetes at Scale: Lessons from Running 200+ Microservices

January 28, 2025
9 min read
Amar Sohail
Kubernetes · Docker · Cloud Native · Microservices · CI/CD Pipelines · AWS · Terraform · Distributed Systems

TL;DR

When I tell people we run over 200 microservices on Kubernetes, the first question is usually "why that many?" The honest answer is that it did not start at 200. It started at 12, then grew organically as teams shipped features, and by the time we looked up, we had a sprawling distributed system with its own gravitational pull. This post is about the lessons we learned making that system reliable — and the mistakes that taught us those lessons the hard way.


The Cluster Architecture

We run on AWS EKS across three availability zones. The infrastructure is fully managed through Terraform, and at this point our IaC repository is one of the most critical codebases in the company. Our cluster topology looks roughly like this:

  • 3 node groups: general workloads (m6i.xlarge), memory-intensive services (r6i.2xlarge), and GPU nodes for our ML inference workloads (g5.xlarge)
  • Karpenter for node autoscaling, replacing the older Cluster Autoscaler
  • Istio as the service mesh
  • ArgoCD for GitOps-based deployments
  • Prometheus + Grafana + Loki for the observability stack

# Terraform EKS module configuration (simplified)
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "prod-main"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m6i.xlarge"]
      min_size       = 6
      max_size       = 30
      desired_size   = 12

      labels = {
        workload-type = "general"
      }
    }

    memory_optimized = {
      instance_types = ["r6i.2xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 4

      labels = {
        workload-type = "memory-intensive"
      }

      taints = [{
        key    = "workload-type"
        value  = "memory-intensive"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

This setup supports roughly 800-1200 pods at steady state, spiking to 2000+ during peak traffic.

Resource Requests and Limits: The Silent Killer

If I could go back and fix one thing from day one, it would be enforcing sane resource requests and limits. For the first year, most teams set their resource requests based on vibes. Services would request 2 CPU cores and 4Gi of memory because those were round numbers, not because anyone profiled the application.

The result: massive overprovisioning. We were paying for 3x the compute we actually needed. Pods were spread across too many nodes because Kubernetes thought they were hungrier than they were, and the scheduler could not bin-pack efficiently.

We fixed this in three steps:

Step 1: Profile Everything with VPA

We deployed the Vertical Pod Autoscaler in recommendation mode across every namespace:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no auto-updates
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi

After two weeks of data collection, we had actual usage profiles for every service. The gap between requested and actual usage was staggering — most services used less than 20% of their requested CPU.

Step 2: Right-size and Enforce

We wrote a script that generated new resource manifests based on VPA recommendations, adding a 30% buffer above the P99 usage. Then we added an OPA Gatekeeper policy that rejected any deployment where requests exceeded twice the VPA recommendation without an explicit exemption.
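To make the rule concrete, here is what a right-sized manifest looks like after that pass. This is an illustrative sketch, not one of our real services: the numbers assume a hypothetical VPA P99 of roughly 380m CPU and 900Mi memory, with the 30% buffer applied and limits kept under twice the recommendation so the Gatekeeper policy passes.

```yaml
# Illustrative right-sized Deployment; names, image, and numbers are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:latest  # placeholder image
          resources:
            requests:
              cpu: 500m       # P99 ~380m + 30% buffer
              memory: 1200Mi  # P99 ~900Mi + 30% buffer
            limits:
              cpu: "1"
              memory: 1536Mi
```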

Step 3: Switch to Karpenter

The old Cluster Autoscaler worked at the node group level and was slow to react. Karpenter makes provisioning decisions per-pod and can mix instance types intelligently. This alone saved us about 35% on our EC2 bill because Karpenter would provision a c6i.large for a CPU-bound pod instead of always defaulting to the node group's m6i.xlarge.
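A sketch of what that flexibility looks like in a Karpenter NodePool (v1beta1 API; the names and limits here are illustrative, not our exact config). The key difference from a node group is the requirements block: instead of pinning one instance type, you give Karpenter a set of acceptable families and let it pick the cheapest fit per pod.

```yaml
# Sketch of a Karpenter NodePool; names and limits are illustrative.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        name: default  # references an EC2NodeClass defined elsewhere
      requirements:
        # Let Karpenter choose among compute-, general-, and
        # memory-optimized families instead of one fixed type.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "500"  # cap total provisioned CPU for this pool
  disruption:
    consolidationPolicy: WhenUnderutilized
```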

CI/CD Pipelines: Getting to 50 Deploys Per Day

With 200+ services and a dozen teams, the deployment pipeline has to be fast and trustworthy. Nobody should be scared to deploy on a Thursday afternoon.

Our CI/CD pipeline runs on GitHub Actions for build and test, then hands off to ArgoCD for deployment. Here is the rough flow:

  1. PR opened: Lint, unit tests, and Docker build run in parallel
  2. PR merged to main: Docker image pushed to ECR, Helm chart values updated in the GitOps repo
  3. ArgoCD detects the change: Syncs the new manifest to the cluster
  4. Progressive rollout: Argo Rollouts handles canary deployment — 10% traffic for 5 minutes, automated metric checks, then full rollout

# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: payment-service
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vsvc
            routes:
              - primary

The analysis step is critical. It queries Prometheus for the service's error rate and p99 latency. If either degrades beyond a threshold compared to the stable version, the rollout automatically aborts and rolls back. We have caught production issues with this that would have otherwise required manual intervention at 2 AM.
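For reference, the success-rate template referenced above would look roughly like this. The Prometheus address, metric names, and threshold here are placeholders that depend on your telemetry setup; this sketch assumes Istio's standard request metrics.

```yaml
# Sketch of the success-rate AnalysisTemplate; address, labels,
# and thresholds are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 1  # abort the rollout after one failed check
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090  # placeholder
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code!~"5.."}[2m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[2m]))
```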

Service Mesh: Istio's Tradeoffs

We adopted Istio early and I have a love-hate relationship with it. The benefits are real: mutual TLS everywhere, fine-grained traffic routing, observability through distributed tracing, and circuit breaking. The costs are also real: Envoy sidecars add roughly 50-80MB of memory overhead per pod (multiply that by 1000+ pods and it is meaningful), and debugging networking issues through the mesh adds a layer of complexity that has burned us multiple times.

One specific lesson: do not enable Istio injection on every namespace from day one. We started by injecting sidecars into our batch processing jobs, which was pointless. Those jobs talk to a message queue and a database — they do not benefit from service mesh features. We now selectively enable injection only for services that participate in synchronous service-to-service communication.
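Selective injection is just a namespace label: namespaces that should get sidecars carry it, everything else does not. A minimal sketch (namespace name illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments  # illustrative namespace name
  labels:
    istio-injection: enabled  # omit for batch/async-only namespaces
```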

If I were starting from scratch today, I would seriously evaluate whether a simpler alternative like Linkerd or even just Cilium's networking policies would cover 80% of our needs at 20% of the operational complexity.

The Namespace Strategy That Saved Us

Early on, everything lived in the default namespace. Then we moved to one namespace per service (which was too granular), and finally settled on namespace-per-team with resource quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"

This gives each team autonomy within boundaries. The payments team can deploy whatever they want in their namespace, but they cannot accidentally consume the entire cluster's resources. When a team needs more, they submit a PR to the Terraform config and we review it — a conversation about capacity, not a fire drill.

Distributed Systems Realities

Running 200+ microservices means dealing with distributed systems problems constantly. A few patterns that we enforce across all teams:

Timeouts and retries on every outbound call. We do not allow any service to make an HTTP call without an explicit timeout. The default timeout in most HTTP clients is either infinity or 30 seconds — both are wrong. We standardize on 3 seconds for synchronous calls, with one retry and exponential backoff.
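One way to make that standard hard to bypass is to also encode it at the mesh level. A sketch of an Istio VirtualService enforcing the 3-second per-try timeout and single retry (service name and retryOn values are illustrative); Envoy applies exponential backoff with jitter between retries by default:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inventory-service  # illustrative
spec:
  hosts:
    - inventory-service
  http:
    - route:
        - destination:
            host: inventory-service
      timeout: 7s        # total budget: two 3s tries plus backoff
      retries:
        attempts: 1      # one retry
        perTryTimeout: 3s
        retryOn: 5xx,connect-failure,reset
```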

Circuit breakers at the mesh level. Istio's DestinationRule handles this for us. If a downstream service starts returning 5xx errors, the circuit opens and callers get a fast failure instead of waiting for timeouts:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory-service-circuit-breaker
spec:
  host: inventory-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50

Structured logging everywhere. Every service emits JSON logs with a consistent schema — trace ID, service name, environment, and request metadata. These flow into Loki, and our Grafana dashboards let anyone trace a request across ten services in under a minute.
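A representative log line under such a schema might look like this (all field names and values are illustrative, not our exact schema):

```json
{
  "timestamp": "2025-01-28T14:02:11.482Z",
  "level": "error",
  "service": "order-service",
  "env": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "message": "failed to reserve inventory",
  "http": { "method": "POST", "path": "/v1/orders", "status": 502, "duration_ms": 3042 }
}
```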

Observability Is Not Optional

At this scale, you cannot debug production issues by reading logs. You need metrics, traces, and dashboards that tell you where the problem is before you even start looking at code.

Our observability stack:

  • Prometheus for metrics, with service-level SLOs defined for every critical path
  • Grafana for dashboards (we have about 60 dashboards, and maybe 15 of them are actually useful)
  • Loki for log aggregation
  • Jaeger for distributed tracing, integrated through Istio's telemetry

The single most useful dashboard we built shows the "golden signals" for every service: request rate, error rate, and latency (p50, p95, p99). When something breaks, the on-call engineer opens this dashboard first and can usually identify the offending service within 30 seconds.
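The queries behind a dashboard like that are plain rate and histogram_quantile expressions, which can be precomputed with Prometheus recording rules. A sketch, assuming Istio's standard telemetry metric names:

```yaml
# Sketch of golden-signal recording rules; metric and label names
# assume Istio's default request metrics.
groups:
  - name: golden-signals
    rules:
      - record: service:request_rate:rate5m
        expr: sum by (destination_service_name) (rate(istio_requests_total[5m]))
      - record: service:error_ratio:rate5m
        expr: |
          sum by (destination_service_name) (rate(istio_requests_total{response_code=~"5.."}[5m]))
          /
          sum by (destination_service_name) (rate(istio_requests_total[5m]))
      - record: service:latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (destination_service_name, le) (rate(istio_request_duration_milliseconds_bucket[5m])))
```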

Cost Management: The Boring Essential

Cloud native at scale means cloud bills at scale. A few things that keep costs in check:

  • Spot instances for non-critical workloads. Our staging cluster runs entirely on Spot, and even in production, stateless services that can tolerate restarts run on Spot nodes with proper pod disruption budgets.
  • Karpenter consolidation. Karpenter can consolidate pods onto fewer nodes during low-traffic periods and decommission empty nodes. This alone saves us $8-10K per month.
  • Right-sized persistent volumes. We had EBS volumes provisioned at 500Gi that were using 30Gi. A quarterly audit of PVCs has become a ritual.
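For the Spot-hosted services, the disruption budget is what keeps a reclaim or a consolidation pass from taking out too many replicas at once. A minimal sketch (name and threshold illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb  # illustrative
spec:
  minAvailable: 70%  # keep most replicas up during node drains
  selector:
    matchLabels:
      app: order-service
```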

What I Would Do Differently

If I were building this system from scratch:

  1. Start with a platform team from day one. We bolted on platform engineering after the fact, and it cost us a year of catch-up.
  2. Invest in developer experience early. If deploying a new service takes more than 30 minutes from git init to production, something is wrong. We built internal CLIs and templates that got this down to 15 minutes.
  3. Fewer, larger services. Not every bounded context needs its own deployment. We have services that handle 2 requests per minute and have all the operational overhead of a service doing 2000 per second. The sweet spot is probably around 50-80 services for our domain, not 200+.
  4. Budget for observability from the start. Our Prometheus and Loki stack costs about $2K per month. It pays for itself every time it helps us find a bug in 5 minutes instead of 5 hours.

Running Kubernetes at scale is an operational commitment, not a technology choice. The tools are mature enough that the technical problems are largely solved. What remains is the organizational discipline to use them well — and that is the harder problem. If you are interested in how we handle the ML workloads that run on this infrastructure, check out Building Production-Ready RAG Pipelines with LangChain and Fine-Tuning LLMs for Domain-Specific Applications.
