When I tell people we run over 200 microservices on Kubernetes, the first question is usually "why that many?" The honest answer is that it did not start at 200. It started at 12, then grew organically as teams shipped features, and by the time we looked up, we had a sprawling distributed system with its own gravitational pull. This post is about the lessons we learned making that system reliable — and the mistakes that taught us those lessons the hard way.
The Cluster Architecture
We run on AWS EKS across three availability zones. The infrastructure is fully managed through Terraform, and at this point our IaC repository is one of the most critical codebases in the company. Our cluster topology looks roughly like this:
- 3 node groups: general workloads (m6i.xlarge), memory-intensive services (r6i.2xlarge), and GPU nodes for our ML inference workloads (g5.xlarge)
- Karpenter for node autoscaling, replacing the older Cluster Autoscaler
- Istio as the service mesh
- ArgoCD for GitOps-based deployments
- Prometheus + Grafana + Loki for the observability stack
```hcl
# Terraform EKS module configuration (simplified)
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "prod-main"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m6i.xlarge"]
      min_size       = 6
      max_size       = 30
      desired_size   = 12
      labels = {
        "workload-type" = "general"
      }
    }

    memory_optimized = {
      instance_types = ["r6i.2xlarge"]
      min_size       = 2
      max_size       = 10
      desired_size   = 4
      labels = {
        "workload-type" = "memory-intensive"
      }
      taints = [{
        key    = "workload-type"
        value  = "memory-intensive"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
```
This setup supports roughly 800-1200 pods at steady state, spiking to 2000+ during peak traffic.
Resource Requests and Limits: The Silent Killer
If I could go back and fix one thing from day one, it would be enforcing sane resource requests and limits. For the first year, most teams set their resource requests based on vibes. Services would request 2 CPU cores and 4Gi of memory because those were round numbers, not because anyone profiled the application.
The result: massive overprovisioning. We were paying for 3x the compute we actually needed. Pods were spread across too many nodes because Kubernetes thought they were hungrier than they were, and the scheduler could not bin-pack efficiently.
We fixed this in three steps:
Step 1: Profile Everything with VPA
We deployed the Vertical Pod Autoscaler in recommendation mode across every namespace:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no auto-updates
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
```
After two weeks of data collection, we had actual usage profiles for every service. The gap between requested and actual usage was staggering — most services used less than 20% of their requested CPU.
Step 2: Right-size and Enforce
We wrote a script that generated new resource manifests based on VPA recommendations, adding a 30% buffer above the P99 usage. Then we added an OPA Gatekeeper policy that rejected any deployment where requests exceeded twice the VPA recommendation without an explicit exemption.
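The enforcement side can be sketched as a Gatekeeper constraint. The `K8sRequestRatio` kind, `maxRatio` parameter, and exemption label below are hypothetical names (the actual check lives in a custom ConstraintTemplate's Rego), but the shape is representative:

```yaml
# Hypothetical Gatekeeper constraint. Assumes a custom ConstraintTemplate
# (here called K8sRequestRatio) whose Rego compares a Deployment's resource
# requests against the recorded VPA recommendation.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequestRatio
metadata:
  name: cap-requests-at-2x-vpa
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    maxRatio: 2.0                              # requests may not exceed 2x the VPA recommendation
    exemptionLabel: "resources/exempt"         # hypothetical escape-hatch label
```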
Step 3: Switch to Karpenter
The old Cluster Autoscaler worked at the node group level and was slow to react. Karpenter makes provisioning decisions per-pod and can mix instance types intelligently. This alone saved us about 35% on our EC2 bill because Karpenter would provision a c6i.large for a CPU-bound pod instead of always defaulting to the node group's m6i.xlarge.
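A Karpenter NodePool that allows this kind of instance-type mixing looks roughly like the sketch below (the pool name, CPU limit, and category list are illustrative, not our production values):

```yaml
# Illustrative Karpenter NodePool. Karpenter picks the cheapest instance
# type from the allowed set that fits the pending pods' aggregate requests.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]      # let Karpenter mix compute/general/memory families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"                         # hard cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenUnderutilized
```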
CI/CD Pipelines: Getting to 50 Deploys Per Day
With 200+ services and a dozen teams, the deployment pipeline has to be fast and trustworthy. Nobody should be scared to deploy on a Thursday afternoon.
Our CI/CD pipeline runs on GitHub Actions for build and test, then hands off to ArgoCD for deployment. Here is the rough flow:
- PR opened: Lint, unit tests, and Docker build run in parallel
- PR merged to main: Docker image pushed to ECR, Helm chart values updated in the GitOps repo
- ArgoCD detects the change: Syncs the new manifest to the cluster
- Progressive rollout: Argo Rollouts handles canary deployment — 10% traffic for 5 minutes, automated metric checks, then full rollout
```yaml
# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: payment-service
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vsvc
            routes:
              - primary
```
The analysis step is critical. It queries Prometheus for the service's error rate and p99 latency. If either degrades beyond a threshold compared to the stable version, the rollout automatically aborts and rolls back. We have caught production issues with this that would have otherwise required manual intervention at 2 AM.
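For context, a success-rate AnalysisTemplate of the kind referenced above looks roughly like this. The Prometheus address, 99% threshold, and exact query are illustrative; the metric names assume Istio's standard telemetry:

```yaml
# Illustrative AnalysisTemplate: compares the canary's non-5xx request
# ratio against a fixed threshold, checked once per minute.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 1                       # abort the rollout after a single failed check
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # illustrative address
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code!~"5.."}[2m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[2m]))
```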
Service Mesh: Istio's Tradeoffs
We adopted Istio early and I have a love-hate relationship with it. The benefits are real: mutual TLS everywhere, fine-grained traffic routing, observability through distributed tracing, and circuit breaking. The costs are also real: Envoy sidecars add roughly 50-80MB of memory overhead per pod (multiply that by 1000+ pods and it is meaningful), and debugging networking issues through the mesh adds a layer of complexity that has burned us multiple times.
One specific lesson: do not enable Istio injection on every namespace from day one. We started by injecting sidecars into our batch processing jobs, which was pointless. Those jobs talk to a message queue and a database — they do not benefit from service mesh features. We now selectively enable injection only for services that participate in synchronous service-to-service communication.
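Selective injection is just a namespace label; sidecars are injected only where the label is present:

```yaml
# Namespaces opt in to sidecar injection via a label. Batch/worker
# namespaces simply omit it. (Namespace name is an example.)
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    istio-injection: enabled
```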
If I were starting from scratch today, I would seriously evaluate whether a simpler alternative like Linkerd or even just Cilium's networking policies would cover 80% of our needs at 20% of the operational complexity.
The Namespace Strategy That Saved Us
Early on, everything lived in the default namespace. Then we moved to one namespace per service (which was too granular), and finally settled on namespace-per-team with resource quotas:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
```
This gives each team autonomy within boundaries. The payments team can deploy whatever they want in their namespace, but they cannot accidentally consume the entire cluster's resources. When a team needs more, they submit a PR to the Terraform config and we review it — a conversation about capacity, not a fire drill.
Distributed Systems Realities
Running 200+ microservices means dealing with distributed systems problems constantly. A few patterns that we enforce across all teams:
Timeouts and retries on every outbound call. We do not allow any service to make an HTTP call without an explicit timeout. The default timeout in most HTTP clients is either infinity or 30 seconds — both are wrong. We standardize on 3 seconds for synchronous calls, with one retry and exponential backoff.
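For services behind the mesh, those defaults can also be expressed once in a VirtualService instead of in every client library. A sketch, with illustrative service names and a per-try timeout chosen so one retry fits inside the 3-second budget:

```yaml
# Mesh-side timeout and retry policy (sketch). Envoy handles the retry
# with its built-in backoff; the overall call is still bounded at 3s.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inventory-service
spec:
  hosts:
    - inventory-service
  http:
    - route:
        - destination:
            host: inventory-service
      timeout: 3s                       # overall budget for the call
      retries:
        attempts: 1                     # a single retry, per our standard
        perTryTimeout: 1500ms
        retryOn: 5xx,reset,connect-failure
```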
Circuit breakers at the mesh level. Istio's DestinationRule handles this for us. If a downstream service starts returning 5xx errors, the circuit opens and callers get a fast failure instead of waiting for timeouts:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory-service-circuit-breaker
spec:
  host: inventory-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```
Structured logging everywhere. Every service emits JSON logs with a consistent schema — trace ID, service name, environment, and request metadata. These flow into Loki, and our Grafana dashboards let anyone trace a request across ten services in under a minute.
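A representative log line under such a schema (the field names here illustrate the idea, not our exact schema):

```json
{
  "ts": "2024-03-14T09:21:07Z",
  "level": "info",
  "service": "order-service",
  "env": "prod",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "msg": "order created",
  "http": { "method": "POST", "path": "/v1/orders", "status": 201, "duration_ms": 38 }
}
```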
Observability Is Not Optional
At this scale, you cannot debug production issues by reading logs. You need metrics, traces, and dashboards that tell you where the problem is before you even start looking at code.
Our observability stack:
- Prometheus for metrics, with service-level SLOs defined for every critical path
- Grafana for dashboards (we have about 60 dashboards, and maybe 15 of them are actually useful)
- Loki for log aggregation
- Jaeger for distributed tracing, integrated through Istio's telemetry
The single most useful dashboard we built shows the "golden signals" for every service: request rate, error rate, and latency (p50, p95, p99). When something breaks, the on-call engineer opens this dashboard first and can usually identify the offending service within 30 seconds.
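The golden-signal panels are all driven by a handful of PromQL expressions. Expressed as Prometheus recording rules, they look roughly like this (metric and label names assume Istio's standard telemetry; the rule names are illustrative):

```yaml
# Sketch of recording rules behind the golden-signals dashboard.
groups:
  - name: golden-signals
    rules:
      - record: service:request_rate:rate2m
        expr: sum by (destination_service_name) (rate(istio_requests_total[2m]))
      - record: service:error_rate:rate2m
        expr: sum by (destination_service_name) (rate(istio_requests_total{response_code=~"5.."}[2m]))
      - record: service:latency_p99:2m
        expr: |
          histogram_quantile(0.99,
            sum by (destination_service_name, le) (rate(istio_request_duration_milliseconds_bucket[2m])))
```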
Cost Management: The Boring Essential
Cloud native at scale means cloud bills at scale. A few things that keep costs in check:
- Spot instances for non-critical workloads. Our staging cluster runs entirely on Spot, and even in production, stateless services that can tolerate restarts run on Spot nodes with proper pod disruption budgets.
- Karpenter consolidation. Karpenter can consolidate pods onto fewer nodes during low-traffic periods and decommission empty nodes. This alone saves us $8-10K per month.
- Right-sized persistent volumes. We had EBS volumes provisioned at 500Gi that were using 30Gi. A quarterly audit of PVCs has become a ritual.
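The Spot point above depends on disruption budgets doing their job. A minimal PodDisruptionBudget that keeps a stateless service available through node drains and Karpenter consolidation (name and threshold are examples):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 2            # never voluntarily evict below two ready replicas
  selector:
    matchLabels:
      app: order-service
```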
What I Would Do Differently
If I were building this system from scratch:
- Start with a platform team from day one. We bolted on platform engineering after the fact, and it cost us a year of catch-up.
- Invest in developer experience early. If deploying a new service takes more than 30 minutes from `git init` to production, something is wrong. We built internal CLIs and templates that got this down to 15 minutes.
- Fewer, larger services. Not every bounded context needs its own deployment. We have services that handle 2 requests per minute and have all the operational overhead of a service doing 2000 per second. The sweet spot is probably around 50-80 services for our domain, not 200+.
- Budget for observability from the start. Our Prometheus and Loki stack costs about $2K per month. It pays for itself every time it helps us find a bug in 5 minutes instead of 5 hours.
Running Kubernetes at scale is an operational commitment, not a technology choice. The tools are mature enough that the technical problems are largely solved. What remains is the organizational discipline to use them well — and that is the harder problem. If you are interested in how we handle the ML workloads that run on this infrastructure, check out Building Production-Ready RAG Pipelines with LangChain and Fine-Tuning LLMs for Domain-Specific Applications.

