DevOps

Terraform Best Practices for Multi-Cloud Infrastructure

December 5, 2024
8 min read
Amar Sohail
Terraform, AWS, Azure, GCP, Cloud Native, CI/CD Pipelines, Automation, FinOps for AI

TL;DR

We run workloads on AWS, Azure, and GCP, and Terraform is our single control plane across all three. This post covers the patterns that make that sustainable: a three-layer module architecture (primitives, composites, environment configurations), one state file per deployment unit stored in the same cloud as its resources, CI/CD guardrails that scope plans and forbid manual applies, variable validation that blocks unapproved instance types before they hit the bill, and hub-and-spoke cross-cloud networking. Eight platform engineers, 1,400+ resources, three clouds.

The Multi-Cloud Reality

I will be honest: nobody sets out to be multi-cloud on purpose. For us, it happened organically. Our primary workloads ran on AWS, but an acquisition brought in a team deeply invested in Azure, and our ML/AI pipeline team had standardized on GCP's Vertex AI. Within six months, we were managing infrastructure across three clouds with a patchwork of ClickOps, cloud-specific CLIs, and a handful of CloudFormation templates that nobody wanted to touch.

When I took ownership of our infrastructure platform, the first decision was straightforward: Terraform would be our single control plane. Not Pulumi, not Crossplane, not cloud-native IaC tools. Terraform, because it had the broadest provider ecosystem and the largest pool of engineers who already knew it.

What follows is not a Terraform tutorial. It is the set of patterns and guardrails that let a team of 8 platform engineers manage 1,400+ resources across three clouds without losing their minds.

Module Design: The Layered Approach

Early on, we made the classic mistake of writing monolithic Terraform configurations -- one massive main.tf per environment with hundreds of resources. Changing a security group rule required a terraform plan that took 9 minutes and touched resources it had no business evaluating.

We restructured around a three-layer module architecture:

Layer 1: Primitive Modules

These wrap individual cloud resources with our organizational defaults baked in. They are thin, opinionated, and cloud-specific.

# modules/aws/vpc/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = var.vpc_name
  cidr = var.cidr_block

  azs             = var.availability_zones
  private_subnets = var.private_subnet_cidrs
  public_subnets  = var.public_subnet_cidrs

  enable_nat_gateway     = true
  single_nat_gateway     = var.environment == "dev"
  enable_dns_hostnames   = true
  enable_dns_support     = true

  tags = merge(var.tags, {
    ManagedBy   = "terraform"
    Environment = var.environment
    CostCenter  = var.cost_center  # FinOps tag — mandatory
  })
}

Notice the CostCenter tag. We enforce this through a Terraform validation rule -- no resource can be created without a cost allocation tag. This was a FinOps decision that paid for itself within two months when we discovered a dev environment running GPU instances that were costing us $14,000/month.
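The enforcement itself is a plain variable validation; a minimal sketch of the guardrail (the exact condition and file layout are illustrative, not our literal code):

# modules/aws/vpc/variables.tf (illustrative sketch)
variable "cost_center" {
  type        = string
  description = "Cost allocation tag, mandatory on every resource."

  validation {
    condition     = length(trimspace(var.cost_center)) > 0
    error_message = "cost_center must be set -- no resource ships without a cost allocation tag."
  }
}

Because every primitive module requires this variable and merges it into tags, there is no path to an untagged resource short of bypassing the modules entirely.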

Layer 2: Composite Modules

These compose primitive modules into logical platform components. A "backend service" module, for example, provisions a VPC, an EKS node group, an RDS instance, and the IAM roles to glue them together.

# modules/platform/backend-service/main.tf
module "networking" {
  source      = "../../aws/vpc"
  vpc_name    = "${var.service_name}-${var.environment}"
  cidr_block  = var.cidr_block
  environment = var.environment
  cost_center = var.cost_center
  tags        = local.common_tags
}

module "database" {
  source                 = "../../aws/rds-postgres"
  instance_class         = var.db_instance_class
  allocated_storage      = var.db_storage_gb
  subnet_ids             = module.networking.private_subnet_ids
  vpc_security_group_ids = [module.networking.database_sg_id]
  environment            = var.environment
  cost_center            = var.cost_center
}

module "kubernetes" {
  source          = "../../aws/eks-nodegroup"
  cluster_name    = var.eks_cluster_name
  node_group_name = var.service_name
  subnet_ids      = module.networking.private_subnet_ids
  instance_types  = var.node_instance_types
  desired_size    = var.environment == "prod" ? 3 : 1
  cost_center     = var.cost_center
}

Layer 3: Environment Configurations

These are the actual deployments -- thin wrappers that call composite modules with environment-specific variables. Almost no resource definitions live here, just module calls and variable assignments.

# environments/production/us-east-1/backend/main.tf
module "order_service" {
  source = "../../../../modules/platform/backend-service"

  service_name        = "order-service"
  environment         = "prod"
  cost_center         = "platform-eng"
  cidr_block          = "10.1.0.0/16"
  eks_cluster_name    = "prod-us-east-1"
  db_instance_class   = "db.r6g.xlarge"
  db_storage_gb       = 500
  node_instance_types = ["m6i.xlarge"]
}

This layering means a developer deploying a new service does not need to understand VPC CIDR planning or IAM policy syntax. They fill in a module call, open a PR, and the platform team reviews the plan output.

State Management: The One That Bites Everyone

Terraform state is where most multi-cloud setups fall apart. We went through three state management strategies before finding one that works at scale.

What failed: A single S3 backend with path-based workspaces. State file locking conflicts were constant, plan times ballooned as the state grew, and a corrupted state file for one service could block deployments for everything.

What works: One state file per deployment unit, stored in the same cloud as the resources it manages.

# environments/production/us-east-1/backend/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-prod"
    key            = "us-east-1/backend/order-service/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# environments/production/azure-eastus/ml-pipeline/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "companyterraformstate"
    container_name       = "tfstate"
    key                  = "azure-eastus/ml-pipeline/terraform.tfstate"
  }
}

Each cloud's resources have their state stored in that cloud's native storage. AWS resources use S3, Azure resources use Azure Blob Storage, GCP resources use GCS. This eliminates cross-cloud dependencies in the state layer and means a cloud provider outage only affects deployments to that cloud.
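The GCP side follows the same pattern; the equivalent GCS backend looks along these lines (bucket and prefix are illustrative):

# environments/production/gcp-us-central1/ml-training/backend.tf (illustrative)
terraform {
  backend "gcs" {
    bucket = "company-terraform-state-prod"
    prefix = "gcp-us-central1/ml-training" # state object lives under this prefix
  }
}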

We currently manage around 60 independent state files. That sounds like a lot, but each one is small, fast to plan, and independently lockable. Our average terraform plan dropped from 9 minutes to under 40 seconds.

CI/CD Pipeline: The Guardrails

Our CI/CD pipeline for Terraform runs in GitHub Actions, and we treat terraform apply like a production deployment -- because it is one.

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ["environments/**", "modules/**"]
  push:
    branches: [main]
    paths: ["environments/**", "modules/**"]

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      changed_dirs: ${{ steps.changes.outputs.dirs }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so origin/main is available for the diff below
      - id: changes
        run: |
          dirs=$(git diff --name-only origin/main... | \
            grep '^environments/' | \
            xargs -I {} dirname {} | \
            sort -u | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "dirs=$dirs" >> $GITHUB_OUTPUT

  plan:
    needs: detect-changes
    if: needs.detect-changes.outputs.changed_dirs != '[]' # skip when no environment dirs changed
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dir: ${{ fromJson(needs.detect-changes.outputs.changed_dirs) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: ${{ matrix.dir }}
      - run: terraform plan -out=tfplan -no-color
        working-directory: ${{ matrix.dir }}
      - uses: actions/upload-artifact@v4
        with:
          name: plan-${{ hashFiles(format('{0}/**', matrix.dir)) }} # glob so the hash is non-empty and unique per dir
          path: ${{ matrix.dir }}/tfplan

Key guardrails in this pipeline:

  1. Change detection scopes plans. We only run terraform plan on directories that actually changed. This keeps PR feedback loops under 2 minutes.
  2. Plan artifacts are saved and reused for apply. The apply job downloads the exact plan artifact that was reviewed, so there is no drift between what was approved and what gets applied.
  3. No manual applies. Engineers cannot run terraform apply from their laptops. This is enforced by IAM policies that restrict write permissions to the CI service account.
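The apply side of guardrail 2 can be sketched as follows. This is illustrative and simplified to a single deployment unit; the job name, environment gate, and artifact name are assumptions, and the real job mirrors the plan job's matrix:

# Sketch of the apply job (illustrative, single deployment unit)
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production # assumes a required-reviewers gate in repo settings
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: environments/production/us-east-1/backend
      - uses: actions/download-artifact@v4
        with:
          name: plan-backend # must match the name the plan job uploaded
          path: environments/production/us-east-1/backend
      # Apply exactly the plan that was reviewed -- never a fresh plan
      - run: terraform apply -input=false tfplan
        working-directory: environments/production/us-east-1/backend

Applying a saved plan file needs no -auto-approve; Terraform only executes the actions recorded in the artifact.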

FinOps: Making Costs Visible Before They Hit the Bill

This is where most infrastructure teams drop the ball. Writing Terraform to provision resources is easy. Understanding what those resources will cost before you provision them is where the real value lies.

We integrated Infracost into our CI pipeline so every PR that modifies infrastructure shows a cost estimate in the PR comment:
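The wiring is a few extra steps in the plan job; a sketch (the flags follow Infracost's documented CLI, but treat the exact steps as illustrative):

      - uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - run: infracost breakdown --path=${{ matrix.dir }} --format=json --out-file=/tmp/infracost.json
      # Posts (or updates) a single cost-estimate comment on the PR
      - run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ secrets.GITHUB_TOKEN }} \
            --behavior=update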

# We also use Terraform variables with validation to prevent cost blowups
variable "db_instance_class" {
  type = string
  validation {
    condition = contains([
      "db.t3.medium", "db.t3.large",
      "db.r6g.large", "db.r6g.xlarge", "db.r6g.2xlarge"
    ], var.db_instance_class)
    error_message = "Instance class not in approved list. File a request in #platform-eng for exceptions."
  }
}

variable "node_instance_types" {
  type = list(string)
  validation {
    condition = alltrue([
      for t in var.node_instance_types :
      !can(regex("^(p[0-9]|g[0-9]|x[0-9])", t))
    ])
    error_message = "GPU and high-memory instances require FinOps approval. See runbook/finops-exceptions."
  }
}

That second validation rule prevents anyone from accidentally provisioning GPU instances (the p, g, and x families) without going through our FinOps approval process. This single validation saved us over $40,000 in the first quarter after we added it, mostly from dev/staging environments where engineers were testing ML workloads on p3.2xlarge instances and forgetting to tear them down.

Cross-Cloud Networking: The Hard Part

The least-discussed challenge in multi-cloud Terraform is networking. Our AWS workloads need to talk to Azure-hosted ML services, and our GCP-hosted AI training pipelines need to pull data from AWS S3.

We standardized on a hub-and-spoke model where each cloud has a transit gateway or equivalent, and cross-cloud connectivity goes through dedicated VPN tunnels managed by Terraform:

# modules/cross-cloud/aws-azure-vpn/main.tf
resource "aws_vpn_gateway" "main" {
  vpc_id = var.aws_vpc_id
  tags   = { Name = "vpn-to-azure-${var.environment}" }
}

resource "azurerm_virtual_network_gateway" "main" {
  name                = "vpn-to-aws-${var.environment}"
  location            = var.azure_location
  resource_group_name = var.azure_resource_group

  type     = "Vpn"
  vpn_type = "RouteBased"
  sku      = "VpnGw2"

  ip_configuration {
    public_ip_address_id = azurerm_public_ip.vpn.id # public IP resource defined elsewhere in the module
    subnet_id            = var.azure_gateway_subnet_id
  }
}

The trick is managing the shared secrets and IP allocations across providers. We store VPN pre-shared keys in HashiCorp Vault and reference them through the Vault Terraform provider rather than hardcoding them in variable definitions. (Values read from Vault still land in state, which is one more reason we encrypt state at rest.)
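Concretely, the Vault reference looks something like this (the Vault path, key names, and the customer gateway resource are illustrative):

# modules/cross-cloud/aws-azure-vpn/secrets.tf (illustrative)
data "vault_generic_secret" "vpn_psk" {
  path = "secret/network/aws-azure-vpn/${var.environment}" # illustrative Vault path
}

resource "aws_vpn_connection" "to_azure" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.azure.id # defined elsewhere in the module
  type                = "ipsec.1"

  # Pre-shared key comes from Vault, never from a variable or a committed file
  tunnel1_preshared_key = data.vault_generic_secret.vpn_psk.data["tunnel1_psk"]
}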

Lessons After 18 Months

Three things I wish I had known at the start:

Module versioning is essential, not optional. We use a private Terraform module registry backed by our Git monorepo. Every module has a semantic version, and environment configurations pin to specific versions. Unpinned modules are a ticking time bomb.
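In practice a pinned module call in an environment configuration looks like this (the registry address and version are illustrative):

module "order_service" {
  source  = "registry.company.internal/platform/backend-service/aws" # illustrative private registry address
  version = "~> 2.3.0" # pinned; bumped deliberately via PR, never floating
  # ... variables as in the environment example above
}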

terraform import is your best friend during migration. When we brought the Azure infrastructure under Terraform management, we imported existing resources rather than recreating them. It took two weeks of tedious import commands, but zero downtime.
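A typical import from that migration looked roughly like this (the module address and Azure resource ID are illustrative):

# Adopt an existing Azure VNet into state without recreating it (illustrative IDs)
terraform import 'module.networking.azurerm_virtual_network.main' \
  '/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-ml-prod/providers/Microsoft.Network/virtualNetworks/vnet-ml-prod'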

Invest in developer experience. We built a CLI wrapper called infra that handles backend initialization, workspace selection, and plan formatting. Engineers run infra plan order-service prod instead of navigating directory trees and running raw Terraform commands. Adoption tripled after we shipped it.

Multi-cloud Terraform is a marathon, not a sprint. The patterns I have described here took us 18 months to refine, and we are still iterating. If you are running workloads on Kubernetes across these clouds, I wrote about the operational patterns we use in our Kubernetes scaling post. And if your event-driven systems are part of what you are provisioning, our approach to Kafka infrastructure management covers how the application and infrastructure layers connect.
