TL;DR
We replaced the synchronous REST pipeline in our order-processing monolith with an event-driven architecture built on Apache Kafka and Spring Boot. Six months in: 4x throughput, p99 latency down from 12 seconds to under 800 milliseconds, and zero cascading-failure incidents. This post walks through the event schema design, consumer configuration, idempotency, and observability patterns that got us there, plus the mistakes we made along the way.
Why We Ditched the Monolith's Synchronous Pipeline
Last year, our order processing system hit a wall. We had a monolithic Spring Boot application handling around 12,000 orders per hour during peak traffic, and every downstream service -- inventory, payments, shipping, notifications -- was wired together with synchronous REST calls. One slow response from the payment gateway would cascade into request thread exhaustion, and the whole pipeline would grind to a halt. We were spending more time firefighting than shipping features.
The decision to move to an event-driven architecture was not born from a whiteboard session about "clean architecture." It came from a 3 AM incident where a payment provider's latency spike caused a 47-minute outage that cost us real revenue. That was the catalyst.
I want to share exactly how we rebuilt this system on Apache Kafka and Spring Boot, the mistakes we made along the way, and the patterns that actually held up in production.
Designing the Event Schema
The first major decision was event schema design. We went through three iterations before landing on something stable.
Iteration 1: Fat Events (Don't Do This)
Our initial instinct was to pack everything into the event payload -- full order details, customer profile, shipping address, inventory snapshots. This felt convenient at first because consumers had all the data they needed without making additional API calls.
```java
// This was our first attempt — a fat event with everything embedded
public record OrderCreatedEvent(
    String orderId,
    CustomerProfile customer,          // Full customer object
    List<OrderItem> items,             // Full product details
    ShippingAddress shippingAddress,   // Full address object
    PaymentDetails paymentDetails,     // Full payment info
    InventorySnapshot inventory,       // Inventory state at order time
    Instant createdAt
) {}
```
Within two weeks, we hit the first problem: schema coupling. When the customer service team changed the CustomerProfile schema, every single consumer broke. We were back to the same tight coupling we had with REST, just asynchronous now.
Iteration 2: Thin Events with References
We swung to the other extreme -- events carrying only IDs and a type discriminator, forcing consumers to hydrate data from source services.
```java
public record OrderCreatedEvent(
    String orderId,
    String customerId,
    List<String> itemIds,
    Instant createdAt
) {}
```
This solved coupling but introduced a new problem: during high-throughput periods, consumers hammered the source services with lookup calls, creating the exact same cascading failure pattern we were trying to escape.
Iteration 3: Right-Sized Events
The pattern that worked was what I call "right-sized events." Each event carries the data that the event itself owns -- the facts about what happened -- plus stable identifiers for cross-referencing.
```java
public record OrderCreatedEvent(
    @NotNull String eventId,
    @NotNull String orderId,
    @NotNull String customerId,
    @NotNull List<OrderLineItem> lineItems, // itemId + quantity + priceAtOrder
    @NotNull MonetaryAmount orderTotal,
    @NotNull Instant occurredAt,
    @NotNull String correlationId
) {}
```
The lineItems include the price at order time (an immutable fact about the event), not the current product price. The customerId is a stable reference -- consumers that need the full profile can look it up and cache it. This distinction between event-owned facts and reference data was the key insight.
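The post references OrderLineItem and MonetaryAmount without showing them. Here is a minimal sketch of what such event-owned types could look like, with a hypothetical total() helper; the names, fields, and helper are assumptions for illustration, not the production schema:

```java
import java.math.BigDecimal;
import java.util.List;

public class RightSizedEventSketch {
    // A simple money type; real systems often add scale/rounding rules
    public record MonetaryAmount(BigDecimal amount, String currency) {}

    // priceAtOrder is an immutable fact the event owns, not a live catalog lookup
    public record OrderLineItem(String itemId, int quantity, MonetaryAmount priceAtOrder) {}

    // Hypothetical helper: sums line totals using the captured prices
    public static MonetaryAmount total(List<OrderLineItem> items, String currency) {
        BigDecimal sum = items.stream()
                .map(li -> li.priceAtOrder().amount()
                        .multiply(BigDecimal.valueOf(li.quantity())))
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        return new MonetaryAmount(sum, currency);
    }

    public static void main(String[] args) {
        var items = List.of(
                new OrderLineItem("sku-1", 2, new MonetaryAmount(new BigDecimal("9.99"), "USD")),
                new OrderLineItem("sku-2", 1, new MonetaryAmount(new BigDecimal("4.50"), "USD")));
        System.out.println(total(items, "USD").amount()); // 24.48
    }
}
```

Capturing priceAtOrder inside the line item is exactly what makes the event self-contained: the fact survives later catalog price changes without any cross-service lookup.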
We enforced this with Confluent Schema Registry using Avro schemas and BACKWARD compatibility mode. Any breaking schema change gets caught in CI before it reaches a topic.
The Spring Boot Consumer Architecture
Our consumer setup evolved significantly. Early on, we used @KafkaListener with default configurations and learned the hard way that defaults are not production-ready.
Consumer Configuration That Actually Works
```java
@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>
            orderEventListenerFactory(ConsumerFactory<String, OrderCreatedEvent> cf,
                                      KafkaTemplate<Object, Object> kafkaTemplate) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>();
        factory.setConsumerFactory(cf);
        factory.setConcurrency(3); // Match partition count
        factory.getContainerProperties()
               .setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3) // 3 retries, 1s apart
        ));
        factory.getContainerProperties().setIdleEventInterval(60000L);
        return factory;
    }
}
```
Three things here are non-negotiable in production:
- Manual acknowledgment (MANUAL_IMMEDIATE). Auto-commit is a data loss vector: offsets can be committed before processing finishes, so a crash at the wrong moment silently skips those messages.
- Dead letter topics. After 3 retries, the message goes to a DLT, where a separate reconciliation service alerts on-call engineers and attempts reprocessing with human oversight.
- Concurrency matching partition count. We run 3 consumer threads per instance and deploy 4 instances against 12 partitions, so each thread gets exactly one partition, maximizing throughput without rebalancing churn.
Idempotency: The Pattern Nobody Gets Right the First Time
Every Kafka tutorial mentions "make your consumers idempotent." None of them explain what that looks like when your consumer writes to a relational database and calls two downstream APIs.
We use a transactional outbox pattern combined with an idempotency key table:
```java
@Service
@Transactional
public class OrderFulfillmentService {

    private static final Logger log = LoggerFactory.getLogger(OrderFulfillmentService.class);

    private final IdempotencyKeyRepository idempotencyRepo;
    private final FulfillmentRepository fulfillmentRepo;

    public OrderFulfillmentService(IdempotencyKeyRepository idempotencyRepo,
                                   FulfillmentRepository fulfillmentRepo) {
        this.idempotencyRepo = idempotencyRepo;
        this.fulfillmentRepo = fulfillmentRepo;
    }

    public void handleOrderCreated(OrderCreatedEvent event) {
        // Check if we've already processed this event
        if (idempotencyRepo.existsByEventId(event.eventId())) {
            log.info("Duplicate event detected, skipping: {}", event.eventId());
            return;
        }

        // Process the event
        var fulfillment = FulfillmentRecord.from(event);
        fulfillmentRepo.save(fulfillment);

        // Record that we've processed this event — same transaction
        idempotencyRepo.save(new IdempotencyKey(event.eventId(), Instant.now()));
    }
}
```
The critical detail: the idempotency check and the business write happen in the same database transaction. If the transaction rolls back, both the business write and the idempotency record roll back together. No partial state. A unique constraint on the event ID column backstops the exists-then-save check, so two consumers racing on the same duplicate cannot both commit.
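The duplicate-skip behavior can be illustrated without Spring or a database. In this simplified sketch (not the production code), a concurrent set stands in for the idempotency key table, and a counter stands in for the business write:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotencySketch {
    // Stands in for the idempotency_key table; in production this lives
    // in the same database transaction as the business write
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger fulfillmentsCreated = new AtomicInteger();

    /** Returns true if the event was processed, false if skipped as a duplicate. */
    public boolean handle(String eventId) {
        // Set.add is atomic here: only the first delivery of an eventId wins
        if (!processedEventIds.add(eventId)) {
            return false; // duplicate delivery, skip
        }
        fulfillmentsCreated.incrementAndGet(); // the "business write"
        return true;
    }

    public int fulfillments() {
        return fulfillmentsCreated.get();
    }

    public static void main(String[] args) {
        var svc = new IdempotencySketch();
        svc.handle("evt-1");
        svc.handle("evt-1"); // Kafka redelivery of the same event
        svc.handle("evt-2");
        System.out.println(svc.fulfillments()); // 2
    }
}
```

The in-memory set makes the atomicity trivial; the database version gets the same guarantee from the shared transaction plus the unique constraint.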
Exactly-Once Semantics: Harder Than You Think
Kafka's exactly-once semantics (EOS) is a feature I see teams misunderstand constantly. EOS in Kafka guarantees exactly-once within the Kafka ecosystem -- from producer to topic to consumer to topic. The moment your consumer writes to an external system (a database, an API), you are outside the EOS boundary.
We handle this with the pattern above -- idempotent consumers with transactional writes. For the producer side, we enable EOS to prevent duplicate messages from producer retries:
```yaml
spring:
  kafka:
    producer:
      acks: all
      transaction-id-prefix: order-service-
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
    consumer:
      isolation-level: read_committed
```
Setting isolation-level: read_committed on consumers ensures they only read messages from committed producer transactions. This eliminated a class of phantom-read bugs we were seeing during load tests.
Monitoring and the Observability Gap
One thing that caught us off guard was the observability gap. With synchronous REST calls, you can trace a request from ingress to response. With events, a single user action might produce 6 events consumed by 9 different services over a span of 30 seconds.
We solved this with correlation IDs propagated through every event. Every event carries a correlationId that traces back to the original user action. We pipe these into our OpenTelemetry setup, and our Grafana dashboards show the full event flow for any given order.
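Propagation itself is mechanical: any event emitted while handling another event copies the inbound correlationId unchanged, while root events mint a fresh one. A minimal sketch, where the event shape and helper names are illustrative rather than the production code:

```java
import java.time.Instant;
import java.util.UUID;

public class CorrelationSketch {
    // Every event carries its own id plus the correlation id of the
    // originating user action
    public record Event(String eventId, String type, String correlationId, Instant occurredAt) {}

    // A root event (triggered directly by a user action) mints a fresh correlation id
    public static Event newRootEvent(String type) {
        String id = UUID.randomUUID().toString();
        return new Event(id, type, id, Instant.now());
    }

    // Downstream events inherit the parent's correlation id unchanged
    public static Event derivedFrom(Event parent, String type) {
        return new Event(UUID.randomUUID().toString(), type, parent.correlationId(), Instant.now());
    }

    public static void main(String[] args) {
        Event root = newRootEvent("order.created");
        Event child = derivedFrom(root, "payment.authorized");
        Event grandchild = derivedFrom(child, "shipment.requested");
        // All three share one correlation id, so tracing can stitch them together
        System.out.println(root.correlationId().equals(grandchild.correlationId())); // true
    }
}
```

In practice the copy happens in one shared publishing helper, so individual consumers cannot forget to propagate the id.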
We also built a lightweight Python service using FastAPI that monitors consumer lag across all consumer groups and pushes alerts when lag exceeds thresholds:
```python
from fastapi import FastAPI
from confluent_kafka import ConsumerGroupTopicPartitions
from confluent_kafka.admin import AdminClient

app = FastAPI()
ALERT_THRESHOLD_SECONDS = 120

# One AdminClient for the process, rather than one per request
admin = AdminClient({"bootstrap.servers": "kafka:9092"})

@app.get("/consumer-lag/{group_id}")
async def get_consumer_lag(group_id: str):
    # list_consumer_group_offsets returns a dict of futures keyed by group id
    futures = admin.list_consumer_group_offsets([ConsumerGroupTopicPartitions(group_id)])
    offsets = futures[group_id].result()
    # Compare committed offsets to high watermarks; calculate_lag and
    # send_pagerduty_alert are helpers defined elsewhere in this service
    lag_data = calculate_lag(offsets)
    if lag_data["estimated_seconds_behind"] > ALERT_THRESHOLD_SECONDS:
        await send_pagerduty_alert(group_id, lag_data)
    return lag_data
```
This service became one of our most critical monitoring tools. Consumer lag is the single most important metric in an event-driven system -- it tells you if your consumers are keeping up, and it is the earliest warning sign of a problem.
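For reference, the lag arithmetic such a service relies on is simple: per partition, lag is the log-end offset (high watermark) minus the group's last committed offset, and group lag is the sum across partitions. A standalone sketch, where the method name and "topic-partition" string keys are illustrative simplifications:

```java
import java.util.Map;

public class ConsumerLagSketch {
    /**
     * Total lag for a consumer group: for each partition, the high watermark
     * minus the committed offset (a partition with no committed offset counts
     * its full high watermark as lag). Keys are "topic-partition" strings.
     */
    public static long totalLag(Map<String, Long> highWatermarks,
                                Map<String, Long> committed) {
        return highWatermarks.entrySet().stream()
                .mapToLong(e -> Math.max(0L,
                        e.getValue() - committed.getOrDefault(e.getKey(), 0L)))
                .sum();
    }

    public static void main(String[] args) {
        var watermarks = Map.of("orders-0", 1_500L, "orders-1", 1_200L);
        var committed = Map.of("orders-0", 1_450L, "orders-1", 1_200L);
        System.out.println(totalLag(watermarks, committed)); // 50
    }
}
```

Converting offset lag into "estimated seconds behind" additionally needs a consumption-rate estimate, which is why the production service tracks lag over time rather than as a point-in-time number.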
What I Would Do Differently
If I were starting this migration again, three things would change:
Start with fewer topics. We split topics too finely early on, creating a topic per entity per action (order.created, order.updated, order.cancelled, order.item.added...). The operational overhead of managing 40+ topics was brutal. I would start with coarser-grained topics and split later based on actual consumer access patterns.
Invest in a schema governance process from day one. Schema Registry catches breaking changes, but it does not enforce good design. We now have a lightweight RFC process where any new event schema requires a one-page design doc reviewed by at least one consumer team. This has prevented more bugs than any technical control.
Build the dead letter queue reconciliation tooling before going to production. We launched without it and spent the first two months manually replaying failed messages through ad-hoc scripts. The DLQ reconciliation dashboard we eventually built was one of the highest-ROI pieces of tooling in the entire migration.
The Results
Six months post-migration, our numbers tell the story: order processing throughput increased 4x, end-to-end latency dropped from a p99 of 12 seconds to under 800 milliseconds, and we have not had a single cascading failure incident. The system handled Black Friday traffic with zero intervention.
Event-driven architecture is not a silver bullet. It introduces complexity in debugging, testing, and schema management that synchronous systems simply do not have. But for high-throughput, loosely-coupled systems where availability matters more than simplicity, it is the right trade-off. If you are evaluating a similar move, I recommend reading about how we handle infrastructure-as-code for the underlying Kafka clusters and how we manage the microservices that sit on top of this event backbone.