Context
A payments operator serving Australia and the wider APAC region needed tighter clearing windows. Tail-latency jitter occasionally produced misaligned reconciliation files, and engineering and finance still disagreed on what counted as a “successful” transaction: observability signals did not map cleanly to ledger accounts.
Constraints
Peak-hour synchronous chains across ledger, risk, and notification tiers exhausted connection pools when a single hop timed out. Compensation paths and idempotency keys were inconsistent, so rare duplicate messages created duplicate postings. Logs lacked a single correlation identifier, stretching mean time to recovery.
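To make the duplicate-posting failure concrete, here is a minimal sketch of how a consistently applied idempotency key turns a redelivered message into a no-op. The `Ledger` class, the key format `pay-123:capture`, and the account name are illustrative assumptions, not the operator's actual schema, and a real ledger would keep the seen keys in durable storage rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory ledger (hypothetical) that ignores redelivered messages."""

    postings: list = field(default_factory=list)
    seen_keys: set = field(default_factory=set)

    def post(self, idempotency_key: str, account: str, amount_cents: int) -> bool:
        """Append a posting once per key; return False when the key was already used."""
        if idempotency_key in self.seen_keys:
            return False  # duplicate delivery: do not post again
        self.seen_keys.add(idempotency_key)
        self.postings.append((idempotency_key, account, amount_cents))
        return True


ledger = Ledger()
first = ledger.post("pay-123:capture", "merchant_settlement", 2500)
second = ledger.post("pay-123:capture", "merchant_settlement", 2500)  # redelivery
```

The second call is rejected, so the ledger holds one posting even though the message arrived twice. The key has to be derived the same way on every path, including compensation paths, or the check does not help.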
What we did
We split the critical path into parallelisable stages, decoupled authorisation from capture, and moved ledger posting toward an append-only, event-sourced model with end-of-day true-up tasks. External calls gained deduplication keys, configurable backoff, and dead-letter queues with dashboards. Gateways and batch ingress now inject trace_id and settlement_batch_id into structured logs. Together with finance we defined an explicit “postable” state machine that binds each transition to a journal template, which leaves no room for informal “half-success” states on the engineering side.
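One way to picture the “postable” state machine is a transition table in which every allowed move names the journal template it triggers. The sketch below is illustrative only: the state names, the template identifiers such as `JT_CAPTURE_HOLD`, and the `transition` helper are assumptions for the example, not the definitions agreed with finance.

```python
from enum import Enum


class State(Enum):
    AUTHORISED = "authorised"
    CAPTURED = "captured"
    POSTED = "posted"
    REVERSED = "reversed"


# Hypothetical table: each allowed (from, to) pair maps to a journal template.
TRANSITIONS = {
    (State.AUTHORISED, State.CAPTURED): "JT_CAPTURE_HOLD",
    (State.CAPTURED, State.POSTED): "JT_SETTLEMENT_POST",
    (State.POSTED, State.REVERSED): "JT_REVERSAL",
}


class IllegalTransition(Exception):
    """Raised when a move has no journal template and is therefore not postable."""


def transition(current: State, target: State) -> tuple:
    """Return (new_state, journal_template) or raise if the move is not defined."""
    template = TRANSITIONS.get((current, target))
    if template is None:
        raise IllegalTransition(
            f"{current.value} -> {target.value} has no journal template"
        )
    return target, template


new_state, template = transition(State.AUTHORISED, State.CAPTURED)
```

Because a transition without a template raises instead of silently succeeding, a state such as "captured but not really" cannot be introduced in code without a matching entry that finance has signed off on.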
Outcomes
We ran load simulations in pre-production before a production cutover that used canary reads and dual writes. Tail latency at p99 dropped and stayed stable through business peaks. Reconciliation tickets fell into an acceptable band within three cycles, and each remaining discrepancy was traceable to subsystem and batch identifiers. Runbooks now list rollback triggers and finance escalation contacts for audit traceability.

