Context
During sales peaks the warehouse system showed state regressions and duplicate pick alerts; transport tracking produced contradictory nodes when messages arrived out of order; handheld scanners amplified lock contention. Leadership required remediation without cancelling peak events while keeping asynchronous carrier integrations.
Constraints
State transitions were not universally idempotent; retries amplified when downstream rate limits hit; batch jobs shared tables with online traffic causing long transactions. Transport consumers masked disorder with last-writer-wins, forking timelines in reporting. Reservation versus finance confirmation lacked explicit compensating steps.
What we did
We replaced ad-hoc branching with an explicit state machine using version numbers and business idempotency keys; consumers used partition keys to bound ordering; hot and cold paths were split with batch windows moved off the critical path. Contract tests and replay harnesses covered WMS/TMS interfaces. Reservation and release gained queryable audit tables reconciled nightly. Retries gained jitter and caps with downstream token buckets to prevent storms.
Outcomes
No unexplained state forks during peak; retry rates stayed bounded; lock waits improved and shipment-level reconciliation met SLA. Cross-organisation disputes could be tied to message fingerprints and replay artefacts instead of anecdote.

