Event-driven architectures handle failures through retry strategies. Networks drop packets. Databases lock up. APIs return 500 errors during deployments. The retry implementation determines whether these transient issues resolve gracefully or require manual intervention.
Order fulfillment sequences have different requirements than product catalog indexing. One demands strict message ordering to prevent data corruption. The other prioritizes high throughput over sequence. Applying the same retry approach to both creates either consistency problems or performance bottlenecks.
This article examines two retry approaches: blocking and non-blocking. We'll cover when each makes sense, how to prevent duplicate processing, and Dead Letter Queue strategies. These patterns apply whether you're building a platform from scratch or evaluating existing solutions.
A log entry failure creates a dashboard gap. An order creation message failure means lost revenue and a customer who completed checkout but has no order record. Different stakes require different handling.
Payment must be captured before inventory allocation. Inventory must be allocated before fulfillment triggers. Shipping notifications require existing shipments. These dependencies constrain retry approaches to prevent data corruption and financial losses.
External system failures complicate things further. Payment gateways go down. Tax APIs hit rate limits. Shipping carriers deploy updates that break integrations. Retry strategies need to handle these failures without cascading problems.
The Broadleaf framework supports both blocking and non-blocking retry strategies. For workflows that require exactly-once processing, it also provides the IdempotentMessageConsumptionService.
When product price updates fail to index in search, that failure doesn't need to block subsequent product indexing. The products are independent.
Non-blocking retry forwards failed messages to separate retry topics while the main consumer continues processing. Different backoff intervals can be configured: quick retries for transient issues, longer waits for operations needing recovery time.
Search Implementation: The Search service primarily uses RetryableOperationUtil (an in-memory, blocking retry mechanism) or standard blocking retries (via BlockingRetryProperties), not the Kafka-based non-blocking infrastructure.
Email notifications have similar characteristics. SMTP issues preventing one order confirmation shouldn't affect shipping notifications for other orders. The failed email retries independently.
Marketing automation, analytics, and recommendation updates benefit from this approach when designed for eventual consistency. Seconds or minutes of delay don't break functionality.
Tradeoffs exist. Messages are no longer processed in order. Failed messages that retry may complete after later-arriving messages that succeeded immediately. For workflows where sequence determines correctness, this creates problems.
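The forwarding behavior and the reordering it causes can be sketched with an in-memory simulation. This is an illustration only: the class and method names are hypothetical, plain queues stand in for Kafka topics, and every retry is assumed to succeed on its first retry attempt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for a main topic plus one retry topic (no Kafka involved). */
class NonBlockingRetrySimulation {

    /**
     * Processes messages in arrival order. A message that fails its first
     * attempt is forwarded to the retry queue and the main loop moves on.
     * Returns message ids in the order they actually completed.
     */
    static List<String> completionOrder(List<String> arrivals,
                                        Predicate<String> failsFirstAttempt) {
        Deque<String> retryTopic = new ArrayDeque<>();
        List<String> completed = new ArrayList<>();

        // Main consumer: never waits on a failure.
        for (String id : arrivals) {
            if (failsFirstAttempt.test(id)) {
                retryTopic.addLast(id);
            } else {
                completed.add(id);
            }
        }

        // Retry consumer: runs later, after its backoff.
        // Assumption for this sketch: every retry succeeds.
        while (!retryTopic.isEmpty()) {
            completed.add(retryTopic.removeFirst());
        }
        return completed;
    }
}
```

With arrivals A, B, C where only A fails its first attempt, the completion order is B, C, A. That is harmless for independent messages and a problem when A had to finish before B.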
Additional infrastructure is required: more Kafka topics, separate consumer groups, and configuration management. Idempotency implementation becomes critical because timing windows can cause the same message to reach both main and retry consumers.
Order fulfillment requires ordered processing: payment capture before inventory allocation, inventory allocation before fulfillment, fulfillment before shipping notifications. The sequence preserves correctness.
Blocking retry stops at the failed message, retries it, then proceeds. If payment capture fails, the consumer blocks and retries until it succeeds, then processes the next message.
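A minimal sketch of the blocking pattern follows. The names are hypothetical and this is not Broadleaf's implementation. Unlike the "retry until success" description above, the sketch caps the number of attempts so that it always terminates.

```java
import java.util.function.Supplier;

class BlockingRetry {

    /**
     * Runs the task until it succeeds or maxAttempts is reached, sleeping
     * backoffMillis between attempts. The calling thread is occupied the whole
     * time, which is what keeps later messages from overtaking this one.
     */
    static <T> T call(Supplier<T> task, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                sleep(backoffMillis);
            }
        }
        throw last; // all attempts failed
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

A consumer loop would wrap each message handler in `BlockingRetry.call(...)` and only move to the next message once the call returns.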
A Kafka implementation detail matters: blocking on a failed message can trigger consumer rebalancing if retry backoff exceeds max.poll.interval.ms. Because the consumer has not polled within that interval, it appears dead and Kafka reassigns its partitions, which compounds the issue.
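The risk can be checked with simple arithmetic before choosing backoff values. The sketch below assumes a consumer thread that sleeps in place between attempts, uses exponential backoff, and ignores processing time. The class and method names are hypothetical. The 300000 ms figure is Kafka's default for max.poll.interval.ms.

```java
class PollIntervalCheck {

    /** Kafka's default for max.poll.interval.ms (5 minutes). */
    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L;

    /** Total time spent waiting across all retries with exponential backoff. */
    static long totalBackoffMillis(long initialMillis, double multiplier, int retries) {
        long total = 0;
        double delay = initialMillis;
        for (int i = 0; i < retries; i++) {
            total += (long) delay;
            delay *= multiplier;
        }
        return total;
    }

    /** True when in-place waiting alone would exceed the poll interval. */
    static boolean risksRebalance(long totalBackoffMillis, long maxPollIntervalMs) {
        return totalBackoffMillis > maxPollIntervalMs;
    }
}
```

Five retries starting at 10 seconds and doubling each time add up to 310 seconds of waiting, which exceeds the 300-second default.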
Broadleaf's implementation uses seek-to-current-offset by default. When a message fails, the consumer repositions to that offset rather than waiting passively. This allows extended backoff without triggering rebalances. The consumer remains active while a message waits out its retry backoff.
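The repositioning step can be pictured with a toy model of one partition and one consumer position. This is neither the Kafka client API nor Broadleaf's code. All names are hypothetical, records are delivered one at a time, and backoff timing is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for one partition and one consumer position. */
class SeekToCurrentSimulation {

    /**
     * Polls one record at a time. On failure the position is moved back to the
     * failed offset, so the next poll returns the same record again. Returns
     * the offsets in the order they were handed to the handler.
     */
    static List<Integer> deliveredOffsets(List<String> partition,
                                          Predicate<String> handlerSucceeds,
                                          int maxPolls) {
        List<Integer> delivered = new ArrayList<>();
        int position = 0;
        for (int poll = 0; poll < maxPolls && position < partition.size(); poll++) {
            int offset = position;
            String record = partition.get(offset);
            position = offset + 1;      // a poll advances the position
            delivered.add(offset);
            if (!handlerSucceeds.test(record)) {
                position = offset;      // seek back to the failed offset
            }
        }
        return delivered;
    }
}
```

If the record at offset 1 fails twice before succeeding, the handler sees offsets 0, 1, 1, 1, 2: the failed record is redelivered until it succeeds, and offset 2 is never processed ahead of it.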
Configuration is through broadleaf.messaging.blocking.retry.cloud.stream properties. The documentation recommends seek-to-current-offset for extended backoff durations with the Kafka binder.
Inventory adjustments need this ordering. A failed inventory decrement for a sold item must be completed before processing the next update for that SKU. Otherwise, race conditions enable overselling.
Cart-to-order conversion works similarly. The cart gets locked, validated, and converted before other operations proceed. Non-blocking retry could allow cart modifications during conversion retry, creating an irreconcilable state.
Throughput costs are real. One slow message blocks subsequent messages in the partition. Slow payment gateway responses delay all orders in that partition. For workflows where ordering determines correctness, this tradeoff is necessary.
Network timeouts create ambiguity. An order service calls the payment service to capture funds. The payment succeeds, the database writes, then the network fails before the response returns.
From the order service perspective, the operation failed. Retrying after a successful payment charges the customer twice. Not retrying after a failed payment loses the sale. The timeout doesn't indicate which occurred.
Idempotent processing ensures that handling a message multiple times produces the same result as handling it once. Broadleaf's IdempotentMessageConsumptionService creates lock records using unique identifiers: entity ID plus operation type, like "order-12345-created".
First run: check for lock (none exists), create one marked "processing", execute work, update to "complete". Retry: check for lock (exists, marked "complete"), skip processing, acknowledge message. Identical outcome whether processed once or multiple times.
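The check, lock, execute, complete sequence above can be sketched with an in-memory map. This is not the IdempotentMessageConsumptionService API, and the class and method names are hypothetical. Two behaviors here are assumptions of the sketch rather than anything the description above states: a lock still marked as processing also causes a skip, and a failed run releases its lock so a later retry can proceed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class IdempotentConsumerSketch {

    enum Status { PROCESSING, COMPLETE }

    private final ConcurrentMap<String, Status> locks = new ConcurrentHashMap<>();

    /**
     * Runs the work at most once per key. Returns true if the work ran,
     * false if an existing lock caused it to be skipped.
     */
    boolean consume(String idempotencyKey, Runnable work) {
        // putIfAbsent is atomic: only one caller can create the lock.
        Status existing = locks.putIfAbsent(idempotencyKey, Status.PROCESSING);
        if (existing != null) {
            return false; // lock exists: skip and let the caller acknowledge
        }
        try {
            work.run();
            locks.put(idempotencyKey, Status.COMPLETE);
            return true;
        } catch (RuntimeException e) {
            // Assumption: release the lock on failure so a retry can run.
            locks.remove(idempotencyKey);
            throw e;
        }
    }
}
```

Calling `consume("order-12345-created", work)` twice runs the work once. The second call returns false without side effects. A production version would keep the locks in shared, durable storage rather than in one JVM's memory.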
Lock takeover handles catastrophic failures. A JVM crash mid-processing leaves the lock in "processing" status with no active work. The system compares the lock's creation timestamp against a configurable stagnation threshold. Stagnant locks get released, allowing retry attempts to proceed with fresh locks.
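The stagnation comparison itself is small. A sketch, with hypothetical names and the threshold passed in as a parameter:

```java
import java.time.Duration;
import java.time.Instant;

class LockStagnationCheck {

    /**
     * A lock still marked "processing" is treated as abandoned once its age
     * exceeds the configured threshold.
     */
    static boolean isStagnant(Instant lockCreatedAt, Instant now, Duration threshold) {
        Duration age = Duration.between(lockCreatedAt, now);
        return age.compareTo(threshold) > 0;
    }
}
```

The threshold should sit comfortably above the slowest legitimate processing time. If it is too short, a lock held by a worker that is still running gets taken over and the message is processed twice.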
Idempotency keys need careful design. Orders: order ID plus operation. Payments: gateway transaction ID. Inventory: adjustment ID plus SKU. Notifications: customer ID, notification type, plus time window.
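These key shapes can be written as small builder functions. The prefixes and separators below are illustrative choices, not a Broadleaf convention. The time window for notifications is modeled as a fixed-size bucket of the send time, which is one possible interpretation.

```java
import java.time.Duration;
import java.time.Instant;

class IdempotencyKeys {

    static String order(String orderId, String operation) {
        return "order-" + orderId + "-" + operation;
    }

    static String payment(String gatewayTransactionId) {
        return "payment-" + gatewayTransactionId;
    }

    static String inventory(String adjustmentId, String sku) {
        return "inventory-" + adjustmentId + "-" + sku;
    }

    /** Notifications of the same type to the same customer share a key within one window. */
    static String notification(String customerId, String type, Instant sentAt, Duration window) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        // Integer division maps every instant in the same window to one bucket.
        long bucket = sentAt.toEpochMilli() / window.toMillis();
        return "notification-" + customerId + "-" + type + "-" + bucket;
    }
}
```

`IdempotencyKeys.order("12345", "created")` yields "order-12345-created". Two shipping notifications sent to the same customer within the same one-hour window produce the same key, so the second is treated as a duplicate.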
Performance costs exist. Every message consumption checks for existing locks. High-volume workflows need performant lock storage. For financial operations, the cost of duplicate processing outweighs the latency of a lock check.
IdempotentMessageConsumptionService applies to workflows needing exactly-once processing. Operations that are naturally idempotent or where duplicate processing is harmless can skip this overhead.
Some messages won't succeed. Payment method declined. Non-existent SKU. Malformed email address. After exhausting retries, these messages need a destination other than an infinite retry loop.
Dead Letter Queues capture permanently failed messages for inspection rather than discarding them or blocking the stream.
DLQ message urgency varies. Failed payment captures represent uncollected revenue. Failed order creations mean customers have checkout proof but no order record. These require immediate attention.
Failed marketing emails can wait. Failed analytics events get logged. Failed cache invalidations resolve when the cache expires. The DLQ strategy should match business impact.
Configuration typically involves multiple retry tiers with different backoff intervals before DLQ routing. Quick retries catch transient network issues. Longer waits handle brief outages. Extended backoffs accommodate maintenance windows.
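A tiered schedule of this kind can be modeled as an ordered list of backoff durations, with the DLQ as the destination once the list is exhausted. The sketch below uses hypothetical names and does not reflect Broadleaf's configuration properties. Any durations passed in are illustrative.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

class RetryTiers {

    private final List<Duration> backoffs;

    RetryTiers(Duration... backoffs) {
        this.backoffs = Arrays.asList(backoffs.clone());
    }

    /** True once every tier has been used and the message has failed again. */
    boolean shouldDeadLetter(int failedAttempts) {
        return failedAttempts > backoffs.size();
    }

    /** Wait time before the next attempt, given how many attempts have failed so far. */
    Duration backoffAfter(int failedAttempts) {
        if (failedAttempts < 1 || shouldDeadLetter(failedAttempts)) {
            throw new IllegalArgumentException("no retry tier for attempt " + failedAttempts);
        }
        return backoffs.get(failedAttempts - 1);
    }
}
```

With tiers of 1 second, 30 seconds, and 10 minutes, the first failure waits 1 second, the third waits 10 minutes, and a fourth failure routes the message to the DLQ.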
Monitoring is critical. Healthy systems have nearly empty DLQs. Messages land there only for exceptional circumstances. Growing DLQs indicate systemic problems: degraded payment gateways, deployment bugs, and integration issues. The DLQ functions as a safety net and early warning system.
Order fulfillment: blocking retry maintains sequence correctness.
Payment operations: blocking retry with timeout configuration manages time-sensitive authorization windows while maintaining order.
Email notifications: non-blocking retry handles independent, high-volume operations that tolerate delays.
Inventory synchronization varies. Real-time checkout updates need blocking to prevent overselling. Bulk warehouse imports can use non-blocking when treating the batch as one logical unit.
Analytics events: non-blocking works for eventual consistency, where order rarely affects aggregates and volume demands throughput.
Design questions: Does message order affect correctness? Can downstream systems handle out-of-order delivery? What's acceptable latency? Is the operation naturally idempotent? What's the business impact of failure?
Broadleaf provides both blocking and non-blocking patterns with configurable defaults: non-blocking retry for prioritizing throughput, and blocking retry with seek-to-current-offset for maintaining sequence. IdempotentMessageConsumptionService supports workflows requiring exactly-once semantics.
This messaging resiliency support was added in release train versions 2.0.4/2.1.3.
Retry strategy affects reliability, consistency, and operational overhead. Payment gateways experience latency during high-traffic events. Network partitions temporarily isolate services. Downstream APIs hit rate limits. These situations occur in production eCommerce environments.