Circuit Breaker Pattern with Kafka: A Complete Guide - Part 3

by Roman Tsyupryk

Part 3: Challenges, Edge Cases, and Alternatives

The Challenges You'll Face

While the circuit breaker + Kafka pause pattern is powerful, it comes with real challenges. Understanding these upfront will save you from production surprises.


Challenge 1: Consumer Group Rebalancing

When you pause a Kafka consumer, you're interacting with Kafka's consumer group protocol. This can cause unexpected behavior.

The Problem

Timeline:
─────────────────────────────────────────────────────────▢

T0: Consumer A owns partitions 0,1,2
    Consumer B owns partitions 3,4,5
    Circuit is CLOSED

T1: Downstream fails, circuit OPENS
    Consumer A calls pause()
    Consumer B calls pause()

T2: Consumer A's processing blocks and it stops calling poll()
    Kafka decides Consumer A is dead
    Initiates rebalance

T3: Partitions 0,1,2 reassigned to Consumer B
    Consumer B has OPEN circuit too
    But it's a NEW consumer for these partitions

T4: Confusion about which offsets to resume from

Why This Happens

  • pause() stops fetching, but the consumer must still look alive to the group coordinator
  • Heartbeats run on a background thread, yet poll() must still be called within max.poll.interval.ms
  • If the processing thread is blocked and poll() isn't called in time, the coordinator declares the consumer dead
  • A rebalance is triggered (the poll-loop sketch below shows how to avoid this)
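
A minimal sketch of the idea (plain Kafka consumer API; running, circuitIsOpen(), handle() and the Order type are illustrative): keep calling poll() even while paused, so the member stays alive and max.poll.interval.ms is never exceeded.

// A paused consumer returns empty batches from poll(), but the call itself
// keeps the group membership healthy
while (running) {
    if (circuitIsOpen()) {
        consumer.pause(consumer.assignment());
    } else if (!consumer.paused().isEmpty()) {
        consumer.resume(consumer.paused());
    }

    ConsumerRecords<String, Order> records = consumer.poll(Duration.ofMillis(100));
    records.forEach(this::handle);
}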

Solutions

Solution 1: Static Membership (Kafka 2.3+)

Static membership preserves consumer identity across restarts:

spring:
  kafka:
    consumer:
      properties:
        group.instance.id: consumer-instance-1  # Unique per instance
        session.timeout.ms: 300000              # 5 minutes

Benefits:

  • Consumer can pause for extended periods
  • No rebalance during pause
  • Identity preserved

Solution 2: Cooperative Sticky Assignor (Kafka 2.4+)

Reduces rebalancing impact:

spring:
  kafka:
    consumer:
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Benefits:

  • Incremental rebalancing
  • Only affected partitions move
  • Less disruption

Solution 3: Increase Session Timeout

Give more time before Kafka considers consumer dead:

spring:
  kafka:
    consumer:
      properties:
        session.timeout.ms: 60000      # 60 seconds (client default: 10s before Kafka 3.0, 45s since)
        heartbeat.interval.ms: 20000   # Keep at roughly 1/3 of the session timeout
        max.poll.interval.ms: 600000   # Allow longer gaps between poll() calls (default: 5 minutes)

Challenge 2: Distributed Circuit Breaker State

In production, you have multiple consumer instances. Each has its own circuit breaker. They can have different states.

The Problem

┌─────────────────────────────────────────────────────────┐
│                     CONSUMER GROUP                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Instance 1          Instance 2          Instance 3     │
│  ┌──────────┐        ┌──────────┐        ┌──────────┐   │
│  │ Circuit: │        │ Circuit: │        │ Circuit: │   │
│  │  OPEN    │        │  CLOSED  │        │ HALF-OPEN│   │
│  │          │        │          │        │          │   │
│  │ Paused   │        │ Running  │        │ Testing  │   │
│  └──────────┘        └──────────┘        └──────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘

Problem: Inconsistent behavior across instances

Why This Happens

  • Each instance experiences failures independently
  • Network conditions vary
  • Some instances may hit the failing endpoint first
  • Circuit breakers don't share state by default

Solutions

Option 1: Accept Eventual Consistency

This is often the pragmatic choice:

  • Each instance protects itself
  • Eventually all circuits open
  • Slight inconsistency is acceptable
  • Simplest to implement

Option 2: Shared State with Redis

Store circuit state in Redis:

public class DistributedCircuitBreaker {

    private final RedisTemplate<String, String> redis;
    private final String circuitKey = "circuit:external-api";

    public DistributedCircuitBreaker(RedisTemplate<String, String> redis) {
        this.redis = redis;
    }

    public boolean isOpen() {
        String state = redis.opsForValue().get(circuitKey);
        return "OPEN".equals(state);
    }

    public void open() {
        // The TTL acts as the wait duration: if no instance re-opens the
        // circuit, the key expires and the circuit is treated as closed again
        redis.opsForValue().set(circuitKey, "OPEN",
            Duration.ofSeconds(30));  // Auto-expire
    }

    public void close() {
        redis.delete(circuitKey);
    }
}
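
A hypothetical usage sketch: each instance keeps its local breaker and consults the shared key as well (pauseConsumer(), processRecord() and the breaker instances are illustrative names):

// Pause if either the local breaker or the shared Redis state is OPEN
if (localCircuitBreaker.getState() == CircuitBreaker.State.OPEN
        || distributedCircuitBreaker.isOpen()) {
    pauseConsumer();   // pause this instance's Kafka consumer
    return;            // skip processing until the circuit clears
}
processRecord(record);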

Trade-offs:

  • Adds Redis dependency
  • Adds latency to every check
  • Redis itself can fail

Option 3: Leader-Based Coordination

One instance is the "leader" that makes circuit decisions:

// Using Curator for ZooKeeper-based leader election
LeaderLatch leaderLatch = new LeaderLatch(client, "/circuit-leader");
leaderLatch.start();

// Checked periodically: leadership is acquired asynchronously after start()
if (leaderLatch.hasLeadership()) {
    // This instance decides circuit state
    broadcastCircuitState();
}

Trade-offs:

  • Complex setup
  • Leader election overhead
  • Single point of decision

Option 4: Gossip Protocol

Instances share circuit state peer-to-peer:

  • Eventually consistent
  • No central coordinator
  • More complex implementation

Challenge 3: The Polled Messages Problem

We touched on this in Part 2, but it deserves deeper exploration.

The Detailed Problem

Kafka Consumer Poll Cycle:
─────────────────────────────────────────────────────────▢

1. consumer.poll(Duration.ofMillis(100))
   └── Returns batch of 500 messages

2. Start processing message 1
   └── Call external API
   └── API fails
   └── Circuit breaker records failure

3. After 5 failures, circuit OPENS
   └── Call consumer.pause()

4. BUT: Messages 6-500 are still in memory!
   └── They continue to be passed to your handler
   └── Each one fails
   └── You can't "un-poll" them

Comprehensive Solution

The batch listener below checks the circuit before every record and uses seek() to rewind when it trips. It assumes a listener container configured for batch consumption with manual acknowledgment:
@KafkaListener(id = "myListener", topics = "orders")
public void consume(
        List<ConsumerRecord<String, Order>> records,
        Acknowledgment ack,
        Consumer<?, ?> consumer) {

    int successCount = 0;

    for (ConsumerRecord<String, Order> record : records) {
        // Check circuit BEFORE each message
        if (circuitBreaker.getState() == State.OPEN) {
            log.info("Circuit open, rejecting remaining {} messages",
                records.size() - successCount);

            // Seek back to the first unprocessed message (if the batch spans
            // several partitions, seek each of them back in the same way)
            consumer.seek(
                new TopicPartition(record.topic(), record.partition()),
                record.offset());

            // Pause immediately
            consumer.pause(consumer.assignment());

            // Don't acknowledge anything
            break;
        }

        try {
            // Run the call through the breaker so failures update its metrics
            circuitBreaker.executeRunnable(() -> processOrder(record.value()));
            successCount++;
        } catch (Exception e) {
            // Seek back to the failed record and stop
            consumer.seek(
                new TopicPartition(record.topic(), record.partition()),
                record.offset());
            break;
        }
    }

    // Acknowledge only if the whole batch succeeded; otherwise nothing is
    // committed and the seek above makes the next poll redeliver the rest
    if (successCount == records.size()) {
        ack.acknowledge();
    }
}

Edge Case 1: Flapping Circuit

The circuit rapidly opens and closes, causing instability.

The Scenario

T0: Service at 51% error rate
T1: Circuit OPENS (threshold: 50%)
T2: Wait duration expires, HALF-OPEN
T3: Test request succeeds (service temporarily OK)
T4: Circuit CLOSES
T5: Resume processing
T6: Error rate back to 51%
T7: Circuit OPENS
... repeat forever

Solutions

Increase Success Threshold for Closing

resilience4j:
  circuitbreaker:
    instances:
      myBreaker:
        permitted-number-of-calls-in-half-open-state: 10
        # Require more successes before closing

Implement Exponential Backoff for Wait Duration

public class AdaptiveCircuitBreaker {
    private int consecutiveOpens = 0;
    private Duration baseWaitDuration = Duration.ofSeconds(30);

    public Duration getWaitDuration() {
        // Double wait time for each consecutive open
        return baseWaitDuration.multipliedBy(
            (long) Math.pow(2, consecutiveOpens));
    }

    public void onOpen() {
        consecutiveOpens++;
    }

    public void onClose() {
        consecutiveOpens = 0;  // Reset on successful close
    }
}

Add Hysteresis

Different thresholds for opening vs closing:

Open threshold:  50% failure rate
Close threshold: 20% failure rate (must be MUCH better to close)
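
Resilience4j's standard configuration uses a single failure-rate threshold, so hysteresis usually means a small wrapper of your own. A minimal sketch, assuming you measure the failure rate over your own sliding window:

// Custom hysteresis sketch (not a built-in Resilience4j feature):
// open at a high failure rate, close only once the rate is much lower
public class HysteresisCircuit {

    private static final double OPEN_THRESHOLD  = 0.50;  // open at >= 50% failures
    private static final double CLOSE_THRESHOLD = 0.20;  // close only at <= 20%

    private boolean open = false;

    // Call with the failure rate measured over the most recent window
    public synchronized void onWindowMeasured(double failureRate) {
        if (!open && failureRate >= OPEN_THRESHOLD) {
            open = true;
        } else if (open && failureRate <= CLOSE_THRESHOLD) {
            open = false;
        }
    }

    public synchronized boolean isOpen() {
        return open;
    }
}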

Edge Case 2: Long-Running Message Processing

A message takes 30 minutes to process. Meanwhile, the circuit opens.

The Scenario

T0: Start processing large message (expected: 30 min)
T1: Other messages fail, circuit OPENS
T2: Consumer pauses for new messages
T30: Long message finishes... but fails at the end
T31: Circuit is OPEN, can't retry immediately
T32: Message offset not committed
T33: When circuit closes, message reprocesses
T34: Another 30-minute processing begins

Solutions

Implement Idempotency

Ensure reprocessing doesn't cause duplicate effects:

public void processLargeMessage(Message msg) {
    String messageId = msg.getId();

    // Check if already processed
    if (processedMessageStore.exists(messageId)) {
        log.info("Message {} already processed, skipping", messageId);
        return;
    }

    // Process
    Result result = doExpensiveProcessing(msg);

    // Mark as processed (atomically with result)
    processedMessageStore.markProcessed(messageId, result);
}

Checkpoint Progress

For very long operations, checkpoint intermediate progress:

public void processLargeMessage(Message msg) {
    Checkpoint checkpoint = checkpointStore.get(msg.getId());

    if (checkpoint != null) {
        // Resume from checkpoint
        continueFrom(checkpoint);
    } else {
        // Start fresh
        startProcessing(msg);
    }
}
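
The counterpart is writing checkpoints as you go. A sketch of the processing side (Chunk, splitIntoChunks() and the checkpointStore API are illustrative):

private void startProcessing(Message msg) {
    for (Chunk chunk : splitIntoChunks(msg)) {
        processChunk(chunk);
        // Persist progress before moving on, so a redelivered message
        // resumes here instead of repeating the whole 30 minutes of work
        checkpointStore.save(msg.getId(), Checkpoint.after(chunk));
    }
    checkpointStore.delete(msg.getId());  // Finished: nothing left to resume
}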

Edge Case 3: Partial Topic Pause

Consumer subscribes to multiple topics, but only one topic's downstream fails.

The Scenario

Consumer subscribes to: orders, inventory, shipping

Only "orders" downstream (Payment API) is failing.

Single circuit breaker → ALL topics pause
Even healthy inventory and shipping processing stops

Solutions

Separate Consumers per Topic

@KafkaListener(id = "ordersListener", topics = "orders")
public void consumeOrders(Order order) {
    // Has its own circuit breaker
    ordersCircuitBreaker.executeRunnable(() -> processOrder(order));
}

@KafkaListener(id = "inventoryListener", topics = "inventory")
public void consumeInventory(InventoryUpdate update) {
    // Has its own circuit breaker
    inventoryCircuitBreaker.executeRunnable(() -> processInventory(update));
}

Topic-Specific Pause

Kafka allows pausing specific topic-partitions:

// Only pause orders topic
consumer.pause(
    consumer.assignment().stream()
        .filter(tp -> tp.topic().equals("orders"))
        .collect(Collectors.toList())
);

// inventory and shipping continue

Edge Case 4: Exactly-Once Semantics Interaction

Using Kafka transactions with circuit breakers requires careful handling.

The Scenario

T0: Begin transaction
T1: Read message
T2: Process and produce to output topic
T3: About to commit
T4: Circuit opens mid-commit
T5: Transaction state unclear

Solution: Transaction-Aware Circuit Breaker

public <K, V> void processWithTransaction(ConsumerRecord<K, V> record) {
    // Check circuit BEFORE starting transaction
    if (circuitBreaker.getState() == State.OPEN) {
        throw new CircuitOpenException();
    }

    try {
        producer.beginTransaction();

        Result result = circuitBreaker.executeSupplier(
            () -> callExternalApi(record.value()));

        producer.send(new ProducerRecord<>("output", result));

        // Send offsets within transaction
        producer.sendOffsetsToTransaction(
            Map.of(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1)
            ),
            consumerGroupId
        );

        producer.commitTransaction();

    } catch (Exception e) {
        producer.abortTransaction();
        throw e;
    }
}

Alternative Approaches

The circuit breaker + pause pattern isn't always the right choice. Here are alternatives.

Alternative 1: Dead Letter Queue (DLQ)

Main Topic ──▶ Consumer ──▶ Process ──▶ Success
                              │
                              ▼ Failure (after retries)
                            DLQ Topic
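
Spring Kafka (2.8+) ships this pattern via DeadLetterPublishingRecoverer. A minimal sketch, assuming a KafkaTemplate bean is available (retry counts and intervals are illustrative):

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // When retries are exhausted, publish the failed record to <topic>.DLT
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    // 1 initial attempt + 2 retries, 1 second apart, before giving up
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
}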

When to Use DLQ:

  • Poison pill messages (malformed data)
  • Non-retryable errors (validation failures)
  • Need to continue processing other messages
  • Message order is not critical

When NOT to Use DLQ:

  • System-wide outages (DLQ fills up quickly)
  • All messages would fail (need to replay entire DLQ)

Comparison:

| Aspect                 | Circuit Breaker | DLQ                    |
|------------------------|-----------------|------------------------|
| Message order          | Preserved       | Broken                 |
| System outage handling | Excellent       | Poor                   |
| Bad data handling      | Poor            | Excellent              |
| Operational complexity | Lower           | Higher (replay needed) |

Alternative 2: Retry Topics (Non-Blocking)

main-topic
    │
    ├──▶ Success ──▶ Done
    │
    └──▶ Failure ──▶ retry-topic-1 (1 min delay)
                          │
                          ├──▶ Success ──▶ Done
                          │
                          └──▶ Failure ──▶ retry-topic-2 (10 min delay)
                                               │
                                               └──▶ Failure ──▶ DLQ

When to Use:

  • Mixed transient and permanent failures
  • Can't pause all processing
  • Acceptable to process messages out of order

Implementation:

@RetryableTopic(
    attempts = 3,
    backoff = @Backoff(delay = 60000, multiplier = 2),
    dltTopicSuffix = ".DLQ"
)
@KafkaListener(topics = "orders")
public void consume(Order order) {
    orderProcessor.process(order);  // Throws on failure
}

Alternative 3: Bulkhead Pattern

Isolate failures to specific resource pools:

┌─────────────────────────────────────────────────────────┐
│                      Service                            │
│                                                         │
│  ┌────────────────┐  ┌────────────────┐                 │
│  │  Thread Pool A │  │  Thread Pool B │                 │
│  │                │  │                │                 │
│  │  Payment API   │  │  Shipping API  │                 │
│  │  Threads: 10   │  │  Threads: 10   │                 │
│  │                │  │                │                 │
│  │  ⚠ Exhausted   │  │  ✓ Healthy     │                 │
│  └────────────────┘  └────────────────┘                 │
│                                                         │
│  Payment failures don't affect Shipping                 │
└─────────────────────────────────────────────────────────┘

Best Used With: Circuit breaker (combine both patterns)
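
With Resilience4j the two patterns compose directly. A sketch (paymentClient, PaymentResult, order and the circuit breaker instance are illustrative):

// The bulkhead caps concurrency toward the Payment API; the circuit breaker
// reacts when those calls start failing
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(10)                   // at most 10 in-flight Payment API calls
    .maxWaitDuration(Duration.ofMillis(500))  // wait briefly for a slot, then fail fast
    .build();
Bulkhead paymentBulkhead = Bulkhead.of("paymentApi", bulkheadConfig);

Supplier<PaymentResult> guarded = Decorators
    .ofSupplier(() -> paymentClient.charge(order))
    .withBulkhead(paymentBulkhead)
    .withCircuitBreaker(paymentCircuitBreaker)
    .decorate();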

Alternative 4: Rate Limiting / Backpressure

Instead of stopping completely, slow down:

RateLimiter rateLimiter = RateLimiter.create(100);  // Guava RateLimiter: 100 permits/sec

public void consume(Order order) {
    rateLimiter.acquire();  // Blocks if rate exceeded
    orderProcessor.process(order);
}

When to Use:

  • API rate limits (429 responses)
  • Gradual degradation preferred
  • Downstream can handle reduced rate

Decision Matrix: Which Pattern to Use

| Scenario                | Circuit Breaker + Pause | DLQ  | Retry Topics | Rate Limiter |
|-------------------------|-------------------------|------|--------------|--------------|
| System-wide outage      | Best                    | Poor | Poor         | Poor         |
| Individual bad messages | Poor                    | Best | Good         | Poor         |
| Rate limiting (429s)    | Good                    | Poor | Poor         | Best         |
| Mixed failures          | Good                    | Good | Best         | Poor         |
| Order critical          | Best                    | Poor | Poor         | Good         |
| Latency critical        | Poor                    | Good | Good         | Good         |

The Honest Trade-offs

What You Gain

  1. Cascading failure prevention - Protect your system
  2. Message preservation - Nothing is lost
  3. Order preservation - Process in sequence
  4. Automatic recovery - No manual intervention

What You Pay

  1. Processing delays - Backlog builds during pause
  2. Complexity - More moving parts
  3. Potential rebalancing issues - Kafka consumer coordination
  4. All-or-nothing - Can't process some messages while paused

When This Pattern Shines

  • Downstream service has infrequent but extended outages
  • Message order matters
  • You can accept processing delays
  • You have few downstream dependencies

When to Choose Alternatives

  • Frequent individual message failures → DLQ
  • Rate limiting issues → Rate limiter
  • Must continue processing other messages → Retry topics
  • Mixed failure patterns → Combine approaches

What's Next

In Part 4, we'll cover:

  • Configuration and tuning (with real numbers)
  • Testing strategies and chaos engineering
  • Best practices from production experience
  • Anti-patterns to avoid
  • Complete decision framework
