Circuit Breaker Pattern with Kafka: A Complete Guide - Part 1

>by Roman Tsyupryk

Part 1: Understanding the Fundamentals


Introduction

Picture this: Your microservice consumes events from Kafka and processes them by calling several external APIs. Everything works fine until one of those APIs starts failing. Your consumer keeps pulling messages, keeps failing, and keeps retrying. Threads pile up. Memory grows. Soon, your entire service is overwhelmed, and the problem cascades to other services.

This is exactly the problem the circuit breaker pattern solves. When combined with Kafka's consumer pause/resume capabilities, it creates a resilient system that gracefully handles downstream failures while preserving messages for later processing.

In this series, we'll explore how to implement this pattern effectively, understand its trade-offs, and learn from real-world scenarios.


The Problem: Why Traditional Retry Isn't Enough

When a downstream service fails, the natural instinct is to retry. But retries alone create several problems:

┌─────────┐     ┌─────────────┐     ┌─────────────┐
│  Kafka  │────▶│ Microservice│────▶│ External API│ FAILING
└─────────┘     └─────────────┘     └─────────────┘
                      │
                      ▼
              Keeps consuming events
              Keeps failing
              Keeps retrying
              Resources exhausted
              Cascading failures begin

The Retry Storm Problem

Imagine 1,000 consumers all experiencing the same downstream failure:

  • Each consumer retries 3 times
  • That's 3,000 retry requests, on top of the 1,000 original calls, hitting an already struggling service
  • The service crashes completely
  • Recovery becomes impossible
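
To make the arithmetic concrete, here is a small sketch of the load a retry storm generates. The function name `total_requests` is made up for illustration, and the model assumes every consumer fails and uses all of its retries:

```python
def total_requests(consumers: int, retries_per_consumer: int) -> int:
    """Calls sent downstream when every consumer fails and retries.

    Each consumer makes 1 original call plus `retries_per_consumer` retries.
    """
    return consumers * (1 + retries_per_consumer)


consumers = 1_000
retries = 3

retry_only = consumers * retries            # retries alone
total = total_requests(consumers, retries)  # original calls plus retries

print(f"retries alone: {retry_only}")             # retries alone: 3000
print(f"original calls + retries: {total}")       # original calls + retries: 4000
```

The struggling service sees several times its normal load at exactly the moment it can least handle it.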

Thread Pool Exhaustion

While waiting for timeouts on failing calls:

  • Threads remain blocked
  • New messages can't be processed
  • Memory usage climbs
  • Eventually, the service becomes unresponsive
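
A rough back-of-the-envelope model shows how quickly blocked threads cap throughput. The pool size, latency, and timeout below are illustrative assumptions, not measurements from any real system:

```python
def max_throughput(threads: int, call_millis: int) -> float:
    """Upper bound on messages per second when each message holds
    one thread for `call_millis` milliseconds."""
    return threads * 1000 / call_millis


# Assumed numbers: 20 worker threads, 50 ms per healthy call,
# and a 30 s client timeout when the downstream API hangs.
healthy = max_throughput(threads=20, call_millis=50)
failing = max_throughput(threads=20, call_millis=30_000)

print(f"healthy: {healthy:.0f} msg/s")   # healthy: 400 msg/s
print(f"failing: {failing:.2f} msg/s")   # failing: 0.67 msg/s
```

Under these assumptions the same pool goes from hundreds of messages per second to less than one, even though no thread is doing useful work.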

The Cascading Effect

When Service A waits on failing Service B:

  • Service A's threads get blocked
  • Services calling A start timing out
  • The failure spreads through your entire system

The Solution: Circuit Breaker + Kafka Pause

The circuit breaker pattern, originally described by Michael Nygard in "Release It!" (2007), works like an electrical circuit breaker. When too many failures occur, it "trips" and stops sending requests to the failing service.

Combined with Kafka's pause/resume mechanism, this creates a pattern where:

  1. Circuit breaker monitors downstream API health
  2. When failures exceed threshold, circuit "opens"
  3. Kafka consumer pauses - no more messages fetched
  4. Messages safely accumulate in Kafka
  5. Circuit periodically tests if service recovered
  6. When recovered, consumer resumes and processes backlog

┌──────────────────────────────────────────────────────────┐
│                    MICROSERVICE                          │
│                                                          │
│  ┌────────────┐    ┌─────────────────┐    ┌───────────┐  │
│  │   Kafka    │    │ Circuit Breaker │    │ External  │  │
│  │  Consumer  │───▶│                 │───▶│   APIs    │  │
│  └────────────┘    └─────────────────┘    └───────────┘  │
│        ▲                   │                             │
│        │                   │                             │
│        └───────────────────┘                             │
│         pause/resume signal                              │
└──────────────────────────────────────────────────────────┘
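
The pause/resume signal in this flow can be sketched in a few lines. This is a minimal illustration, not a real Kafka client: `FakeConsumer` and `CountingBreaker` are hypothetical stand-ins, and the breaker is deliberately reduced to a failure counter with two callbacks:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer; only tracks paused state."""
    paused: bool = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


@dataclass
class CountingBreaker:
    """Tiny breaker: opens after `threshold` consecutive failures."""
    threshold: int
    on_open: Callable[[], None]
    on_close: Callable[[], None]
    failures: int = 0
    is_open: bool = False

    def record_failure(self) -> None:
        self.failures += 1
        if not self.is_open and self.failures >= self.threshold:
            self.is_open = True
            self.on_open()       # step 3: pause the consumer

    def record_recovery(self) -> None:
        # Called when a probe of the downstream service succeeds (step 5).
        self.failures = 0
        if self.is_open:
            self.is_open = False
            self.on_close()      # step 6: resume and drain the backlog


consumer = FakeConsumer()
breaker = CountingBreaker(threshold=3, on_open=consumer.pause,
                          on_close=consumer.resume)

for _ in range(3):               # three failed downstream calls in a row
    breaker.record_failure()
print(consumer.paused)           # True

breaker.record_recovery()
print(consumer.paused)           # False
```

The point of the callbacks is that the breaker never needs to know about Kafka; it only announces state changes, and the consumer reacts to them.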

The Three States of a Circuit Breaker

A circuit breaker operates in three distinct states:

1. CLOSED State (Normal Operation)

  • All requests flow through normally
  • Circuit breaker monitors success/failure rates
  • Kafka consumer is running
  • This is the healthy state

2. OPEN State (Failure Detected)

  • All requests are immediately rejected
  • No calls made to downstream service
  • Kafka consumer is paused
  • System is protected from cascade

3. HALF-OPEN State (Testing Recovery)

  • A limited number of test requests are allowed
  • If tests succeed, circuit closes
  • If tests fail, circuit reopens
  • Kafka consumer typically remains paused during testing

            ┌─────────────────────────────────────┐
            │                                     │
            ▼                                     │
    ┌──────────────┐                              │
    │    CLOSED    │◀─────────────────────────────┤
    │              │                              │
    │ Normal flow  │      Recovery                │
    │ Kafka: RUN   │      confirmed               │
    └──────────────┘                              │
           │                                      │
           │ Failures                             │
           │ exceed                               │
           │ threshold                            │
           ▼                                      │
    ┌──────────────┐                              │
    │     OPEN     │                              │
    │              │                              │
    │ Requests     │                              │
    │ rejected     │                              │
    │ Kafka: PAUSE │                              │
    └──────────────┘                              │
           │                                      │
           │ Timeout                              │
           │ expires                              │
           ▼                                      │
    ┌──────────────┐                              │
    │  HALF-OPEN   │──────────────────────────────┘
    │              │     Test succeeds
    │ Test         │
    │ requests     │
    │ allowed      │
    └──────────────┘
           │
           │ Test fails
           │
           └──────▶ Back to OPEN
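
The three states and their transitions can be expressed as a small state machine. The sketch below is a simplified illustration rather than how any particular library implements it: the class and parameter names are invented, it counts consecutive failures instead of a failure rate, and it allows a single test call in HALF-OPEN (real libraries usually make that number configurable). The clock is injectable so the timeout can be exercised without sleeping:

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, wait_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.wait_seconds = wait_seconds
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable[[], object]) -> object:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CircuitOpenError("circuit is open")
            self.state = State.HALF_OPEN      # timeout expired: allow a test call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED             # HALF_OPEN -> CLOSED, or stay CLOSED

    def _on_failure(self) -> None:
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN           # trip, or re-open after a failed test
            self.opened_at = self.clock()


def failing() -> None:
    raise ConnectionError("downstream unavailable")


now = [0.0]                                   # fake clock, advanced by hand
breaker = CircuitBreaker(failure_threshold=2, wait_seconds=10.0,
                         clock=lambda: now[0])

for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
print(breaker.state)                          # State.OPEN

now[0] = 11.0                                 # wait timeout has expired
print(breaker.call(lambda: "ok"))             # ok
print(breaker.state)                          # State.CLOSED
```

While the circuit is OPEN and the timeout has not expired, `call` raises `CircuitOpenError` immediately, which is the "fail fast" behavior that keeps threads from blocking on a dead dependency.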

How Kafka Consumer Pause/Resume Works

Kafka clients provide pause and resume methods that control message fetching:

When Paused

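  • Consumer stops fetching new messages from the broker
  • Messages already handed to the application by an earlier poll may still be processed
  • Consumer stays in the group: heartbeats continue, provided the application keeps polling (in the Java client, within max.poll.interval.ms)
  • Partition assignments remain unchanged
  • Messages accumulate on the Kafka broker, subject to the topic's retention settings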

When Resumed

  • Consumer begins fetching messages again
  • Processing continues from where it left off
  • Backlog is processed in order within each partition

Key Insight: Messages Are Not Lost

Unlike a Dead Letter Queue approach where failed messages move to a separate topic, pausing keeps messages in the original topic. When the service recovers, messages process in their original per-partition order with no replay mechanism needed, as long as the outage is shorter than the topic's retention period.

PAUSED STATE:
┌─────────┐    ┌──────────┐
│ Kafka   │    │ Consumer │   No fetching
│ Broker  │ ✗  │ paused   │   Messages retained
│         │    │          │   Offset unchanged
│ msgs    │    │          │   Heartbeat continues
│ waiting │    │          │
└─────────┘    └──────────┘

RESUMED STATE:
┌─────────┐    ┌──────────┐    ┌─────────────┐
│ Kafka   │───▶│ Consumer │───▶│ Process msg │
│ Broker  │    │ fetches  │    │ Commit      │
└─────────┘    └──────────┘    └─────────────┘
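
The paused and resumed behavior can be modeled with a toy consumer over an in-memory list. `InMemoryPartitionConsumer` is a hypothetical class written for this illustration, not a Kafka client API; it models a single partition and ignores commits, rebalances, and retention:

```python
from typing import List


class InMemoryPartitionConsumer:
    """Hypothetical single-partition consumer over a Python list."""

    def __init__(self, log: List[str]) -> None:
        self.log = log           # stands in for the broker's partition log
        self.offset = 0          # next offset to read
        self.paused = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def poll(self, max_records: int = 10) -> List[str]:
        if self.paused:
            return []            # nothing fetched, offset untouched
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = ["m0", "m1"]
consumer = InMemoryPartitionConsumer(log)
processed = consumer.poll()             # reads m0 and m1

consumer.pause()
log.extend(["m2", "m3", "m4"])          # producers keep writing during the outage
during_pause = consumer.poll()          # keep polling while paused
offset_while_paused = consumer.offset   # position has not moved

consumer.resume()
processed += consumer.poll()

print(during_pause)                     # []
print(offset_while_paused)              # 2
print(processed)                        # ['m0', 'm1', 'm2', 'm3', 'm4']
```

Note that the loop keeps calling `poll` while paused. With real clients that is typically what keeps the consumer alive in its group while no records are being returned.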

Key Benefits of This Pattern

1. Prevents Cascading Failures

When one service fails, the circuit breaker isolates it. Other parts of your system continue functioning normally.

2. Preserves Message Order

Unlike DLQ or retry topics, messages stay in place. When recovery happens, they process in the order they arrived within each partition.

3. Automatic Recovery

The HALF-OPEN state automatically tests if the downstream service has recovered. No manual intervention required.

4. Resource Protection

By pausing consumption and failing fast, you prevent thread exhaustion, memory overflow, and CPU waste.

5. Visibility Into System Health

Circuit state changes provide clear signals about downstream health. When a circuit opens, you know something is wrong.


When to Use This Pattern

| Scenario                                   | Use Circuit Breaker + Pause? |
|--------------------------------------------|------------------------------|
| Downstream API outages                     | Yes                          |
| Database connectivity issues               | Yes                          |
| External service rate limiting             | Yes                          |
| Transient network failures                 | Yes, with retries first      |
| Data validation errors                     | No, use DLQ                  |
| Poison pill messages                       | No, use DLQ                  |
| Need to continue processing other messages | No, use retry topics         |

The Key Question

Ask yourself: "When the downstream service fails, should I stop processing ALL messages until it recovers?"

  • If yes → Circuit breaker + pause
  • If no → Consider DLQ or retry topics
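
One way to apply this question in code is to separate failures caused by the dependency from failures caused by the message itself. The sketch below is only an illustration: `Strategy` and `choose_strategy` are invented names, Python's built-in exception types stand in for whatever errors your client libraries raise, and sending unknown errors to the DLQ is an assumption you may want to reverse:

```python
import json
from enum import Enum


class Strategy(Enum):
    PAUSE = "circuit breaker + pause"
    DLQ = "dead letter queue"


def choose_strategy(error: Exception) -> Strategy:
    """Pick a handling strategy from the kind of failure observed."""
    # Dependency-level failure: every message would fail the same way.
    if isinstance(error, (ConnectionError, TimeoutError)):
        return Strategy.PAUSE
    # Message-level failure (bad payload, missing field): only this message fails.
    if isinstance(error, (ValueError, KeyError)):
        return Strategy.DLQ
    # Assumption for this sketch: unknown errors are treated as message-level.
    return Strategy.DLQ


print(choose_strategy(ConnectionError("API unreachable")).value)
# circuit breaker + pause

try:
    json.loads("{not valid json")       # a poison pill payload
except json.JSONDecodeError as exc:     # JSONDecodeError subclasses ValueError
    print(choose_strategy(exc).value)
# dead letter queue
```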

Available Libraries

The circuit breaker pattern is implemented in mature libraries for most major languages:

| Language | Library          | Status              |
|----------|------------------|---------------------|
| Java     | Resilience4j     | Active, recommended |
| Java     | Hystrix          | Maintenance mode    |
| .NET     | Polly            | Active, recommended |
| Node.js  | opossum          | Active              |
| Python   | pybreaker        | Active              |
| Go       | gobreaker (Sony) | Active              |
| Rust     | failsafe-rs      | Active              |

Service Mesh Alternative

If you're using Kubernetes with Istio, circuit breaking can be configured at the network level without code changes. However, this works better for synchronous HTTP calls than for Kafka consumer patterns.


What's Next

In Part 2, we'll dive into implementation details:

  • Step-by-step implementation with Resilience4j and Spring Kafka
  • Real-world scenarios with solutions
  • Handling the "already-polled messages" problem
  • Multiple circuit breakers for multiple downstream services

Quick Reference

Circuit Breaker States

| State     | Description                 | Kafka Consumer   |
|-----------|-----------------------------|------------------|
| CLOSED    | Normal operation            | Running          |
| OPEN      | Failures exceeded threshold | Paused           |
| HALF-OPEN | Testing recovery            | Paused (testing) |

State Transitions

| From      | To        | Trigger                    |
|-----------|-----------|----------------------------|
| CLOSED    | OPEN      | Failure threshold exceeded |
| OPEN      | HALF-OPEN | Wait timeout expired       |
| HALF-OPEN | CLOSED    | Test requests succeed      |
| HALF-OPEN | OPEN      | Test requests fail         |
