
Kafka at 200k DAU: what breaks and how to fix it

Nathaniel Lin · February 10, 2024 · 7 min read

We hit about 200k daily active users on our platform at peak. Most of the stack held up fine. Kafka was where we learned the most.

Here's what I wish I'd known before we got there.

Partition count matters more than you think

We started with 3 partitions on most topics. That was fine at low throughput. As DAU climbed, consumer lag started growing. The fix was obvious in retrospect: increase partition count to allow more parallel consumers.

But you can't decrease partition count once it's set, and repartitioning is painful. We ended up creating new topics with higher partition counts and doing a gradual cutover. Not the worst migration, but definitely avoidable.

Rule of thumb we settled on: target_throughput / single_consumer_throughput * 2 for partition count. Overestimate early.
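In code, the rule is trivial; the throughput numbers below are made up for illustration:

```typescript
// Sketch of the sizing rule above: throughput ratio, rounded up,
// with 2x headroom so you don't have to repartition as traffic grows.
function partitionCount(targetMsgsPerSec: number, perConsumerMsgsPerSec: number): number {
  return Math.ceil(targetMsgsPerSec / perConsumerMsgsPerSec) * 2;
}

// e.g. a 5,000 msg/s target where each consumer handles ~500 msg/s
partitionCount(5000, 500); // → 20 partitions
```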

Consumer group offsets during deployments

We had a fun incident where a deployment of consumer services reset offsets to earliest. A backlog of 4 million messages started processing all at once and took down our downstream database.

The root cause was a misconfigured auto.offset.reset=earliest that should have been latest for this particular consumer group. Combine that with a new consumer group ID being introduced accidentally during deployment and you have a recipe for a bad day.

We added offset monitoring as a first-class concern after that. If a consumer group's lag spikes above 10k within 5 minutes of a deployment, we page.
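A rough sketch of that lag check, using kafkajs-style admin calls. The `Admin` interface here is narrowed to just the two methods we use, and the group ID, topic, and wiring are illustrative rather than our real setup:

```typescript
// Hedged sketch: sum a consumer group's lag across partitions by comparing
// committed offsets against the topic's high watermarks. Response shapes
// mirror kafkajs's admin.fetchTopicOffsets / admin.fetchOffsets.
interface Admin {
  fetchTopicOffsets(topic: string): Promise<{ partition: number; offset: string }[]>;
  fetchOffsets(opts: { groupId: string; topics: string[] }): Promise<
    { topic: string; partitions: { partition: number; offset: string }[] }[]
  >;
}

async function consumerGroupLag(admin: Admin, groupId: string, topic: string): Promise<number> {
  const latest = await admin.fetchTopicOffsets(topic); // high watermark per partition
  const [committed] = await admin.fetchOffsets({ groupId, topics: [topic] });
  let lag = 0;
  for (const { partition, offset } of committed.partitions) {
    const end = latest.find((p) => p.partition === partition);
    if (end) lag += Number(end.offset) - Number(offset);
  }
  return lag;
}
```

We run something like this on a short schedule and compare the result against the 10k threshold in the minutes after each deploy.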

Dead letter queues save your sanity

If you don't have DLQ handling, poison pill messages will block your consumers forever. We had a message with malformed data that crashed the processor on every attempt. The consumer kept retrying, never advancing past that offset.

The fix:

async function processMessage(topic: string, message: KafkaMessage) {
  try {
    await handler(message);
  } catch (err) {
    // KafkaMessage carries no retry count of its own, so we track it in headers
    const retryCount = Number(message.headers?.retryCount?.toString() ?? 0);
    if (isRetryable(err) && retryCount < MAX_RETRIES) {
      await producer.send({
        topic: `${topic}.retry`,
        messages: [{
          key: message.key,
          value: message.value,
          headers: { ...message.headers, retryCount: String(retryCount + 1) },
        }],
      });
    } else {
      await producer.send({
        topic: `${topic}.dlq`,
        messages: [{
          key: message.key,
          value: message.value,
          headers: { ...message.headers, error: String(err) },
        }],
      });
    }
  }
}

With proper DLQ routing, bad messages get sidelined and you can inspect and reprocess them manually without blocking the pipeline.

Redis as a coordination layer

We used Redis alongside Kafka for a few things that Kafka isn't great at: deduplication, rate limiting per user, and low-latency counters.

For deduplication: before processing a Kafka message, we'd atomically set a key in Redis with a 24h TTL; if the key was already set, skip the message. This absorbed the duplicates that Kafka's at-least-once delivery guarantee produces, without polluting the DB.

const dedupeKey = `kafka:dedup:${topic}:${message.id}`;
// SET ... NX returns 'OK' if the key was newly set, null if it already existed
const isFirstDelivery = await redis.set(dedupeKey, '1', 'EX', 86400, 'NX');
if (!isFirstDelivery) return; // duplicate, skip it
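The per-user rate limiting mentioned above can be sketched the same way, as a fixed-window counter built on INCR + EXPIRE. The key scheme and limits here are illustrative, and `RateLimitRedis` is narrowed to the two commands used:

```typescript
// Minimal fixed-window rate limiter sketch using Redis INCR + EXPIRE.
// Key scheme and default limits are illustrative, not production values.
interface RateLimitRedis {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

async function allowRequest(
  redis: RateLimitRedis,
  userId: string,
  limit = 100,
  windowSec = 60,
): Promise<boolean> {
  // One counter per user per time window
  const window = Math.floor(Date.now() / 1000 / windowSec);
  const key = `rate:${userId}:${window}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, windowSec); // first hit starts the window TTL
  return count <= limit;
}
```

Fixed windows allow a brief burst at window boundaries; a sliding-window or token-bucket scheme is tighter if that matters for your workload.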

The combination of Kafka for durable queueing and Redis for fast coordination worked well for us. Don't try to make Kafka do everything.

Takeaway

Kafka is fantastic once you understand its model. The tricky part is that failures are often subtle and delayed — your pipeline looks healthy until suddenly it doesn't, and by then you're staring at a 10-hour consumer lag. Build observability in from the start.
