Kafka topic design: the 5 mistakes we see most often

Partitioning keys, retention, compaction, naming, and schema evolution. The design decisions that become expensive when wrong.

Anthra AI Team · Engineering Team · 6 min read
Table of contents
  1. Mistake 1: Choosing partition count by guessing
  2. Why it matters
  3. How to choose
  4. Mistake 2: Using bad partition keys (or none at all)
  5. Bad keys we've seen
  6. Good key patterns
  7. Mistake 3: Wrong retention and compaction settings
  8. The defaults are wrong for most workloads
  9. The right mental model
  10. Mistake 4: Poor topic naming
  11. The pattern we use
  12. Mistake 5: No schema governance
  13. The fix: explicit compatibility policy
  14. Operational checks every Kafka team should run
  15. Topic design review checklist
  16. Closing
  17. Related resources

Kafka's power is that it works. The trap is that it works even when your design is wrong — until one day it doesn't, and the fix involves rewriting consumers, backfilling partitions, or both.

These are the five topic-design mistakes we see most often in engagements, with the concrete patterns that avoid them.

Mistake 1: Choosing partition count by guessing

The most common mistake. A team creates a topic with 3 partitions because that's the Kafka documentation default, or with 100 because "we might need to scale." Both are wrong.

Why it matters

  • Too few partitions caps your consumer parallelism. If your topic has 3 partitions, at most 3 consumers in a group can process it simultaneously. Add a 4th consumer — it sits idle.
  • Too many partitions hurts in four ways: more broker metadata overhead, more open file handles, more leader elections during broker churn, and higher end-to-end latency from fanout.

How to choose

Work backwards from throughput targets:

  1. Measure (or estimate) the per-partition throughput your consumer can sustain. For a typical Java consumer doing light processing, that's ~10-50 MB/sec. For a heavy consumer with external calls, maybe 1-5 MB/sec.

  2. Take your peak expected throughput and divide. If you need 200 MB/sec and each consumer handles 20 MB/sec, you need at least 10 partitions.

  3. Multiply by 2-3× for headroom (traffic spikes, consumer slowdowns, rebalancing overhead).

  4. Round up to a number that divides cleanly by typical consumer group sizes.

For most topics, the right answer is between 6 and 30 partitions. Above 50, you should have a specific reason.
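
As a sketch of that arithmetic, here is the sizing worked through in code. Every number (200 MB/sec peak, 20 MB/sec per consumer, 2.5× headroom, group-size multiple of 6) is an illustrative assumption, not a measurement:

```java
// Rough partition-count sizing following the four steps above.
// All inputs are illustrative assumptions -- measure your own consumers.
public class PartitionSizing {
    public static void main(String[] args) {
        double peakThroughputMBps = 200.0;  // step 2: expected peak produce rate
        double perConsumerMBps = 20.0;      // step 1: measured per-partition consumer throughput
        double headroomFactor = 2.5;        // step 3: 2-3x buffer for spikes and rebalances
        int groupSizeMultiple = 6;          // step 4: typical consumer group size

        int minPartitions = (int) Math.ceil(peakThroughputMBps / perConsumerMBps);
        int withHeadroom = (int) Math.ceil(minPartitions * headroomFactor);
        int partitions = ((withHeadroom + groupSizeMultiple - 1) / groupSizeMultiple) * groupSizeMultiple;

        // Prints: min=10, with headroom=25, final=30
        System.out.printf("min=%d, with headroom=%d, final=%d%n", minPartitions, withHeadroom, partitions);
    }
}
```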

⚠️Partitions are hard to change

Increasing partitions breaks keyed ordering (keys may now hash to different partitions). Decreasing partitions requires recreating the topic and backfilling. Pick conservatively and revise deliberately.

Mistake 2: Using bad partition keys (or none at all)

Kafka routes messages to partitions by hashing the key. If your key is poorly chosen, you get hot partitions (one partition overwhelmed while others idle) or broken ordering guarantees.

Bad keys we've seen

  • null when ordering matters. Using no key means round-robin assignment — no ordering guarantee for related events. If all events for user X need to process in order, key them by user_id.
  • Timestamps as keys. Events sharing the same coarse timestamp all hash to one partition, so whichever partition owns "now" gets hammered and the hot spot simply rotates. Hot-spotting guaranteed.
  • Low-cardinality enums. Keying by country_code means all of India's events land on one partition. If India's your biggest market, that partition is your bottleneck.
  • High-cardinality but skewed keys. One "super user" generating 80% of events hammers one partition.

Good key patterns

  • Natural aggregate ID. User ID, order ID, device ID — whatever the events relate to.
  • Composite keys when you need stronger distribution. For example, (tenant_id, user_id) so load spreads even if one tenant is huge (sketched below).
  • Explicit repartitioning. If you have a naturally skewed key, add a random suffix to spread writes and re-establish ordering in a downstream step.
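
A minimal producer sketch of the composite-key pattern. The topic name, tenant and user values, and the ":" separator are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class CompositeKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Composite (tenant_id, user_id) key: per-user ordering, load spread within a big tenant.
            String key = "tenant-acme" + ":" + "user-42";
            producer.send(new ProducerRecord<>("billing.invoice.created", key, "{\"amount\": 99.0}"));
            producer.flush();
        }
    }
}
```

All events for the same user still hash to one partition, so per-user ordering holds, but a single large tenant no longer maps to a single partition.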

Mistake 3: Wrong retention and compaction settings

Every Kafka topic has two time-related knobs: how long to keep messages (retention.ms) and whether to compact (keep only the latest message per key). Teams routinely pick the wrong combination.

The defaults are wrong for most workloads

Kafka's default retention is 7 days. For most event streams this is fine — but:

  • Audit logs often need weeks or months of retention
  • Low-volume configuration topics should use log compaction, not time-based retention
  • High-volume firehoses often only need hours of retention — days of it wastes disk

The right mental model

Think of topics in three classes:

  1. Transient event streams (user clicks, page views, sensor readings)

    • Time-based retention: 3-14 days
    • No compaction
    • Replay consumers can reset to beginning of retention window if needed
  2. State-carrying topics (current user profile, current order status)

    • Log compaction enabled (cleanup.policy=compact)
    • Retain latest message per key forever
    • Used for materializing current-state tables
  3. Audit / compliance topics (financial transactions, access logs)

    • Long retention (30-365+ days)
    • Consider tiered storage (S3 backing) for cost
    • Often replicated to a secondary system for long-term analytics
💡Name topics for their class

Prefix topics so their class is obvious: events.page-views, state.user-profiles, audit.transactions. That makes retention policies easier to reason about and lets you apply defaults and tooling per class.
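
A sketch of what the three classes can look like when created with the Java AdminClient, reusing the class-prefixed names above. Partition counts, replication factor, and retention values are placeholders, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicClasses {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1. Transient event stream: time-based retention, no compaction.
            NewTopic events = new NewTopic("events.page-views", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // 2. State-carrying topic: compaction keeps the latest value per key.
            NewTopic state = new NewTopic("state.user-profiles", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            // 3. Audit topic: long time-based retention.
            NewTopic audit = new NewTopic("audit.transactions", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(365L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(events, state, audit)).all().get();
        }
    }
}
```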

Mistake 4: Poor topic naming

This sounds trivial. It's not. Bad topic names cause real pain when you have 200+ topics and three teams producing to them.

The pattern we use

Use a naming convention that encodes domain and purpose:

<domain>.<entity>.<event_or_state>

Examples:

  • billing.invoice.created
  • product.catalog.updated
  • analytics.pageview.events
  • identity.user.state

Rules that prevent drift:

  • lowercase only
  • no environment suffix in topic name (prod belongs in cluster/account separation)
  • avoid team names (teams change; domains persist)
  • differentiate event topics from state topics in naming
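
One lightweight way to enforce these rules is a name check in whatever tooling creates topics. A sketch, assuming a three-segment lowercase pattern; adapt the regex to your own domains:

```java
import java.util.regex.Pattern;

public class TopicNamePolicy {
    // <domain>.<entity>.<event_or_state>: lowercase, digits and hyphens allowed within segments.
    private static final Pattern TOPIC_NAME =
        Pattern.compile("^[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*$");

    public static boolean isValid(String name) {
        return TOPIC_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("billing.invoice.created"));  // true
        System.out.println(isValid("Billing.Invoice.Created"));  // false: not lowercase
        System.out.println(isValid("billing.invoice"));          // false: missing a segment
    }
}
```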

Mistake 5: No schema governance

A topic is not just bytes. It is a contract between teams.

Without schema governance:

  • producers ship breaking changes silently
  • consumers fail in production unexpectedly
  • replay and backfill become risky

The fix: explicit compatibility policy

Use Avro/Protobuf + schema registry with enforced compatibility checks.

Pick policy per topic:

  • backward compatible for most event streams
  • full compatibility for cross-team shared contracts
  • explicit versioned topics when breaking changes are unavoidable

Treat schema changes like API changes: reviewed, tested, and announced.
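
As an example of making the policy explicit per topic, here is a sketch that sets backward compatibility for one subject via the Confluent Schema Registry REST API. The registry URL and subject name are placeholders, and the subject assumes the default topic-name strategy (<topic>-value):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";               // placeholder registry URL
        String subject = "billing.invoice.created-value";        // placeholder subject for the topic's value schema

        // PUT /config/{subject} sets the compatibility policy for that subject.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(registry + "/config/" + subject))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```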

Operational checks every Kafka team should run

Weekly:

  • partition skew by topic
  • consumer lag by group and partition
  • retry/dead-letter volume trends
  • broker disk pressure and segment growth
  • under-replicated partition incidents

Monthly:

  • topic retention review by usage class
  • partition-count forecast against throughput growth
  • schema compatibility incidents and near-misses

These checks catch expensive problems long before outages.
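
The consumer-lag check is straightforward to script against the Java AdminClient. A sketch, with the group id and bootstrap address as placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group (group id is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("billing-consumer").partitionsToOffsetAndMetadata().get();

            // Latest broker offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag per partition = latest offset - committed offset.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```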

Topic design review checklist

Before creating a new topic, confirm:

  1. What business contract does this topic represent?
  2. What ordering guarantees do consumers need?
  3. What key gives both ordering and acceptable distribution?
  4. What retention/compaction policy matches this topic class?
  5. What compatibility policy applies to schema changes?
  6. Who owns producer and consumer contract health?

If these answers are unclear, the topic design is incomplete.

Closing

Kafka reliability is mostly design discipline, not broker wizardry. Good topic contracts, keys, and governance policies prevent most painful migrations and operational fires.
