Kafka topic design: the 5 mistakes we see most often

Partitioning keys, retention, compaction, naming, and schema evolution. The design decisions that become expensive when wrong.

Anthra AI Team · Engineering Team · 6 min read
Table of contents
  1. Mistake 1: Choosing partition count by guessing
  2. Why it matters
  3. How to choose
  4. Mistake 2: Using bad partition keys (or none at all)
  5. Bad keys we've seen
  6. Good key patterns
  7. Mistake 3: Wrong retention and compaction settings
  8. The defaults are wrong for most workloads
  9. The right mental model
  10. Mistake 4: Poor topic naming
  11. The pattern we use
  12. Mistake 5: No schema governance
  13. The fix: explicit compatibility policy
  14. Operational checks every Kafka team should run
  15. Topic design review checklist
  16. Closing
  17. Related resources

Kafka's power is that it works. The trap is that it works even when your design is wrong — until one day it doesn't, and the fix involves rewriting consumers, backfilling partitions, or both.

These are the five topic-design mistakes we see most often in engagements, with the concrete patterns that avoid them.

Mistake 1: Choosing partition count by guessing

The most common mistake. A team creates a topic with 3 partitions because that's the Kafka documentation default, or with 100 because "we might need to scale." Both are wrong.

Why it matters

  • Too few partitions caps your consumer parallelism. If your topic has 3 partitions, at most 3 consumers in a group can process it simultaneously. Add a 4th consumer — it sits idle.
  • Too many partitions hurts in four ways: more broker metadata overhead, more open file handles, more leader elections during broker churn, and higher end-to-end latency from fanout.

How to choose

Work backwards from throughput targets:

  1. Measure (or estimate) the per-partition throughput your consumer can sustain. For a typical Java consumer doing light processing, that's ~10-50 MB/sec. For a heavy consumer with external calls, maybe 1-5 MB/sec.

  2. Take your peak expected throughput and divide. If you need 200 MB/sec and each consumer handles 20 MB/sec, you need at least 10 partitions.

  3. Multiply by 2-3× for headroom (traffic spikes, consumer slowdowns, rebalancing overhead).

  4. Round up to a number that divides cleanly by typical consumer group sizes.

For most topics, the right answer is between 6 and 30 partitions. Above 50, you should have a specific reason.
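
As a sketch of that arithmetic, here is the sizing worked through in code. Every number (200 MB/sec peak, 20 MB/sec per consumer, 2.5× headroom, group-size multiple of 6) is an illustrative assumption, not a measurement:

```java
// Rough partition-count sizing following the four steps above.
// All inputs are illustrative assumptions -- measure your own consumers.
public class PartitionSizing {
    public static void main(String[] args) {
        double peakThroughputMBps = 200.0;  // step 2: expected peak produce rate
        double perConsumerMBps = 20.0;      // step 1: measured per-partition consumer throughput
        double headroomFactor = 2.5;        // step 3: 2-3x buffer for spikes and rebalances
        int groupSizeMultiple = 6;          // step 4: typical consumer group size

        int minPartitions = (int) Math.ceil(peakThroughputMBps / perConsumerMBps);
        int withHeadroom = (int) Math.ceil(minPartitions * headroomFactor);
        int partitions = ((withHeadroom + groupSizeMultiple - 1) / groupSizeMultiple) * groupSizeMultiple;

        // Prints: min=10, with headroom=25, final=30
        System.out.printf("min=%d, with headroom=%d, final=%d%n", minPartitions, withHeadroom, partitions);
    }
}
```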

⚠️Partitions are hard to change

Increasing partitions breaks keyed ordering (keys may now hash to different partitions). Decreasing partitions requires recreating the topic and backfilling. Pick conservatively and revise deliberately.

Mistake 2: Using bad partition keys (or none at all)

Kafka routes messages to partitions by hashing the key. If your key is poorly chosen, you get hot partitions (one partition overwhelmed while others idle) or broken ordering guarantees.

Bad keys we've seen

  • null when ordering matters. Using no key means round-robin assignment — no ordering guarantee for related events. If all events for user X need to process in order, key them by user_id.
  • Timestamps as keys. Events sharing the same coarse timestamp all hash to one partition, so whichever partition owns "now" gets hammered and the hot spot simply rotates. Hot-spotting guaranteed.
  • Low-cardinality enums. Keying by country_code means all of India's events land on one partition. If India's your biggest market, that partition is your bottleneck.
  • High-cardinality but skewed keys. One "super user" generating 80% of events hammers one partition.

Good key patterns

  • Natural aggregate ID. User ID, order ID, device ID — whatever the events relate to.
  • Composite keys when you need stronger distribution. For example, (tenant_id, user_id) so load spreads even if one tenant is huge (sketched below).
  • Explicit repartitioning. If you have a naturally skewed key, add a random suffix to spread writes and re-establish ordering in a downstream step.
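
A minimal producer sketch of the composite-key pattern. The topic name, tenant and user values, and the ":" separator are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class CompositeKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Composite (tenant_id, user_id) key: per-user ordering, load spread within a big tenant.
            String key = "tenant-acme" + ":" + "user-42";
            producer.send(new ProducerRecord<>("billing.invoice.created", key, "{\"amount\": 99.0}"));
            producer.flush();
        }
    }
}
```

All events for the same user still hash to one partition, so per-user ordering holds, but a single large tenant no longer maps to a single partition.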

Mistake 3: Wrong retention and compaction settings

Every Kafka topic has two time-related knobs: how long to keep messages (retention.ms) and whether to compact (keep only the latest message per key). Teams routinely pick the wrong combination.

The defaults are wrong for most workloads

Kafka's default retention is 7 days. For most event streams this is fine — but:

  • Audit logs often need weeks or months of retention
  • Low-volume configuration topics should use log compaction, not time-based retention
  • High-volume firehoses often only need hours of retention — days of it wastes disk

The right mental model

Think of topics in three classes:

  1. Transient event streams (user clicks, page views, sensor readings)

    • Time-based retention: 3-14 days
    • No compaction
    • Replay consumers can reset to beginning of retention window if needed
  2. State-carrying topics (current user profile, current order status)

    • Log compaction enabled (cleanup.policy=compact)
    • Retain latest message per key forever
    • Used for materializing current-state tables
  3. Audit / compliance topics (financial transactions, access logs)

    • Long retention (30-365+ days)
    • Consider tiered storage (S3 backing) for cost
    • Often replicated to a secondary system for long-term analytics
💡Name topics for their class

Prefix topics so their class is obvious: events.page-views, state.user-profiles, audit.transactions. That makes retention policies easier to reason about and lets you apply defaults and tooling per class.
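
A sketch of what the three classes can look like when created with the Java AdminClient, reusing the class-prefixed names above. Partition counts, replication factor, and retention values are placeholders, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicClasses {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1. Transient event stream: time-based retention, no compaction.
            NewTopic events = new NewTopic("events.page-views", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // 2. State-carrying topic: compaction keeps the latest value per key.
            NewTopic state = new NewTopic("state.user-profiles", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            // 3. Audit topic: long time-based retention.
            NewTopic audit = new NewTopic("audit.transactions", 12, (short) 3)
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                    TopicConfig.RETENTION_MS_CONFIG, String.valueOf(365L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(events, state, audit)).all().get();
        }
    }
}
```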

Mistake 4: Poor topic naming

This sounds trivial. It's not. Bad topic names cause real pain when you have 200+ topics and three teams producing to them.

The pattern we use

Use a naming convention that encodes domain and purpose:

<domain>.<entity>.<event_or_state>

Examples:

  • billing.invoice.created
  • product.catalog.updated
  • analytics.pageview.events
  • identity.user.state

Rules that prevent drift:

  • lowercase only
  • no environment suffix in topic name (prod belongs in cluster/account separation)
  • avoid team names (teams change; domains persist)
  • differentiate event topics from state topics in naming
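
One lightweight way to enforce these rules is a name check in whatever tooling creates topics. A sketch, assuming a three-segment lowercase pattern; adapt the regex to your own domains:

```java
import java.util.regex.Pattern;

public class TopicNamePolicy {
    // <domain>.<entity>.<event_or_state>: lowercase, digits and hyphens allowed within segments.
    private static final Pattern TOPIC_NAME =
        Pattern.compile("^[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*$");

    public static boolean isValid(String name) {
        return TOPIC_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("billing.invoice.created"));  // true
        System.out.println(isValid("Billing.Invoice.Created"));  // false: not lowercase
        System.out.println(isValid("billing.invoice"));          // false: missing a segment
    }
}
```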

Mistake 5: No schema governance

A topic is not just bytes. It is a contract between teams.

Without schema governance:

  • producers ship breaking changes silently
  • consumers fail in production unexpectedly
  • replay and backfill become risky

The fix: explicit compatibility policy

Use Avro/Protobuf + schema registry with enforced compatibility checks.

Pick policy per topic:

  • backward compatible for most event streams
  • full compatibility for cross-team shared contracts
  • explicit versioned topics when breaking changes are unavoidable

Treat schema changes like API changes: reviewed, tested, and announced.
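
As an example of making the policy explicit per topic, here is a sketch that sets backward compatibility for one subject via the Confluent Schema Registry REST API. The registry URL and subject name are placeholders, and the subject assumes the default topic-name strategy (<topic>-value):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";               // placeholder registry URL
        String subject = "billing.invoice.created-value";        // placeholder subject for the topic's value schema

        // PUT /config/{subject} sets the compatibility policy for that subject.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(registry + "/config/" + subject))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```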

Operational checks every Kafka team should run

Weekly:

  • partition skew by topic
  • consumer lag by group and partition
  • retry/dead-letter volume trends
  • broker disk pressure and segment growth
  • under-replicated partition incidents

Monthly:

  • topic retention review by usage class
  • partition-count forecast against throughput growth
  • schema compatibility incidents and near-misses

These checks catch expensive problems long before outages.
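
The consumer-lag check is straightforward to script against the Java AdminClient. A sketch, with the group id and bootstrap address as placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group (group id is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("billing-consumer").partitionsToOffsetAndMetadata().get();

            // Latest broker offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag per partition = latest offset - committed offset.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```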

Topic design review checklist

Before creating a new topic, confirm:

  1. What business contract does this topic represent?
  2. What ordering guarantees do consumers need?
  3. What key gives both ordering and acceptable distribution?
  4. What retention/compaction policy matches this topic class?
  5. What compatibility policy applies to schema changes?
  6. Who owns producer and consumer contract health?

If these answers are unclear, the topic design is incomplete.

Closing

Kafka reliability is mostly design discipline, not broker wizardry. Good topic contracts, keys, and governance policies prevent most painful migrations and operational fires.
