Kafka for a beginner

Dilusha Dasanayaka
4 min read · Mar 29, 2019

LinkedIn needed to monitor and track its users' activities. The existing services were too clunky: the polling model used for monitoring was not compatible with the push model used for tracking, and the systems were too fragile to use for metrics. Developers at LinkedIn tried off-the-shelf solutions and prototyped systems using ActiveMQ to handle the required message traffic, but none of them could handle the scale.

Finally, the decision was made to move forward with custom infrastructure for the data pipeline. The development team at LinkedIn, led by Jay Kreps, set out to build a solution with the following goals:

  1. Decouple producers and consumers by using a push-pull model
  2. Provide persistence for message data within the messaging system to allow multiple consumers
  3. Optimize for high throughput of messages
  4. Allow for horizontal scaling of the system to grow as the data streams grew

The result was a publish/subscribe messaging system that had an interface typical of messaging systems but a storage layer more like a log-aggregation system. That system was called Kafka.

https://www.confluent.io/what-is-apache-kafka/

What is Apache Kafka

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform. ~ Confluent

Messages and Batches

Simply put, a message is an array of bytes; the data it contains has no specific format or meaning to Kafka. Messages are written into Kafka in batches. A batch is just a collection of messages, all of which are produced to the same topic and partition.
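The idea can be sketched in plain Python. This is only an illustration of the concept, not Kafka's actual client code; the topic name and payloads are made up:

```python
from collections import defaultdict

def batch_messages(messages):
    """Group raw messages into batches keyed by (topic, partition).

    Each message is a tuple of (topic, partition, payload). The payload
    is an opaque array of bytes -- Kafka attaches no meaning to it.
    """
    batches = defaultdict(list)
    for topic, partition, payload in messages:
        batches[(topic, partition)].append(payload)
    return dict(batches)

msgs = [
    ("clicks", 0, b'{"user": "a"}'),
    ("clicks", 0, b'{"user": "b"}'),
    ("clicks", 1, b'{"user": "c"}'),
]
# Two messages share topic "clicks" and partition 0, so they form one batch.
print(batch_messages(msgs))
```

Batching messages bound for the same topic and partition is what lets Kafka amortize the cost of each network round trip and disk write.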

Topics and Partitions

Messages are categorized into topics; the closest analogies for a topic are a database table or a folder in a filesystem. Topics are broken down into a number of partitions, which is also how Kafka provides redundancy and scalability, since each partition can be hosted on a different server.
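A toy model of this structure, assuming an imaginary "page-views" topic, might look like the following. Each partition is an ordered, append-only log; in a real cluster each one could live on a different broker, while here they are just lists in one process:

```python
class Topic:
    """Toy model of a topic: one ordered, append-only log per partition."""

    def __init__(self, name, num_partitions):
        self.name = name
        # In Kafka each partition could be hosted on a different server;
        # here they are in-memory lists, purely for illustration.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, payload):
        # Messages are only ever appended to the end of a partition.
        self.partitions[partition].append(payload)

topic = Topic("page-views", num_partitions=3)
topic.append(0, b"view-1")
topic.append(2, b"view-2")
print([len(p) for p in topic.partitions])  # → [1, 0, 1]
```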

Representation of a topic with multiple partitions

Producers and Consumers

Producers create new messages on a specific topic. By default, a message's key determines which partition it is written to, but the producer can also use a custom partitioner that follows other business rules for mapping messages to partitions. Consumers read messages by subscribing to one or more topics, and within each partition they read the messages in the order they were created.
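The two partitioning strategies can be sketched as plain functions. Note this is a simplification: Kafka's real clients use murmur2 hashing, and md5 here is just a dependency-free stand-in, while the "eu-west" routing rule is an invented example of a business rule:

```python
import hashlib

def default_partitioner(key: bytes, num_partitions: int) -> int:
    """Hash the key so the same key always lands on the same partition.

    Kafka's real default partitioner uses murmur2; md5 is a stand-in here.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def region_partitioner(key: bytes, num_partitions: int) -> int:
    """A hypothetical custom partitioner: give one busy key its own
    partition and hash everything else over the remaining partitions."""
    if key == b"eu-west":  # made-up business rule for illustration
        return 0
    return 1 + default_partitioner(key, num_partitions - 1)

# The same key deterministically maps to the same partition.
assert default_partitioner(b"user-42", 4) == default_partitioner(b"user-42", 4)
print(region_partitioner(b"eu-west", 4))  # → 0
```

Keeping a key on one partition is what preserves per-key ordering, since ordering is only guaranteed within a single partition.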

A consumer group reading from a topic

Brokers and Clusters

In Kafka, a single server is called a broker. It receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. A broker's performance depends on the specific hardware and its characteristics.
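The offset-assignment step can be sketched with a toy broker, where an in-memory list stands in for the on-disk log:

```python
class Broker:
    """Toy broker: receives messages, assigns sequential offsets per
    partition, and 'commits' them to an in-memory log (standing in
    for storage on disk)."""

    def __init__(self):
        self.logs = {}  # (topic, partition) -> list of (offset, payload)

    def produce(self, topic, partition, payload):
        log = self.logs.setdefault((topic, partition), [])
        offset = len(log)  # offsets increase monotonically per partition
        log.append((offset, payload))
        return offset

broker = Broker()
print(broker.produce("clicks", 0, b"m1"))  # → 0
print(broker.produce("clicks", 0, b"m2"))  # → 1
```

Because offsets are assigned per partition, a consumer can remember a single integer per partition to know exactly where to resume reading.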

Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker also functions as the cluster controller, which is responsible for administrative operations including assigning partitions to brokers and monitoring for broker failures.
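Partition assignment can be pictured as spreading partitions evenly across the available brokers. The round-robin scheme below is a deliberately simplified stand-in for what the controller does, not its actual algorithm:

```python
def assign_partitions(partitions, brokers):
    """Round-robin assignment of partitions to brokers -- a simplified
    illustration of the controller's job of spreading load evenly."""
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}

assignment = assign_partitions(
    ["clicks-0", "clicks-1", "clicks-2"],
    ["broker-1", "broker-2"],
)
print(assignment)
# → {'clicks-0': 'broker-1', 'clicks-1': 'broker-2', 'clicks-2': 'broker-1'}
```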

Replication of partitions in a cluster

Multiple Clusters

As the deployment grows, we can run multiple Kafka clusters. They are useful for:

  1. Segregation of types of data
  2. Isolation for security requirements
  3. Multiple datacenters (disaster recovery)

When working with multiple datacenters in particular, messages often need to be copied between them. However, the replication mechanisms within Kafka are designed to work only within a single cluster, not between multiple clusters.

Multiple datacenter architecture

Features of Kafka

  1. Multiple Producers
  2. Multiple Consumers
  3. Disk-Based Retention
  4. Scalable
  5. High Performance
