A Guide to Kafka Clustering in Ubuntu 18.04

Kekayan
7 min read · Mar 29, 2019


Welcome back with another guide. Let’s create a cluster and understand what Apache Kafka is. Kafka is a distributed publish-subscribe messaging system originally developed at LinkedIn. Kafka is often compared to Apache ActiveMQ or RabbitMQ, but the fundamental difference is that Kafka does not implement JMS (Java Message Service).

  1. What is Kafka
  2. The architecture of Kafka
  3. Diving into Components, Partition & Replication
  4. Features of Kafka
  5. Setting up a Kafka server on Ubuntu
  6. Setting Up a Kafka Cluster

What is Kafka?

A messaging system. That’s it? If so, why are people excited? Kafka is more than that: it’s a distributed streaming platform. That means we can use Kafka as

  • a Messaging System
  • a Storage System
  • for Stream Processing

You can read more in the official introduction at https://kafka.apache.org/intro.html. Let’s go through a quick overview and set up a small cluster on the local machine.

The architecture of Kafka

We will start with some basics on how Kafka works, the basic terminology, and the components.

Kafka is a messaging system. So we have messages. Messages are organized into topics. So when we need to send a message, we send it to a specific topic. When we want to read a message, we decide which topic we are going to read a message from. Producers push the messages. Consumers pull the messages. So Kafka is a consumer pull system.

Kafka Architecture

As a consumer, you subscribe to a topic and receive its messages. Kafka runs as a cluster; the different servers or nodes are called brokers.

Kafka has four main APIs:

  • Producer API
  • Consumer API
  • Streams API
  • Connector API

We can find more in the official documentation.

Diving into Components, Partition & Replication

Components: Topic

Kafka stores and organizes messages in collections known as topics.

Partition & Replication

We can replicate and partition topics. From the above image, we see that a topic can have multiple partitions. Partitions are how we parallelize a topic: if a topic holds a lot of data, we want to spread it over more than a single machine, so we split it into multiple partitions and put each partition on a separate machine. That gives us more memory and more disk space. Consumers can be parallelized too: we can have multiple consumers on the same topic, each reading different partitions. This lets us get more throughput out of the system.

Inside a partition, each message has an ID. Those IDs are called offsets. When we consume messages, we say: I want to consume messages from this offset and continue on. If, for example, the connection breaks and we want to start again, we can decide which offset to start with. And Kafka preserves the order.
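This behavior can be tried directly with the console consumer shipped with Kafka — a sketch, assuming a broker running on localhost:9092 and the TutorialTopic topic created later in this guide:

```shell
# Resume reading partition 0 at offset 5; Kafka delivers messages
# from that offset onwards, in the order they were written.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic TutorialTopic --partition 0 --offset 5
```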

Kafka Official Site

So a producer writes messages into a topic in a certain order, those messages get offsets in that same order, and when we read them we know we are getting them in the same order.

credits: Apache Kafka site

So we can see partitions 0, 1, and 2. Each partition has multiple replicas. That’s how we achieve high availability: each partition has copies on different servers, and if a server goes away, we fail over to one of the replicas.

Components: Kafka Producer

  • Pushes messages to a Kafka topic. Also responsible for choosing which record to assign to which partition within the topic.

Components: Kafka Consumer

  • Subscribes to a topic, pulls/reads messages from it, and processes them.

Components: Kafka Broker

  • A broker (also known as a server or instance) manages the storage and exchange of messages. In a cluster, there are multiple brokers.

Components: Kafka Zookeeper

  • Kafka uses ZooKeeper to provide brokers with metadata about the processes running in the system and to facilitate health checking and broker leadership election.
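Once ZooKeeper and the brokers from the setup below are running, we can peek at the metadata ZooKeeper holds — a sketch, assuming the default port 2181 used in this guide:

```shell
# Open an interactive ZooKeeper shell and inspect the metadata
# Kafka registers there.
bin/zookeeper-shell.sh localhost:2181
# then, inside the shell:
#   ls /brokers/ids       lists registered broker ids, e.g. [0, 1, 2]
#   get /brokers/ids/0    shows host, port and other metadata for broker 0
```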

Features of Kafka

As I said before, Kafka is not just a messaging system. It has more to it, which makes it stand out from other platforms. The following are some of those features.

  • Scalability :

Since Kafka has a distributed architecture, it gives the capability of scaling horizontally. This is achieved by partitioning a topic and distributing the partitions across different nodes. Another thing that helps with scalability is consumer groups. These two concepts combine to ensure that there is no downtime during a scaling process.
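As a sketch of how consumer groups spread load (assuming the brokers and the two-partition topic set up later in this guide): start two console consumers with the same --group id, and Kafka assigns each of them a subset of the topic’s partitions.

```shell
# Run each command in its own terminal. Both consumers join the same
# group, so the topic's partitions are divided between them; adding or
# removing a consumer triggers a rebalance rather than downtime.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic clusterPartionTopic --group demo-group
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic clusterPartionTopic --group demo-group
```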

  • Fault Tolerance

If a topic has N replicas in the cluster, the Kafka cluster can tolerate up to N-1 broker failures. This ensures availability as well as recovery of data when a broker dies.

  • Zero Data Loss

When writes are made synchronous and the acknowledgement is sent to the client only after replication completes, the Kafka cluster guarantees that no data is lost.
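That guarantee depends on the producer waiting for replication. With the console producer this can be requested explicitly — a sketch; acks=all makes the leader wait for the in-sync replicas before acknowledging:

```shell
# Ask for acknowledgement only after all in-sync replicas
# have the message.
bin/kafka-console-producer.sh --broker-list localhost:9092 \
  --topic TutorialTopic --producer-property acks=all
```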

  • Durability

A Kafka cluster writes data to disk on each transaction, so its data is durable. This is in direct contrast to Redis clustering, which keeps data in memory.

  • Stream Processing

Kafka’s specialty lies in processing streams of data. It has a dedicated Streams API for this purpose, with many functionalities like aggregations, joins, etc.
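The Kafka distribution bundles a small Streams example we can try — a sketch, assuming its expected input and output topics have been created first:

```shell
# WordCountDemo ships with Kafka: it reads lines from the topic
# streams-plaintext-input, counts words, and writes the running counts
# to streams-wordcount-output.
bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo
```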

Setting up a Kafka server on Ubuntu

Prerequisite: JVM

Download Kafka from here or with the following command in a terminal

wget https://archive.apache.org/dist/kafka/2.2.0/kafka_2.12-2.2.0.tgz

Let’s extract the file by the following command

tar -xzf kafka_2.12-2.2.0.tgz

cd into the extracted directory and start ZooKeeper first

bin/zookeeper-server-start.sh config/zookeeper.properties
ZooKeeper is now running (by default on port 2181).

Let’s start the Kafka server

bin/kafka-server-start.sh config/server.properties

Create a Topic

I am creating a topic called TutorialTopic

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TutorialTopic

Let’s see the topic

bin/kafka-topics.sh --list --zookeeper localhost:2181

Let’s push a message to the TutorialTopic we created

echo "Hello, World" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null

Let’s push another message, “Hello, from Kekayan”, so that later we can check in which order they are read.


Let’s consume from the topic to see the messages

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TutorialTopic --from-beginning
We can see it reads the first written message first, since we passed the --from-beginning flag.

As the following image shows, by running both side by side we can see how the producer and consumer work: when we push a message, it appears in the consumer’s console.

Setting Up a Kafka Cluster

We have played with a single broker, but the real fun is playing with multiple brokers. Let’s do it. First, we create the config files for the other two servers.

cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties

Edit both new configs as follows

config/server-1.properties:

broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

config/server-2.properties:

broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2

The broker.id property is the unique and permanent name of each node in the cluster. We only have to change the port and log directory because we are running all of these on the same machine.
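Instead of editing by hand, the same three changes can be scripted — a minimal sketch; the printf only creates a stand-in server.properties so the loop can be tried anywhere, whereas in a real setup you would start from Kafka’s own config/server.properties:

```shell
# Create a minimal stand-in base config (in a real setup, this file
# already exists as config/server.properties in the Kafka directory).
mkdir -p config
printf 'broker.id=0\nlisteners=PLAINTEXT://:9092\nlog.dirs=/tmp/kafka-logs\n' \
  > config/server.properties

# Derive one config per extra broker: unique id, port, and log directory.
for i in 1 2; do
  cp config/server.properties "config/server-$i.properties"
  sed -i "s|^broker.id=.*|broker.id=$i|" "config/server-$i.properties"
  sed -i "s|^listeners=.*|listeners=PLAINTEXT://:$((9092 + i))|" "config/server-$i.properties"
  sed -i "s|^log.dirs=.*|log.dirs=/tmp/kafka-logs-$i|" "config/server-$i.properties"
done

# Show the resulting broker ids.
grep broker.id config/server-1.properties config/server-2.properties
```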

Now start these two servers

bin/kafka-server-start.sh config/server-1.properties 
bin/kafka-server-start.sh config/server-2.properties

Let’s create a new topic with replication factor 3, since we have 3 servers now. I name it clusterTopic.

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic clusterTopic

Let’s check the topic

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic clusterTopic
We can see three replicas listed.
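The --describe output looks roughly like this (illustrative only — which broker ids appear, and which one is leader, depends on your run):

```
Topic:clusterTopic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: clusterTopic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
```

Leader is the broker currently serving reads and writes for the partition, Replicas lists all brokers holding a copy, and Isr is the subset of replicas currently in sync with the leader.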

Let’s create another topic with 2 partitions

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 2 --topic clusterPartionTopic
We can see two partitions now.

Now we can use one terminal as the producer and the others as consumers to check the cluster.

#create a producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic clusterPartionTopic
#consumer 1
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic clusterPartionTopic --from-beginning
#consumer 2
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic clusterPartionTopic --from-beginning
#consumer 3
bin/kafka-console-consumer.sh --bootstrap-server localhost:9094 --topic clusterPartionTopic --from-beginning
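To see the fault tolerance from the Features section in action, we can kill one broker and check that the topic survives — a sketch; replace <pid> with the process id you actually find:

```shell
# Find the process running broker 1 and note its PID.
ps aux | grep server-1.properties
kill -9 <pid>

# Describe again: if broker 1 was the leader, leadership moves to another
# in-sync replica, while Replicas still lists the dead broker.
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic clusterPartionTopic

# Messages are still served by the surviving brokers.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic clusterPartionTopic --from-beginning
```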

happy clustering :)

PS: I got great help and information from the official site; check https://kafka.apache.org/quickstart for more.
