Welcome back with another guide. Let’s create a cluster and understand what Apache Kafka is. Kafka is a distributed publish-subscribe messaging platform originally developed at LinkedIn. Kafka is often compared to Apache ActiveMQ or RabbitMQ, but the fundamental difference is that Kafka does not implement JMS (Java Message Service).
- What is Kafka
- The architecture of Kafka
- Diving into Components, Partitions & Replication
- Features of Kafka
- Setting up a Kafka server on Ubuntu
- Setting Up a Kafka Cluster
What is Kafka?
A messaging system. That’s it? If so, why are people so excited about it? Because Kafka is more than that: it’s a distributed streaming platform. That means we can use Kafka as
- a Messaging System
- a Storage System
- for Stream Processing
You can read more about it in the official introduction at https://kafka.apache.org/intro.html. Let’s take a quick overview and set up a small cluster on the local machine.
The architecture of Kafka
We will start with some basics: how Kafka works, the basic terminology, and the components.
Kafka is a messaging system, so we have messages. Messages are organized into topics: when we need to send a message, we send it to a specific topic, and when we want to read a message, we decide which topic to read it from. Producers push messages; consumers pull them. So Kafka is a consumer-pull system.
As a consumer, you subscribe to a topic and get its messages. Kafka runs as a cluster; we call the different servers or nodes brokers.
Kafka has four main APIs:
- Producer API
- Consumer API
- Streams API
- Connector API
We can find more in the official documentation.
Diving into Components, Partitions & Replication
Components: Topic
Kafka stores and organizes messages in collections known as topics.
Partition & Replication
We can replicate and partition topics. A topic can have multiple partitions. Partitions are how we parallelize a topic: if a topic holds a lot of data, we want it on more than a single machine, so we split the topic into multiple partitions and can then put each partition on a separate machine. That gives us more memory and more disk space. Consumers can also be parallelized: we can have multiple consumers on the same topic, each reading different partitions. This lets us get more throughput out of the system.
Inside a partition, each message has an ID. Those IDs are called offsets. When we consume messages, we say: I want to consume messages starting from this offset and continue on. If, for example, the connection breaks and we want to start again, we can decide which offset to start with. And Kafka preserves the order.
So when a producer writes messages into a topic in a certain order, they get offsets in that same order, and when we read them, we know we are getting them in the same order.
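The console consumer that ships with Kafka can resume from a chosen offset. A minimal sketch, assuming the single-broker setup from later in this guide is already running and the topic TutorialTopic exists:

```shell
# Re-read partition 0 of TutorialTopic starting from offset 0
# (requires a running broker on localhost:9092)
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic TutorialTopic --partition 0 --offset 0
```

Because an offset identifies a position inside one partition, `--partition` must be given alongside `--offset`.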
So we can see partitions 0, 1, and 2, and each partition has multiple replicas. That’s how we get high availability: the partition has copies on different servers, and if a server goes away, we fail over to a replica on another broker.
Components: Kafka Producer
- Pushes messages to a Kafka topic. It is also responsible for choosing which record to assign to which partition within the topic.
Components: Kafka Consumer
- Subscribes to a topic, then pulls/reads and processes messages from it.
Components: Kafka Broker
- Brokers, also known as servers/instances, manage the storage and exchange of messages. In a cluster, there are multiple brokers.
Components: Kafka Zookeeper
- To offer the brokers metadata about the processes running in the system, and to facilitate health checks and broker leadership election, Kafka uses Apache ZooKeeper.
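You can peek at this metadata directly with the zookeeper-shell tool bundled with Kafka; a sketch, assuming ZooKeeper is running on localhost:2181 as in the setup below:

```shell
# List the ids of the brokers currently registered in ZooKeeper
bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids
```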
Features of Kafka
Like I said before, Kafka is not just a messaging system. It has more to it, which makes it stand out from other platforms. Following are some of those features.
- Scalability
Since Kafka has a distributed architecture, it can scale horizontally. This is achieved by partitioning a topic and distributing the partitions across different nodes. Another thing that helps with scalability is consumer groups. These two concepts combine to ensure that there is no downtime during scaling.
- Fault Tolerance
If there are N replicas of a partition, the Kafka cluster can tolerate up to N-1 broker failures. This ensures availability as well as recovery of data when a broker dies.
- Zero Data Loss
Since the writing process can be made synchronous, with the acknowledgement sent to the client only after replication completes, the Kafka cluster can guarantee that no data is lost.
- Durability
The Kafka cluster writes the data to disk on each transaction, hence its data is durable. This is in total contrast to Redis clustering, which keeps data primarily in memory.
- Stream Processing
Kafka’s specialty lies in processing streams of data. It has a dedicated Streams API for this purpose, with functionality such as aggregations, joins, etc.
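To tie a couple of these features together: the replication acknowledgement behind Zero Data Loss can be requested explicitly from the console producer. A sketch, assuming a broker on localhost:9092 and the TutorialTopic created below:

```shell
# acks=all: the leader acknowledges a message only after all
# in-sync replicas have written it (strongest delivery guarantee)
bin/kafka-console-producer.sh --broker-list localhost:9092 \
  --topic TutorialTopic --producer-property acks=all
```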
Setting up a Kafka server on Ubuntu
Prerequisite: JVM
Download Kafka from the Apache archive, or with the following command in a terminal
wget https://archive.apache.org/dist/kafka/2.2.0/kafka_2.12-2.2.0.tgz
Let’s extract the file with the following command
tar -xzf kafka_2.12-2.2.0.tgz
cd into the extracted directory (kafka_2.12-2.2.0) and start ZooKeeper first
bin/zookeeper-server-start.sh config/zookeeper.properties
Let's start Kafka server
bin/kafka-server-start.sh config/server.properties
Create a Topic
I am creating a topic called TutorialTopic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TutorialTopic
Let’s see the topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
Let’s Push a message to the TutorialTopic we created
echo "Hello, World" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null
Let’s push another message, “Hello, from Kekayan”, so that later we can check the order in which messages are read.
Let’s pull from the topic to see the messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TutorialTopic --from-beginning
As the following image shows, by running both side by side we can see how the producer and consumer work: when we push a message, it appears in the consumer’s console.
Setting Up a Kafka Cluster
We have played with a single broker, but the real fun is playing with a multi-broker cluster. Let’s do it. First, we create the config files for the other two servers.
cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties
Edit both new config files as follows
config/server-1.properties:
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2
The broker.id property is the unique and permanent name of each node in the cluster. We only have to change the port and log directory because we are running all of these on the same machine.
Now start these two servers
bin/kafka-server-start.sh config/server-1.properties
bin/kafka-server-start.sh config/server-2.properties
Let’s create a new topic with replication factor 3, since we have 3 brokers now. I’ll name it clusterTopic.
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic clusterTopic
Let’s check the topic
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic clusterTopic
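The describe output shows, per partition, the Leader (the broker serving reads and writes), Replicas (the brokers holding a copy), and Isr (the replicas currently in sync). We can also use it to watch fault tolerance in action; a sketch, assuming the three brokers above are running:

```shell
# Note which broker id is the Leader for partition 0, then kill that
# broker's process (e.g. Ctrl+C in its terminal) and describe again:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic clusterTopic
# After the leader dies, a new Leader is elected from the Isr list
# and the topic remains available
```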
Let’s create another topic with 2 partitions
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 2 --topic clusterPartionTopic
Now we can use one terminal as the producer and connect consumers to each broker to check the cluster.
#create a producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic clusterPartionTopic
#consumer 1
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic clusterPartionTopic --from-beginning
#consumer 2
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic clusterPartionTopic --from-beginning
#consumer 3
bin/kafka-console-consumer.sh --bootstrap-server localhost:9094 --topic clusterPartionTopic --from-beginning
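The three consumers above are independent, so each one receives every message. To instead split the two partitions of clusterPartionTopic between consumers, start them in the same consumer group (the group name my-group here is just an example):

```shell
# Each consumer in a group is assigned a subset of the partitions
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic clusterPartionTopic --group my-group
# In a second terminal, join the same group; Kafka rebalances the
# two partitions so each consumer reads one of them
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 \
  --topic clusterPartionTopic --group my-group
```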
happy clustering :)
PS: I got great help and information from the official site; check https://kafka.apache.org/quickstart for further details.