kafka topic, partition description (producer, consumer, replication-factor, offset)
Hello, this is codeshow.
In this video, we will learn about topics and partitions in kafka.
To use kafka, you must have a topic.
Based on a topic, producers publish messages,
and consumers subscribe to them.
Run kafka to create a topic.
Please refer to the previous video for environment settings.
Run devcontainers for kafka practice.
Wait until the container runs.
When the container has finished running, open docker desktop.
Execute a shell in the kafka container.
To list topics, use the list option of kafka-topics.
kafka-topics --bootstrap-server kafka:9092 --list
I confirmed that there are no topics except the consumer offsets and schemas topics created by default.
Create a topic using the create option of the kafka-topics command.
kafka-topics --bootstrap-server kafka:9092 --create --topic hello --partitions 1 --replication-factor 1
I created a topic named hello with the kafka-topics command.
The topic name is specified with the topic option.
The replication-factor option selects how many copies of each partition's data to make.
The replication factor can be set up to the number of kafka nodes.
Higher values incur the cost of copying partitions between nodes, which affects performance.
On the other hand, if the value is small, data may be lost when kafka fails.
A value of 3 or higher is usually recommended.
For reference, an error occurs if you specify a replication factor greater than the number of kafka nodes at create time.
kafka-topics --bootstrap-server kafka:9092 --create --topic error --partitions 1 --replication-factor 4
I will explain partitions with another option.
Partitions are very important in kafka.
The partitions option sets how many pieces a topic is divided into.
If a producer publishes 100 messages and the topic has only 1 partition,
all 100 messages are stacked in that 1 partition.
If a consumer takes 1 minute to process 1 message,
it will take 100 minutes to process 100 messages.
If you set the partition count to 4, the messages are stacked 25 per partition across the 4 partitions.
Since one partition can be assigned to at most one consumer in a group,
you can now add 3 more consumers to match the added partitions.
Processing time drops to 1/4, from 100 minutes to 25 minutes.
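The arithmetic above can be sketched as a small simulation. This is illustrative Python, not actual kafka code; the function name is made up for this example.

```python
# Illustrative sketch: spread messages round-robin across partitions and
# estimate total processing time when one consumer per partition works in
# parallel at a fixed rate. Not real kafka code.
def processing_minutes(num_messages: int, num_partitions: int,
                       minutes_per_message: int = 1) -> int:
    partitions = [0] * num_partitions
    for i in range(num_messages):
        partitions[i % num_partitions] += 1  # round-robin placement
    # consumers run in parallel, so total time is the fullest partition
    return max(partitions) * minutes_per_message

print(processing_minutes(100, 1))  # 100 minutes with 1 partition
print(processing_minutes(100, 4))  # 25 minutes with 4 partitions
```
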
We will practice the contents so far using the shell.
I will practice producer and consumer on the hello topic.
Run a terminal in the kafka container.
Place the producer on the left and the consumer on the right.
Connect to kafka using the kafka-console-producer command.
kafka-console-producer --bootstrap-server kafka:9092 --topic hello
The producer is connected to the kafka broker.
Connect to kafka using the kafka-console-consumer command.
Partition numbers increase sequentially from index 0.
In this exercise, since the partition count is 1, only partition 0 can be used.
Enter partition number 0 as an option.
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 0
Enter a hello message in the producer.
You can check hello in the consumer.
So, can multiple consumers connect to one partition?
Let’s run an additional terminal and add a consumer.
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 0
The consumer connects without problems.
So what happens when you publish a message as a producer?
There are two scenarios.
- Messages are delivered one at a time, in turn, to each consumer. This is called the round robin method.
- All messages are delivered to all consumers. This is called broadcasting.
Enter message hello1 as producer to check the behavior.
The hello1 message appears on both consumers.
Enter the hello2 message again as producer.
The hello2 message appears on both consumers.
In kafka, all consumers connected to the same partition receive messages through broadcasting.
For reference, rabbitmq, another message broker, can connect multiple workers to one queue.
With round robin, the workers receive messages in turn.
This works differently from kafka, so those who have used rabbitmq first can be confused.
rabbitmq increases throughput by increasing the number of workers on one queue,
while kafka increases throughput by increasing the partition count of one topic.
In this exercise, we will change the number of partitions.
Let's increase the partition count from 1 to 2.
Exit the producer by pressing ctrl c.
Let's increase the partition count to 2 using the alter option of the kafka-topics command.
kafka-topics --bootstrap-server kafka:9092 --alter --partitions 2 --topic hello
Check the partition information with the describe option of kafka-topics.
kafka-topics --bootstrap-server kafka:9092 --topic hello --describe
For reference, you can also check the increased partition count of 2 on the topic screen of AKHQ.
Execute the producer again and add a consumer for the newly added partition 1.
kafka-console-producer --bootstrap-server kafka:9092 --topic hello
Terminate the consumers connected to the existing partition 0, then connect one consumer to partition 0 and another to the new partition 1.
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 0
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 1
Produce several messages with the producer.
There are two consumers, but the messages are delivered to only one of them.
This is how kafka's sticky partitioner works when there is no key.
If there is no key, the first chosen partition number is cached and messages are forwarded only to that partition.
Without a key, it is meaningless to increase the partition count because only one partition is used.
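The keyless behavior described here can be modeled like this. It is a simplified, hypothetical sketch of the idea (class and method names are made up; kafka's actual sticky partitioner is more sophisticated, e.g. it can re-pick the partition between batches):

```python
import random

# Simplified model: the first randomly chosen partition is cached and
# reused for every keyless message. Illustrative only, not real kafka.
class StickyPartitioner:
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.cached = None  # no partition chosen yet

    def partition_for(self, key) -> int:
        if key is None:
            if self.cached is None:
                # pick once, then stick to it
                self.cached = random.randrange(self.num_partitions)
            return self.cached
        # keyed messages are hashed instead (stand-in hash; real kafka
        # uses murmur2)
        return abs(hash(key)) % self.num_partitions

p = StickyPartitioner(2)
chosen = p.partition_for(None)
# every keyless message lands in the same cached partition
print(all(p.partition_for(None) == chosen for _ in range(100)))  # True
```
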
In other words, the key is the information that determines which partition a message is stored in.
We will add a key so that messages are well divided among the partitions.
Add property options to the producer and run it again.
kafka-console-producer --bootstrap-server kafka:9092 --topic hello --property "key.separator=-" --property "parse.key=true"
Let's enter some random values in the producer.
Now you can see that the messages are divided among the partitions and saved.
In particular, you can check through the consumers that the same key is published to the same partition.
With the key.separator option, the text before the dash is the key and the text after it is the value.
In a-b, a is the key and b is the value.
For reference, earlier only the value hello was published, without a key.
If the values preceding the dash are the same, the key is the same.
Importantly, the same key is always stored in the same partition.
However, different keys may be stored in the same partition or in different partitions.
To understand this, let's look at how kafka handles keys.
kafka hashes the key, takes the remainder when divided by the number of partitions, and uses the result as the partition number.
Since there are now two partitions, only 0 and 1 can appear as partition numbers.
For reference, the hash algorithm used is murmur2.
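The key-to-partition mapping can be sketched as follows. Kafka's default partitioner really uses murmur2 on the key bytes; here an md5-based stand-in is used purely to show the hash-then-modulo idea, and the function name is made up:

```python
import hashlib

# Illustrative key-to-partition mapping: hash the key, then take the
# remainder by the partition count. Real kafka uses murmur2, not md5.
def partition_for_key(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")
    return h % num_partitions

# The same key always maps to the same partition number...
print(partition_for_key("a", 2) == partition_for_key("a", 2))  # True
# ...and with 2 partitions, only 0 and 1 can appear.
print(partition_for_key("a", 2), partition_for_key("b", 2))
```
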
An important point to understand about partitions is that kafka only guarantees message order within a partition.
That is, when publishing messages whose order matters, put them in the same partition.
We confirmed that the same key is always stored in the same partition,
so for messages where order is important, be sure to use the same key!
Next, I will explain the offset of a partition.
kafka differs from existing message queues in that it does not delete messages from the queue after they are consumed.
Partitions are similar to arrays.
The messages in a partition are stored in order, like the elements of an array.
Just as you access an array value by index, you can consume from a specific position in the partition through the offset.
For reference, each partition keeps track of its last offset.
So, if a new consumer connects without the offset option, it consumes from the last offset.
However, if an offset option is given, it consumes starting from that offset.
As an exercise, we will issue two commands in the terminal.
- Run a consumer starting from offset 0.
- Run a consumer without the offset option.
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 1 --offset 0
kafka-console-consumer --bootstrap-server kafka:9092 --topic hello --partition 1
If you give offset 0 as an option, you can check that consumption starts again from the beginning.
If the offset option is not present, it defaults to latest,
consuming only newly published messages.
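The offset behavior above can be modeled with a plain list. This is an illustrative sketch only; the variable and function names are made up:

```python
# A partition modeled as an append-only list: the offset is just an index,
# and consuming does not delete anything.
partition = []

for msg in ["hello", "hello1", "hello2"]:
    partition.append(msg)  # each append gets the next offset: 0, 1, 2

def consume_from(log, offset):
    # replay everything from the given offset onward, like --offset 0
    return log[offset:]

print(consume_from(partition, 0))               # from the beginning
print(consume_from(partition, len(partition)))  # "latest": only future messages
```
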
Finally, an important constraint on partitions:
you cannot set the partition count to a smaller value.
You can only increase it.
I will run the alter command to reduce the partitions from 2 to 1.
kafka-topics --bootstrap-server kafka:9092 --alter --partitions 1 --topic hello
You can see an error message saying that the partition count must be larger than the current value of 2.
This concludes the explanation of kafka's topics and partitions.
Next time, we will look at how to use consumer groups.
For reference, in practice consumer groups are used more often
than consumers consuming partitions directly.
To better understand consumer groups,
we took this time to learn about partitions.
Subscribing, liking, and setting notifications are very helpful to content creators.
Thank you.