Apache Kafka

Apache Kafka is an open-source distributed event-streaming platform developed by the Apache Software Foundation and written in Scala and Java. It stores and processes streams of data in real time, acting as a broker between producers and consumers in a publish/subscribe messaging system. Large deployments handle trillions of events per day. Just as we store transactional data in a database so that we can retrieve it later to make business decisions, Kafka stores data durably in the form of messages.

Message

The unit of data within Kafka is called a message. The payload can be XML, JSON, a plain string, or anything else. Kafka does not interpret the content: to the broker, a message is simply an array of bytes, so it stores whatever it is given. Many Kafka developers favor Apache Avro, a serialization framework originally developed for Hadoop.
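Because the broker only sees bytes, turning an application object into a message and back is the client's job. The following is a minimal sketch of that round trip using JSON from the Python standard library. The function names `serialize` and `deserialize` and the sample event are illustrative assumptions, not part of any Kafka client API.

```python
import json


def serialize(event: dict) -> bytes:
    """Encode a dict as UTF-8 JSON bytes (hypothetical helper, not a Kafka API)."""
    # sort_keys makes the byte output deterministic for the same dict.
    return json.dumps(event, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


event = {"user": "alice", "action": "login"}  # made-up sample data
payload = serialize(event)

print(payload)                        # b'{"action": "login", "user": "alice"}'
print(deserialize(payload) == event)  # True
```

A schema-based format such as Avro would replace the two helpers while keeping the same shape: object in, bytes out, and the reverse on the consumer side.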

Batch

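A batch is a collection of messages, all of which are being produced to the same topic and partition. Sending messages in batches reduces network overhead at the cost of a little extra latency.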
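To make the idea concrete, here is a rough sketch of how a producer might group outgoing records into batches. It assumes batches are keyed by topic and partition and closed when they reach a fixed message count. The function `group_into_batches` and the `max_batch_size` parameter are hypothetical; real clients typically limit batches by size in bytes and by time.

```python
from collections import defaultdict


def group_into_batches(records, max_batch_size=3):
    """Group (topic, partition, value) records into per-partition batches.

    Returns a list of ((topic, partition), [values]) in the order the
    batches were closed. Hypothetical helper for illustration only.
    """
    pending = defaultdict(list)
    batches = []
    for topic, partition, value in records:
        key = (topic, partition)
        pending[key].append(value)
        if len(pending[key]) == max_batch_size:
            # Batch is full: emit it and start a fresh one for this key.
            batches.append((key, pending.pop(key)))
    # Flush whatever is left in partially filled batches.
    for key, values in pending.items():
        batches.append((key, values))
    return batches


records = [
    ("activity-log", 0, "a"),
    ("activity-log", 1, "b"),
    ("activity-log", 0, "c"),
    ("activity-log", 0, "d"),
    ("activity-log", 0, "e"),
]
for batch in group_into_batches(records):
    print(batch)
```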

Topics

A topic is a named category to which a particular stream of data is published and in which it is stored. Topics in Kafka are loosely comparable to tables in a database, but without the constraints a table enforces.

For example, consider a topic named “activity-log” that has 3 partitions, named:

  • activity-log-0

  • activity-log-1

  • activity-log-2

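When a source system sends messages to the activity-log topic, each message (1 to n) is stored in exactly one of these partitions. If a message has a key, the partition is normally chosen by hashing the key; messages without a key are spread across the partitions to balance load.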
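The choice of partition can be sketched in a few lines. This is an illustration under stated assumptions, not Kafka's actual partitioner: it uses CRC32 from the standard library as a stand-in hash (Kafka's Java client uses a different hash function) and simple round-robin for messages without a key. The class name `Partitioner` and its methods are hypothetical.

```python
import zlib


class Partitioner:
    """Toy partitioner: hash keyed messages, round-robin keyless ones."""

    def __init__(self, topic, num_partitions):
        self.topic = topic
        self.num_partitions = num_partitions
        self._next = 0  # next partition for keyless messages

    def partition_for(self, key=None):
        if key is None:
            partition = self._next
            self._next = (self._next + 1) % self.num_partitions
            return partition
        # Stand-in hash: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def partition_name(self, partition):
        return f"{self.topic}-{partition}"


partitioner = Partitioner("activity-log", 3)
for key in ["user-1", "user-2", "user-1", None, None]:
    p = partitioner.partition_for(key)
    print(key, "->", partitioner.partition_name(p))
```

The property that matters is that all messages with the same key land in the same partition, which is what preserves per-key ordering.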

Brokers

A single Kafka server is called a broker. The broker receives messages from producer clients, assigns and maintains their offsets, and writes the messages to storage on disk.
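The role of offsets is easier to see with a toy model. The sketch below treats one partition as an in-memory append-only list, where a message's offset is simply its position in the log. The class `PartitionLog` and its methods are hypothetical names for illustration; a real broker persists the log to disk in segment files.

```python
class PartitionLog:
    """Toy append-only log for a single partition (in memory only)."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Store a message and return the offset assigned to it."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset: int, max_messages: int = 10):
        """Return up to max_messages (offset, message) pairs from offset on."""
        chunk = self._messages[offset:offset + max_messages]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
print(log.append(b"first"))   # 0
print(log.append(b"second"))  # 1
print(log.read(1))            # [(1, b'second')]
```

Because reading does not remove anything, a consumer can re-read from any earlier offset, which is the basis for replay.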

Features

  • High Throughput: Support for millions of messages on modest hardware

  • Scalability: A highly scalable distributed system that can grow with no downtime

  • Replication: Messages are replicated across the cluster so that they remain available to multiple subscribers, and consumers are rebalanced in case of failures

  • Durability: Messages are persisted to disk

  • Stream Processing: Works with real-time streaming applications such as Apache Spark and Storm

  • Data Loss: With proper configuration, Kafka can ensure zero data loss

Kafka Architecture


ZooKeeper

Kafka uses ZooKeeper to manage the cluster and to coordinate the brokers and the cluster topology. (Newer Kafka releases can run without ZooKeeper, using the built-in KRaft mode instead.) ZooKeeper notifies Kafka of topology changes, so each node in the cluster knows when a broker has joined, a broker has died, or a topic has been added or removed.

Messaging System

A messaging system is a means of exchanging messages between two or more applications or devices. The sender, known as a producer, publishes messages; the receiver, known as a consumer, subscribes to them and consumes them.
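The producer/consumer relationship can be illustrated with a tiny in-memory publish/subscribe bus. This is only a sketch of the pattern, with no networking, storage, or delivery guarantees; `MessageBus`, `subscribe`, and `publish` are hypothetical names, not a Kafka API.

```python
from collections import defaultdict


class MessageBus:
    """Minimal in-memory publish/subscribe bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


bus = MessageBus()
received = []
bus.subscribe("activity-log", received.append)            # consumer 1
bus.subscribe("activity-log", lambda m: print("got", m))  # consumer 2

bus.publish("activity-log", "user alice logged in")  # prints: got user alice logged in
print(received)  # ['user alice logged in']
```

Note that the producer never references a consumer directly; the topic name is the only thing the two sides share.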

Kafka Uses

  • Stream Processing

  • Website Activity Tracking

  • Metrics Collection and Monitoring

  • Log Aggregation

  • Real-time analytics

  • Capture and ingest data into Spark / Hadoop

  • CQRS, replay, error recovery

  • Guaranteed distributed commit log for in-memory computing

Kafka vs JMS

Kafka and JMS are both used for messaging. Kafka combines the two messaging models described below, queue and publish/subscribe: it allows scaling by sharing the work among members of the same consumer group, and it also allows broadcasting the same message to many different consumer groups. Kafka also rebalances automatically when a consumer joins or leaves a consumer group.

JMS (Java Message Service) is a Java API used for implementing messaging in an application. JMS supports two models: queue (point-to-point) and publish/subscribe (topic). With queues, once a consumer consumes a message, it is removed from the queue and no other consumer can receive it. With topics, every subscriber receives each message, but it is much harder to scale the processing.
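The way consumer groups give Kafka both behaviours can be sketched as a partition assignment problem. Within one group, each partition is owned by exactly one member (queue-like sharing of work); across groups, every group is assigned all partitions (topic-like broadcast). The function below uses plain round-robin assignment, which is an assumption made for simplicity and not necessarily the strategy a real Kafka client is configured with. The group and member names are invented.

```python
def assign_partitions(partitions, members):
    """Round-robin the partitions among the members of one consumer group."""
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]
groups = {"billing": ["b1", "b2"], "audit": ["a1"]}  # hypothetical groups

# Each group independently covers every partition (broadcast across groups),
# while inside a group the partitions are split among members (load sharing).
for name, members in groups.items():
    print(name, assign_partitions(partitions, members))
# billing {'b1': [0, 2], 'b2': [1]}
# audit {'a1': [0, 1, 2]}

# A new consumer joins "billing": recomputing the assignment models a rebalance.
print(assign_partitions(partitions, ["b1", "b2", "b3"]))
# {'b1': [0], 'b2': [1], 'b3': [2]}
```

With more members than partitions, the extra members would receive no partitions in this sketch, which mirrors why the partition count caps the useful parallelism of a group.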