Apache Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. It is designed to handle real-time data streams efficiently and is widely used for building scalable, high-throughput data pipelines. It acts as a broker between producers and consumers in a publish/subscribe messaging system, and it can handle trillions of data events in a day. Just as we store transactional data in a database so that we can retrieve it later to make business decisions, Kafka stores data in the form of messages.
The unit of data within Kafka is called a message. It can be XML, JSON, a String, or anything else. Many Kafka developers favor Apache Avro, a serialization framework originally developed for Hadoop. Kafka itself does not care about the format; it stores everything as raw bytes.
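As a quick illustration, here is a minimal Java producer sketch that publishes a plain String message. The broker address localhost:9092 and the topic name activity-log are assumptions for a local development setup:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PlainStringProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The value here happens to be a JSON string, but Kafka just sees bytes.
            producer.send(new ProducerRecord<>("activity-log", "user-42", "{\"event\":\"login\"}"));
        }
    }
}
```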
A batch is a collection of messages, all of which are produced to the same topic and partition. Batching helps optimize the performance and throughput of the system.
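Building on the props object from the producer sketch above, two standard producer settings govern batching; the values below are illustrative assumptions, not tuning advice:

```java
// Accumulate up to 32 KB of messages per partition before sending a batch...
props.put("batch.size", 32768);
// ...but wait at most 10 ms for a batch to fill before sending it anyway.
props.put("linger.ms", 10);
```

Larger batches trade a little latency for better compression and throughput.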
A topic is a named stream to which a particular kind of data is published and in which it is stored. Topics in Kafka are similar to tables in a database, but without all the constraints.
For example, consider a topic named “activity-log” that has 3 partitions:
activity-log-0
activity-log-1
activity-log-2
When a source system sends messages to the activity-log topic, each message is written to exactly one of these partitions. The partition is chosen by hashing the message key when one is present; otherwise the producer spreads messages across partitions itself.
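Such a topic can be created programmatically with Kafka's admin client. The sketch below creates activity-log with 3 partitions, assuming a single local broker (hence replication factor 1):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateActivityLogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (fine for a dev box, not production)
            NewTopic topic = new NewTopic("activity-log", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```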
A single Kafka server is called a broker. The broker receives messages from producer clients, assigns and maintains their offsets, and persists the messages to disk.
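To see offsets in action, a minimal consumer sketch can print the partition and offset the broker assigned to each message; the group name offset-demo is a made-up assumption:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetPrintingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "offset-demo");             // assumed group name
        props.put("auto.offset.reset", "earliest");       // read the log from the start
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("activity-log"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries the partition and offset the broker assigned to it.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```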
High Throughput: Support for millions of messages with modest hardware.
Scalability: A highly scalable distributed system with no downtime.
Replication: Messages are replicated across the cluster to support multiple subscribers and to rebalance consumers in case of failures.
Durability: Messages can be persisted to disk.
Stream Processing: Used with real-time streaming applications like Apache Spark and Storm.
Data Loss: With proper configuration, Kafka can ensure zero data loss.
Kafka uses ZooKeeper to manage the cluster. ZooKeeper coordinates the brokers and the cluster topology, and it notifies Kafka of topology changes, so each node in the cluster knows when a new broker has joined, a broker has died, a topic was removed, a topic was added, and so on.
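In a ZooKeeper-based deployment, each broker is pointed at the ensemble through its server.properties file. A minimal sketch, in which the host names and the /kafka chroot path are assumptions:

```properties
# config/server.properties (excerpt)
broker.id=0
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
```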
A messaging system is the exchange of messages between two or more parties. The sender, known as a producer, publishes messages, and the receiver, known as a consumer, consumes them by subscribing to them.
Stream Processing: Real-time data processing and analytics.
Website Activity Tracking: Monitoring and analyzing user interactions.
Metrics Collection and Monitoring: Aggregating and visualizing system performance metrics.
Log Aggregation: Consolidating logs from various sources for analysis.
Real-Time Analytics: Immediate analysis of streaming data.
Data Integration: Capturing and ingesting data into systems like Apache Spark or Hadoop.
CQRS and Error Recovery: Supporting command-query responsibility segregation and recovering from errors.
Distributed Commit Log: Providing a reliable commit log for in-memory computing.
Kafka and JMS are both messaging systems. Kafka combines the two JMS models described below: it allows scaling by sharing work among members of the same consumer group, while also allowing the same message to be broadcast to many different consumer groups. Kafka also rebalances partitions automatically when a consumer joins or leaves a consumer group.
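In code, a consumer group is just a group.id plus a subscription. The sketch below also registers a rebalance listener so the reassignment of partitions is visible when members join or leave; the group name activity-consumers is an assumption:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupMemberConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "activity-consumers");      // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // The listener fires whenever the group rebalances, i.e. whenever a
        // consumer joins or leaves and Kafka redistributes the partitions.
        consumer.subscribe(Collections.singleton("activity-log"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Revoked: " + partitions);
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Assigned: " + partitions);
            }
        });
        while (true) {
            consumer.poll(Duration.ofMillis(500)).forEach(record ->
                    System.out.printf("%s (partition %d)%n", record.value(), record.partition()));
        }
    }
}
```

Running a second copy of this program with the same group.id splits the partitions between the two instances (queue-like behavior); running it with a different group.id gives that instance its own full copy of the stream (topic-like behavior).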
JMS (Java Message Service) is a Java API used for implementing messaging in your application. JMS supports queue and publisher/subscriber (topic) messaging. With queues, once the first consumer consumes a message, the message is deleted from the queue and no other consumer can take it. With topics, multiple consumers receive each message, but this model is much harder to scale.