Apache Kafka
These are simply the notes I keep while learning: I pick a topic to study and try to take it apart from every angle. The notes are meant to help me review later; feel free to reach out and exchange ideas so we can grow together. Not only about technology.
Introduction to Kafka and the MQ-oriented programming model
Integrating Apache Kafka with Spring Boot
The MQ-oriented programming style:
http://kafka.apache.org/intro is the official site; everything covered below comes from there.
Introduction
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
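As a quick illustration of the Producer API, here is a minimal sketch in Java; the broker address localhost:9092 and the topic name "demo-topic" are assumptions for this example, not anything fixed by Kafka:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and topic name are placeholders for this sketch.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a single (key, value) record to the topic.
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }
    }
}
```

The Consumer, Streams, and Connector APIs are illustrated in the sections that follow.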
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
Topics and Logs
Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
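Because the consumer controls its own offset, replaying old data is just a matter of seeking. A minimal sketch, assuming a broker at localhost:9092, a topic named "demo-topic", and a group id "replay-demo":

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "replay-demo");                // assumed group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            consumer.poll(Duration.ofSeconds(1));            // join the group and receive assignments
            // Reset to the oldest retained offset: the consumer, not the broker, decides where to read.
            consumer.seekToBeginning(consumer.assignment());
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}
```

The broker keeps no per-consumer state beyond the committed offset, which is why such a replay is cheap.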
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Geo-Replication
Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
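The "semantic partition function" mentioned above can be supplied as a custom Partitioner. The class below is purely hypothetical (the "vip-" key prefix rule is invented for illustration):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical routing rule: keys starting with "vip-" always go to partition 0,
// everything else is spread over the remaining partitions by key hash.
public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key != null && key.toString().startsWith("vip-")) {
            return 0;
        }
        int hash = (key == null) ? 0 : key.hashCode();
        return numPartitions > 1 ? 1 + Math.floorMod(hash, numPartitions - 1) : 0;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

It would be registered on a producer via `props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipPartitioner.class.getName())`; without it, the default partitioner hashes the record key, so records with the same key always land in the same partition.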
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
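A small sketch of the group behaviour described above: two consumers started with the same group id split the topic's partitions between them (queue semantics), while giving each a different group id would deliver every record to both (publish-subscribe semantics). Broker address, topic name, and group names are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupDemo {
    public static void main(String[] args) {
        // Same group.id: the two instances divide the partitions between them.
        // Different group.id values: every record would be broadcast to both.
        startConsumer("group-A", "instance-1");
        startConsumer("group-A", "instance-2");
    }

    private static void startConsumer(String groupId, String name) {
        new Thread(() -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker
            props.put("group.id", groupId);
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic")); // assumed topic
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("%s got partition=%d offset=%d value=%s%n",
                                name, r.partition(), r.offset(), r.value());
                    }
                }
            }
        }, name).start();
    }
}
```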
Multi-tenancy
You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation
Guarantees
At a high-level Kafka gives the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
More details on these guarantees are given in the design section of the documentation.
Kafka as a Messaging System
How does Kafka's notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
For details about the Kafka's commit log storage and replication design, please read this page.
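The "wait on acknowledgement" behaviour is configured on the producer side. A minimal sketch (broker address and topic name assumed) that blocks until the write has been acknowledged by all in-sync replicas:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the write only counts as complete once it is fully replicated.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Calling get() blocks until the brokers have acknowledged the write.
            RecordMetadata md = producer.send(
                    new ProducerRecord<>("demo-topic", "key-1", "durable write")).get();
            System.out.printf("stored at partition=%d offset=%d%n", md.partition(), md.offset());
        }
    }
}
```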
Kafka for Stream Processing
It isn't enough to just read, write, and store streams of data, the purpose is to enable real-time processing of streams.
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
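A minimal Kafka Streams sketch of the idea: read a continual stream from one topic, transform each record, and write the result to another topic. The application id, broker address, and topic names are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream-app");   // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Continuously consume from the input topic, transform each value,
        // and produce the result to the output topic.
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```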
Putting the Pieces Together
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.
A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.
Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
(In other words, past and future data share the same programming model.)
Likewise for streaming data pipelines the combination of subscription to real-time events make it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably make it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
Hands-on Kafka practice:
About ZooKeeper:
Starting Kafka via ZooKeeper
ZooKeeper related: Curator is a wrapper API around ZooKeeper that provides richer and more concise APIs.
- Before starting Kafka, do a simple bit of configuration first
- Start ZooKeeper first, then start Kafka
Example: a producer produces messages and a consumer consumes them.
First edit the configuration file and apply the simple settings.
Start ZooKeeper
Start Kafka
Create a topic.
Create a producer
Open a consumer
OK, the two ends can now communicate.
Recap of Kafka's core concepts
Kafka command-line demo:
List the current topics: bin/kafka-topics.sh --list --zookeeper localhost:2181
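The same listing can be done from Java with the AdminClient. A small sketch, assuming a broker at localhost:9092 (newer client versions talk to the broker directly instead of going through ZooKeeper):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Print the name of every topic visible to this client.
            admin.listTopics().names().get().forEach(System.out::println);
        }
    }
}
```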
About Kafka partitions:
- Each partition is an ordered, immutable sequence of messages; new messages keep arriving and are continually appended to the end of the partition, which makes it effectively a structured commit log (similar to Git's commit log).
- Each message in a partition is assigned a sequential id value (the offset) that uniquely identifies every message within the partition.
What partitions are for:
- The message data of a partition is stored in log files, and messages within the same partition are strictly ordered by the order in which they were sent. A partition logically corresponds to a log: when a producer writes messages to a partition, it is actually writing to the log that backs that partition. The log is a logical concept that maps to a directory on disk; a log consists of multiple segments, and each segment corresponds to an index file and a data file.
- Partitions are what make Kafka servers horizontally scalable. Any single machine, physical or virtual, has an upper bound on its capacity; once that bound is reached it cannot grow further, i.e. vertical scaling is always limited by the hardware. With partitions, the messages of a topic can be spread across different Kafka servers (this requires a Kafka cluster), so when capacity runs out we simply add machines and create new partitions on them; in theory this gives unlimited horizontal scalability.
- Partitions also enable parallel processing: messages sent to a topic are distributed among that topic's partitions, so multiple partitions receive the messages and sending and processing happen in parallel (a sketch of creating such a multi-partition topic follows below).
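As referenced in the list above, here is a sketch of creating a topic with several partitions through the AdminClient; the topic name, partition count, and replication factor are arbitrary choices for illustration:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic's messages (and load) across brokers;
            // replication factor 1 is only suitable for a single-broker test setup.
            NewTopic topic = new NewTopic("partition-demo", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```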
Segments:
A partition is an ordered, immutable sequence of messages. The number of messages in one partition can be very large, so clearly they cannot all be kept in a single file. Therefore, similar to log4j's rolling logs, once the messages in a partition grow past a certain point the message file is rolled over: new messages are written to a new file, and when that file in turn grows large enough yet another file is started, and so on. Each of these data files is called a segment.
So physically a partition is made up of one or more segments, and each segment holds the actual message data.
The relationship between partitions and segments:
- Each partition is effectively one large file split into multiple equally sized segment data files. The number of messages in each segment is not necessarily the same (it depends heavily on message size, since different messages obviously take up different amounts of disk space). This makes old segment files easy to delete, which helps improve disk utilization.
- Each partition only needs to support sequential reads and writes. The lifetime of segment files is determined by the Kafka server configuration; for example, log.retention.hours=168 in server.properties controls how long they are retained.
The meaning and purpose of the four files in a partition directory:
- 0000000.index: the index file of a segment; it always appears paired with the 0000000.log data file described next. The .index suffix indicates that this is an index file.
- 0000000.log: the data file of a segment, which stores the actual messages in binary format. Segment files are named as follows: the first segment of a partition starts from 0, and each subsequent segment file is named after the offset of the last message in the previous segment, padded with zeros. Because the topic here contains only a few messages, there is just one data file.
- 0000000.timeindex: an index file based on message timestamps, used mainly when looking up messages by date or time, and also by time-based log rolling and time-based retention policies. It was only added in newer versions of Kafka; older versions do not have it. It complements the *.index file: *.index is an offset-based index, while *.timeindex is a timestamp-based index.
- leader-epoch-checkpoint: a cache file for the leader; it is an important file related to Kafka's HW (High Watermark) and LEO (Log End Offset).
Important Kafka script commands
Commonly used Kafka commands (for reference):
A first look at ZooKeeper
Consuming with multiple consumers
Topic deletion
Topics are not deleted immediately; check the logs. (Older versions do not delete the log files at all; they have to be removed manually.)
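A deletion triggered from the Java AdminClient behaves the same way: the topic is only marked for deletion and the segment files are cleaned up asynchronously. A sketch with an assumed broker and topic name:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // The broker marks the topic for deletion; the on-disk logs are removed
            // asynchronously (and only if delete.topic.enable=true on the broker).
            admin.deleteTopics(Collections.singleton("partition-demo")).all().get();
        }
    }
}
```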
Partitions and topics
Spring Boot integration with Kafka
Structure: create two projects, one producer and one consumer. This is MQ-oriented programming.
Steps (a combined code sketch follows after this list):
1. Add the dependencies
2. Create a message class
3. Create a producer and a consumer
4. Add the Kafka settings to application.yml
5. Create a KafkaController
6. Start ZooKeeper and Kafka
Send a message from the browser
Chrome extension: JSON Viewer
7. Rework the project to use the POST method
Use the command-line curl tool to simulate requests.
About curl: a small tool that ships with macOS
➜ ~ which curl
/usr/bin/curl
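Putting the steps above together, here is a minimal single-file sketch of what the finished project might look like. It assumes the spring-boot-starter-web and spring-kafka dependencies, a broker at localhost:9092 configured via spring.kafka.bootstrap-servers in application.yml, and made-up topic, group, and endpoint names:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Assumes spring-boot-starter-web and spring-kafka on the classpath, and
// spring.kafka.bootstrap-servers=localhost:9092 in application.yml.
@SpringBootApplication
public class KafkaDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(KafkaDemoApplication.class, args);
    }
}

@RestController
class KafkaController {
    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    // Producer side: POST a message body and forward it to the topic (topic name is an assumption).
    @PostMapping("/message/send")
    public String send(@RequestBody String message) {
        kafkaTemplate.send("demo-topic", message);
        return "sent";
    }

    // Consumer side: any record published to the topic arrives here.
    @KafkaListener(topics = "demo-topic", groupId = "springboot-demo")
    public void onMessage(String message) {
        System.out.println("received: " + message);
    }
}
```

With the application running, a request along the lines of `curl -X POST -H "Content-Type: text/plain" -d "hello kafka" http://localhost:8080/message/send` should publish a record that the listener then prints.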
MQ products currently on the market
- Kafka
- ActiveMQ
- RabbitMQ
- RocketMQ
- ZeroMQ
Do you need to learn every single one of them and study how to use each? No, that is a waste of time. Study one message queue product thoroughly; the core ideas are much the same.
Done for now. Some topics still need deeper study.
2020-02-06 22:40:21, finished this study session; the text will be polished once the images are uploaded.