Kafka Rebanlace次数过高问题

那年烟雨落申城
• 阅读 301

Kafka Rebanlace次数过高问题

环境: Kafka Server 2.6.x Kafka Client Java 2.8.2

缘起:

最近发现Kafka Rebalance次数着实有点多,一天达到了六十多次,感觉不太正常,于是查了下日志发现:

Offset commit cannot be completed since 
the consumer is not part of an active group for auto partition assignment; 
it is likely that the consumer was kicked out of the group.

大意是某个kakfa client提交offset失败,因为已经在分组中下线。

为什么会下线?

我们来了解下什么情况下会掉线,常见情况如下:

1. 心跳原因:

kafka在n次心跳未收到后认为这个kafka client已经离线,于是server端会踢下线,至于n次是多少次,需要计算,有两个参数,一个是heartbeat.interval.ms,代表多久一次心跳,默认是3000ms,也就是3秒,还有一个参数是session.timeout.ms,代表保持session的超时时间,默认10000ms,也就是10秒。n = session.timeout.ms / heartbeat.interval.ms,也就是说3次之后不到第四次就会被踢下线,至于为什么不是正好3倍,官网解释是heartbeat.interval.ms的值建议小于session.timeout.ms1/3,两个参数官网解释如下:

session.timeout.ms The timeout used to detect client failures when using Kafka's group management facility. The client sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this client from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.msand group.max.session.timeout.ms. Type: int Default: 10000 (10 seconds)

heartbeat.interval.ms The expected time between heartbeats to the consumer coordinator when using Kafka's group management facilities. Heartbeats are used to ensure that the consumer's session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances. Type: int Default: 3000 (3 seconds)

以上来自Kafka官网 https://kafka.apache.org/28/documentation.html#consumerconfigs

2. 拉取间隔原因

和这个原因有关的参数是max.poll.interval.ms,这个参数的意思是两次poll()操作之间如果超过了这个值,也会被服务端踢下线,默认300000ms,也就是300秒,5分钟。

max.poll.interval.ms The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member. For consumers using a non-null group.instance.id which reach this timeout, partitions will not be immediately reassigned. Instead, the consumer will stop sending heartbeats and partitions will be reassigned after expiration of session.timeout.ms. This mirrors the behavior of a static consumer which has shutdown. Type: int Default: 300000 (5 minutes)

以上来自Kafka官网 https://kafka.apache.org/28/documentation.html#consumerconfigs

定位

当时做性能优化的时候,这个kafka处理逻辑统计了时间于是找到了以下日志:

当前拉取了数据条数 10 耗时 411260ms thread: KafkaXxxReceiver-pool-3

处理10条数据居然用了411260ms,这是只是其中一条,通过模糊查询还找到了更多了超过300秒的数据,已经确认是这里的问题了。

优化思路

  1. 适当调大参数max.poll.interval.ms,或者调小每次拉取的消息数max.poll.records
  2. 因之前压测未出现此问题,需要进一步定位到底是哪一块用时较长,进行业务上的优化。
点赞
收藏
评论区
推荐文章
blmius blmius
3年前
MySQL:[Err] 1292 - Incorrect datetime value: ‘0000-00-00 00:00:00‘ for column ‘CREATE_TIME‘ at row 1
文章目录问题用navicat导入数据时,报错:原因这是因为当前的MySQL不支持datetime为0的情况。解决修改sql\mode:sql\mode:SQLMode定义了MySQL应支持的SQL语法、数据校验等,这样可以更容易地在不同的环境中使用MySQL。全局s
Wesley13 Wesley13
3年前
vsftpd虚拟用户配置
关于vsftpd与mysql结合的问题,参考了网上很多的方案,结果到最后,都是loginincorrect,很是纳闷(配置基本参考这个:http://www.howtoforge.com/vsftpd\_mysql\_debian\_etch\_p2,数据库创建参考别的,简单的那种)不小心发现每次数据库链接,mysql的id都会1,于是发现及时
Stella981 Stella981
3年前
MacOS VSCode 安装 GO 插件失败问题解决
0x00问题重现Installinggolang.org/x/tools/cmd/guruFAILEDInstallinggolang.org/x/tools/cmd/gorenameFAILEDInstallinggolang.org/x/lint/golintFAILEDInst
Wesley13 Wesley13
3年前
FLV文件格式
1.        FLV文件对齐方式FLV文件以大端对齐方式存放多字节整型。如存放数字无符号16位的数字300(0x012C),那么在FLV文件中存放的顺序是:|0x01|0x2C|。如果是无符号32位数字300(0x0000012C),那么在FLV文件中的存放顺序是:|0x00|0x00|0x00|0x01|0x2C。2.  
Stella981 Stella981
3年前
SpringBoot整合Redis乱码原因及解决方案
问题描述:springboot使用springdataredis存储数据时乱码rediskey/value出现\\xAC\\xED\\x00\\x05t\\x00\\x05问题分析:查看RedisTemplate类!(https://oscimg.oschina.net/oscnet/0a85565fa
Wesley13 Wesley13
3年前
mysql设置时区
mysql设置时区mysql\_query("SETtime\_zone'8:00'")ordie('时区设置失败,请联系管理员!');中国在东8区所以加8方法二:selectcount(user\_id)asdevice,CONVERT\_TZ(FROM\_UNIXTIME(reg\_time),'08:00','0
Wesley13 Wesley13
3年前
PHP创建多级树型结构
<!lang:php<?php$areaarray(array('id'1,'pid'0,'name''中国'),array('id'5,'pid'0,'name''美国'),array('id'2,'pid'1,'name''吉林'),array('id'4,'pid'2,'n
Easter79 Easter79
3年前
SpringBoot整合Redis乱码原因及解决方案
问题描述:springboot使用springdataredis存储数据时乱码rediskey/value出现\\xAC\\xED\\x00\\x05t\\x00\\x05问题分析:查看RedisTemplate类!(https://oscimg.oschina.net/oscnet/0a85565fa
Stella981 Stella981
3年前
Hibernate纯sql查询结果和该sql在数据库直接查询结果不一致
问题:今天在做一个查询的时候发现一个问题,我先在数据库实现了我需要的sql,然后我在代码中代码:selectdistinctd.id,d.name,COALESCE(c.count_num,0),COALESCE(c.count_fix,0),COALESCE(c
Wesley13 Wesley13
3年前
MySQL部分从库上面因为大量的临时表tmp_table造成慢查询
背景描述Time:20190124T00:08:14.70572408:00User@Host:@Id:Schema:sentrymetaLast_errno:0Killed:0Query_time:0.315758Lock_