Tracking Down a Eureka Server Cluster Restart Problem

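Problem

While restarting the Eureka Server cluster in production, we found that the order client failed when calling the distributed ID generation service:

    Caused by: com.netflix.client.ClientException: Load balancer does not have available server for client: IDG

In other words, the order service could no longer reach the IDG service.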

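Analysis

The Eureka Client cache is refreshed by a scheduled thread that performs an incremental (delta) update every 30 seconds, and Ribbon reads the service information from the Eureka Client's local cache every 30 seconds. The error above was raised by Ribbon, which means Ribbon had no information about the IDG service. Further debugging showed that the Eureka Client's local cache was empty. This points to the real problem: something goes wrong when the Eureka Client fetches the registry while the Eureka Server is restarting, or has just finished restarting, and then writes the result into its local cache.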

Tracing the problem

Checking the Eureka Client log

    1  2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - Got delta update with apps hashcode
    2  2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The total number of instances fetched by the delta processor : 0
    3  2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The Reconcile hashcodes do not match, client : UP_5_, server : . Getting the full registry
    4  2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG c.n.discovery.shared.MonitoredConnectionManager - Get connection: {}->http://server1:7010, timeout = 5000
    5  2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - [{}->http://server1:7010] total kept alive: 1, total issued: 1, total allocated: 2 out of 200
    6  2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - Getting free connection [{}->http://server1:7010][null]
    7  2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG org.apache.http.impl.client.DefaultHttpClient - Stale connection check

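The log fragment above is produced when the client's CacheRefreshExecutor (the cache refresh thread pool) runs its task:

Line 1: the client receives the hash code of the delta update (it is empty here).

Line 2: the delta contains 0 instances.

Line 3: after the delta is merged, the server-side hash code and the local client-side hash code do not match (client = UP_5_, server = ""), so a full fetch is required.

Lines 4-7: the full fetch is issued.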
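To make the mismatch on log line 3 concrete, here is a minimal sketch of how a reconcile hash code of the form `UP_5_` can be derived: count the instances per status and concatenate `STATUS_count_` in sorted order. The names `ReconcileHashSketch`, `reconcileHashCode`, and `needsFullFetch` are hypothetical and only model the behavior visible in the log; they are not the Eureka client's actual API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReconcileHashSketch {

    // Builds a hash string such as "UP_5_" from the status of every known instance.
    public static String reconcileHashCode(List<String> instanceStatuses) {
        Map<String, Integer> countsByStatus = new TreeMap<String, Integer>();
        for (String status : instanceStatuses) {
            countsByStatus.merge(status, 1, Integer::sum);
        }
        StringBuilder hash = new StringBuilder();
        for (Map.Entry<String, Integer> entry : countsByStatus.entrySet()) {
            hash.append(entry.getKey()).append('_').append(entry.getValue()).append('_');
        }
        return hash.toString();
    }

    // Models the client's decision: fall back to a full fetch when the two hashes differ.
    public static boolean needsFullFetch(List<String> clientStatuses, String serverHash) {
        return !reconcileHashCode(clientStatuses).equals(serverHash);
    }

    public static void main(String[] args) {
        // Five UP instances in the client's cache, as in the log above.
        List<String> clientView = Arrays.asList("UP", "UP", "UP", "UP", "UP");
        // A freshly restarted server has no instances, so its hash is the empty string.
        String serverHash = reconcileHashCode(Collections.<String>emptyList());
        System.out.println("client=" + reconcileHashCode(clientView) + ", server=" + serverHash);
        System.out.println("full fetch needed: " + needsFullFetch(clientView, serverHash));
    }
}
```

An empty server registry always yields an empty hash, so any client that still knows at least one instance will see a mismatch and request the full registry.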

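The cause is now clear: while the Eureka Server is restarting, its registry is empty, and the Eureka Client happens to fetch it at exactly that moment. Because the hash codes do not match, the client issues a full fetch and overwrites its local cache with the result. The local cache then holds wrong (empty) data, and service calls start to fail.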
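The chain from an overwritten cache to the Ribbon error can be sketched as below. This is only a simplified model: `CacheOverwriteSketch`, `fullFetch`, `chooseServer`, and the address `10.0.0.1:8080` are hypothetical and stand in for the real client cache and load balancer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheOverwriteSketch {

    // Models a full fetch: the local cache is replaced wholesale by whatever the server returns.
    public static Map<String, List<String>> fullFetch(Map<String, List<String>> serverRegistry) {
        return new HashMap<String, List<String>>(serverRegistry);
    }

    // Models the load balancer lookup: it fails when the cache has no instance for the service.
    public static String chooseServer(Map<String, List<String>> cache, String serviceName) {
        List<String> servers = cache.get(serviceName);
        if (servers == null || servers.isEmpty()) {
            throw new IllegalStateException(
                    "Load balancer does not have available server for client: " + serviceName);
        }
        return servers.get(0);
    }

    public static void main(String[] args) {
        Map<String, List<String>> cache = new HashMap<String, List<String>>();
        cache.put("IDG", Arrays.asList("10.0.0.1:8080")); // hypothetical instance address
        System.out.println("before restart: " + chooseServer(cache, "IDG"));

        // The restarted server answers the full fetch with an empty registry.
        cache = fullFetch(new HashMap<String, List<String>>());
        try {
            chooseServer(cache, "IDG");
        } catch (IllegalStateException e) {
            System.out.println("after restart: " + e.getMessage());
        }
    }
}
```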

Checking the Eureka Server configuration revealed the following problem:

    eureka:
      instance:
        hostname: server2
      client:
        serviceUrl:
          defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
        fetch-registry: false
        register-with-eureka: true   # register this server with the Eureka cluster

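fetch-registry = false means that when the Eureka Server registers with the Eureka cluster as a client, it does not pull the full registry. However, when the Eureka Server starts up in its server role, it reads the registry from its own local client (with register-with-eureka: true, the server is itself registered with Eureka as a client) and then registers those entries into its own server-side registry. For details, see "深入理解Eureka Server集群同步（十）" (Understanding Eureka Server Cluster Synchronization in Depth, part 10).

In other words, right after startup the Eureka Server's server-side registry is empty, and it can only fill in its data gradually through the cluster's subsequent renewal synchronization.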
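The effect of fetch-registry on startup can be sketched as follows, under the assumption (taken from the explanation above) that the server fills its registry from its own embedded client's local cache. `SyncUpSketch`, `localClientCache`, and `syncUp` are hypothetical names used for illustration only, and `ORDER` is a made-up service name next to the `IDG` service from this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SyncUpSketch {

    // Hypothetical stand-in for the embedded client's cache:
    // it is only populated from the peers when fetchRegistry is true.
    public static List<String> localClientCache(boolean fetchRegistry, List<String> peerRegistry) {
        return fetchRegistry ? new ArrayList<String>(peerRegistry) : new ArrayList<String>();
    }

    // Copies whatever the local client knows into the server-side registry
    // and returns the number of instances copied.
    public static int syncUp(List<String> localCache, List<String> serverRegistry) {
        serverRegistry.addAll(localCache);
        return localCache.size();
    }

    public static void main(String[] args) {
        List<String> peers = Arrays.asList("IDG", "ORDER");

        List<String> registryWithoutFetch = new ArrayList<String>();
        int without = syncUp(localClientCache(false, peers), registryWithoutFetch);
        System.out.println("fetch-registry=false -> synced " + without + " instances");

        List<String> registryWithFetch = new ArrayList<String>();
        int with = syncUp(localClientCache(true, peers), registryWithFetch);
        System.out.println("fetch-registry=true  -> synced " + with + " instances");
    }
}
```

With the cache empty, the sync step has nothing to copy, which matches the empty registry that the client saw right after the restart.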

Based on this, the configuration was changed to the following:

    eureka:
      instance:
        hostname: server2
      client:
        serviceUrl:
          defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
        fetch-registry: true
        register-with-eureka: true   # register this server with the Eureka cluster

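With fetch-registry set to true, the Eureka Server can load all registrations into its own node right at startup.

Concurrency testing showed that this configuration only lowers the probability of the problem and does not eliminate it, for the following reason: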

    protected void initEurekaServerContext() throws Exception {
        // ... (code omitted)
        // Sync registry data from the other nodes
        int registryCount = this.registry.syncUp();
        // Set the Eureka status to UP; this also starts a scheduled task that
        // evicts clients that have stopped sending heartbeats
        this.registry.openForTraffic(this.applicationInfoManager, registryCount);

        // ... (code omitted)
        EurekaMonitors.registerAllStats();
    }

    @Override
    public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
        // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
        // Expected number of renewals per minute
        this.expectedNumberOfRenewsPerMin = count * 2;
        // Minimum number of renewals per minute (the threshold)
        this.numberOfRenewsPerMinThreshold =
                (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
        logger.info("Got " + count + " instances from neighboring DS node");
        logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
        this.startupTime = System.currentTimeMillis();
        if (count > 0) {
            this.peerInstancesTransferEmptyOnStartup = false;
        }
        DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
        boolean isAws = Name.Amazon == selfName;
        if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
            logger.info("Priming AWS connections for all replicas..");
            primeAwsReplicas(applicationInfoManager);
        }
        logger.info("Changing status to UP");
        // Set the instance status to UP
        applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
        // Start the scheduled task (runs every 60 seconds by default) that
        // evicts instances whose leases have expired
        super.postInit();
    }

At first glance the code above looks fine, but consider the following situation, where both happen at the same time:

    Eureka Client    performs an incremental (delta) sync
    Eureka Server    is syncing registry data from the cluster nodes

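If the Eureka Client pulls data before the Eureka Server has finished syncing from the other nodes, the client receives incomplete or empty data. That still causes the problem described above, only with a lower probability.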

Complete solution

Modify the configuration file

    eureka:
      instance:
        hostname: server1
        initial-status: STARTING
      client:
        serviceUrl:
          defaultZone: http://server2:7011/eureka/,http://server3:7012/eureka/
        fetch-registry: true
        register-with-eureka: true

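Adding eureka.instance.initial-status: STARTING means that the Eureka Server's own instance starts in the STARTING state rather than UP, so it does not announce itself as available right after startup. It only becomes UP after the data sync has finished.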

Custom filter

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Look up the current status of this Eureka Server instance
        InstanceInfo myInfo = ApplicationInfoManager.getInstance().getInfo();
        InstanceStatus status = myInfo.getStatus();
        // Reject every HTTP request until the server has switched to UP
        if (status != InstanceStatus.UP && response instanceof HttpServletResponse) {
            throw new RuntimeException("Eureka Server status is not UP, do not provide service");
        }
        chain.doFilter(request, response);
    }

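The custom filter makes the Eureka Server refuse to serve requests while its status is not UP. The status only changes to UP once the server has finished starting and finished syncing data, which prevents Eureka Clients from fetching incomplete data.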
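Why rejecting requests protects the client can be sketched with a small model. It assumes that a client whose fetch fails keeps its previous cache instead of clearing it. `StatusGateSketch`, `Server`, `Client`, and their members are hypothetical names and do not correspond to Eureka classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatusGateSketch {

    public enum Status { STARTING, UP }

    // Hypothetical server model: refuses registry requests until its status is UP.
    public static class Server {
        public Status status = Status.STARTING;
        public final List<String> registry = new ArrayList<String>();

        public List<String> fetchRegistry() {
            if (status != Status.UP) {
                throw new IllegalStateException("Eureka Server status is not UP");
            }
            return new ArrayList<String>(registry);
        }
    }

    // Hypothetical client model: a failed fetch leaves the previous cache untouched.
    public static class Client {
        public List<String> cache;

        public Client(List<String> initialCache) {
            this.cache = new ArrayList<String>(initialCache);
        }

        public boolean refresh(Server server) {
            try {
                cache = server.fetchRegistry();
                return true;
            } catch (IllegalStateException e) {
                return false; // keep serving from the old cache
            }
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        Client client = new Client(Arrays.asList("IDG"));

        // Server is still STARTING with an empty registry: the fetch is rejected.
        System.out.println("refreshed=" + client.refresh(server) + ", cache=" + client.cache);

        // After the sync has finished, the server switches to UP and serves real data.
        server.registry.add("IDG");
        server.status = Status.UP;
        System.out.println("refreshed=" + client.refresh(server) + ", cache=" + client.cache);
    }
}
```

In this model the client never sees the empty registry, because the server only answers once it has both synced its data and switched to UP.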

    @Bean
    public CustomerStatusFilter statusFilter() {
        return new CustomerStatusFilter();
    }

    @Bean
    public FilterRegistrationBean someFilterRegistration() {
        FilterRegistrationBean registration = new FilterRegistrationBean();
        registration.setFilter(statusFilter());
        registration.addUrlPatterns("/*");
        return registration;
    }

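Drawback: with this filter in place, if the whole cluster is down and the nodes are started one at a time, it takes 150 seconds by default before a node can serve requests normally.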

This article was shared from the WeChat official account sharedCode (sharedCode).