This is a technical problem that, from the day it happened until now, I have not fundamentally solved, and it remains a big open question in my mind. Written: Sunday, 2020-06-14, 14:00
Update: the problem has since been resolved. The resolution process and the solution are at the very end of this article; that part was written at 2020-06-20 15:26.
Problem 1: Around 10:00 AM on June 4, 2020 (peak hours), one node in cluster A, which I will call node 37, saw its CPU climb to 100% and stay there. The query queue backed up and a large number of query requests were rejected. Scaling out resolved the immediate problem (we later added five more nodes), but even after the expansion node 37's everyday CPU usage remained far higher than that of the other nodes, roughly 10x. In other words, the other nodes sat mostly idle watching from the sidelines, though at least the CPU no longer got pegged at 100% after the expansion.
Problem 2: Around 10:00 AM on June 12, 2020 (peak hours), one node in cluster B, which I will call node 2, saw its CPU climb to 100% and stay there. The query queue backed up and a large number of query requests were rejected. Scaling out resolved the immediate problem, but node 2's everyday CPU usage remained far higher than that of the other nodes, again roughly 10x. We traced a sharp surge in query volume to a third-party interface. Since time was tight, we put a circuit breaker on the third-party query interface: traffic above N QPS was rejected with an error response straight away. Node 2's CPU immediately dropped back to normal, while the other nodes stayed idle the whole time.
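For illustration, here is a minimal sketch of the kind of per-second QPS cut-off we applied to the third-party query interface. The class name, the threshold handling, and the wiring shown in the trailing comment are assumptions for this example, not our production circuit breaker:

public class QpsCircuitBreaker {

    private final long maxQps;     // the "N" above which requests are rejected
    private long currentSecond;    // epoch second of the current window
    private long countInWindow;    // requests counted in the current window

    public QpsCircuitBreaker(long maxQps) {
        this.maxQps = maxQps;
    }

    // Fixed one-second window: returns false once more than maxQps requests
    // have arrived in the current second, so the caller can reject immediately.
    public synchronized boolean tryAcquire() {
        long nowSecond = System.currentTimeMillis() / 1000;
        if (nowSecond != currentSecond) {
            currentSecond = nowSecond;   // a new one-second window begins
            countInWindow = 0;
        }
        return ++countInWindow <= maxQps;
    }
}

// Hypothetical wiring at the query entry point:
//   if (!breaker.tryAcquire()) {
//       return errorResponse("third-party query traffic is rate limited");
//   }
//   return queryElasticsearch(request);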
For Problem 1, during our investigation we looked at the threads consuming very high CPU inside the ES process on node 37 at the time:
"elasticsearch[logsearch-datanode-36][index][T#14]" #273 daemon prio=5 os_prio=0 tid=0x00007ef8e8084800 nid=0x6f62 runnable [0x00007ef4d37eb000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582)
- locked <0x00000000c3c36900> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558)
at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641)
at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143)
at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136)
at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae17720> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#13]" #272 daemon prio=5 os_prio=0 tid=0x00007ef8f4085000 nid=0x6f61 runnable [0x00007ef4d38ec000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x00000001619253d8> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae10650> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#12]" #271 daemon prio=5 os_prio=0 tid=0x00007ef8f8083800 nid=0x6f60 runnable [0x00007ef4d39ed000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x0000000156f6a138> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae1b570> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#11]" #270 daemon prio=5 os_prio=0 tid=0x00007ef8e0003000 nid=0x6f5f runnable [0x00007ef4d3aee000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467)
at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298)
at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184)
at java.lang.ThreadLocal.get(ThreadLocal.java:170)
at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae15d78> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#10]" #269 daemon prio=5 os_prio=0 tid=0x00007ef910083800 nid=0x6f5e runnable [0x00007ef4d3bee000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582)
- locked <0x000000016c7c8a98> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558)
at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641)
at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143)
at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136)
at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae19da0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#9]" #267 daemon prio=5 os_prio=0 tid=0x00007ef8f0082800 nid=0x6f5c runnable [0x00007ef4d3df0000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x0000000156f71bf8> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae178c8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#8]" #266 daemon prio=5 os_prio=0 tid=0x00007ef8e8082800 nid=0x6f5b runnable [0x00007ef4d3ef1000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae19f48> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#7]" #265 daemon prio=5 os_prio=0 tid=0x00007ef8f4083000 nid=0x6f5a runnable [0x00007ef4d3ff3000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467)
at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298)
at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184)
at java.lang.ThreadLocal.get(ThreadLocal.java:170)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.apache.lucene.index.CodecReader.getNumericDocValues(CodecReader.java:143)
at org.apache.lucene.index.FilterLeafReader.getNumericDocValues(FilterLeafReader.java:430)
at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.<init>(PerThreadIDAndVersionLookup.java:77)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:83)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x00000001aa0732e0> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae1b718> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#6]" #264 daemon prio=5 os_prio=0 tid=0x00007ef8f8082000 nid=0x6f59 runnable [0x00007ef4d40f4000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x00000000c3c36b70> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae15f20> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[logsearch-datanode-36][index][T#5]" #263 daemon prio=5 os_prio=0 tid=0x00007ef8e0001000 nid=0x6f58 runnable [0x00007ef4d41f5000]
java.lang.Thread.State: RUNNABLE
at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
at java.lang.ThreadLocal.get(ThreadLocal.java:163)
at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
- locked <0x0000000156f69b08> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000008ae1a0f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
At first I suspected the ThreadLocalMap inside ThreadLocal might be the culprit, so I went back and brushed up on how ThreadLocalMap works. When a thread stores a thread-local variable, the map allocates an entry array for that thread; the ThreadLocal's hash value determines the slot in the array, and on a collision the map probes the next slot.
That is the set path. On get, the map likewise uses the ThreadLocal's hash to locate the slot and, on a collision, walks forward to the next slot. If it encounters an entry whose key is already null, that ThreadLocal has been garbage collected, so the map performs a cleanup pass, expunging stale entries as it scans forward. This is the so-called heuristic cleanup: there is no dedicated thread doing it; stale entries are only reclaimed when a lookup happens to run into them.
My original suspicion was that this is where things could go wrong: with too many collisions, or once the array grows large (it gets resized when collisions pile up), every collision means probing further and further down the array in a loop, which burns CPU. Could the array be nearly, but not quite, full, so that lookups keep looping to find the right ThreadLocal? To verify this I started 20 threads locally and ran roughly 4 billion operations. For every 10 million get/set operations, only three or four took longer than 100 ms, and even those stayed around 200 ms, so in theory this has no impact on performance.
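For reference, a minimal sketch of the kind of local stress test described above. The 20 threads and the 100 ms threshold follow the description, while the per-thread operation count is scaled down and the harness itself (class and variable names, how slow operations are counted) is an assumption rather than the original test code:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadLocalStressTest {

    static final int THREADS = 20;
    static final long OPS_PER_THREAD = 10_000_000L; // scaled down from the ~4 billion ops of the real run
    static final AtomicLong slowOps = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(() -> {
                for (long i = 0; i < OPS_PER_THREAD; i++) {
                    long start = System.nanoTime();
                    // A fresh ThreadLocal each iteration; remove() is deliberately not called,
                    // so stale entries pile up in the per-thread ThreadLocalMap and have to be
                    // expunged heuristically during later set/get probing.
                    ThreadLocal<Long> tl = new ThreadLocal<>();
                    tl.set(i);
                    tl.get();
                    if (TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) >= 100) {
                        slowOps.incrementAndGet(); // count individual operations slower than 100 ms
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("operations taking >= 100 ms: " + slowOps.get());
    }
}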
Problem 2:
Later we wondered whether data imbalance might be the cause, so we checked the amount of data stored on each shard: the sizes were almost identical, differing by only about 1 GB, which is negligible.
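For completeness, this per-shard size comparison can be pulled from the _cat/shards API; below is a rough sketch of such a check in Java, where the host and port are placeholders for one of the cluster's HTTP endpoints:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShardSizeCheck {
    public static void main(String[] args) throws Exception {
        // One line per shard: index name, shard number, primary/replica, store size, node.
        URL url = new URL("http://es-node:9200/_cat/shards?v&h=index,shard,prirep,store,node");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // eyeball whether store sizes differ much across nodes
            }
        }
    }
}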
We then suspected system kernel parameter settings: that machine's CPU was configured in power-saving mode, so we switched it to high-performance mode, but that did not change much either.
At this point we found another oddity: inbound traffic was the same on all nodes, but that one machine's outbound traffic was 10 times that of the other nodes.
Could some query's results be concentrated on node 2, with the relevant shards happening to live there and the result set being extremely large? A rough look did not turn up any single query with an unusually huge result volume.
In the end we suspected it might be a machine-level issue: all the other machines were fine, and only those two nodes in those two clusters showed the problem.
Both clusters run a fairly old ES version: 2.3.4.
The machines hosting the problem nodes are physical machines with high specs.
Recording this for now; I will come back to how to solve it later.
The problem has been resolved. Written at 2020-06-20 15:26.
On the night of June 16 we took machine 36, the one with the high CPU, offline, and the cluster became stable and healthy.
The decommissioned physical machine 36 actually has very high specs on paper, with more than 60 GB of memory and 32 CPU cores, yet the cluster recovered as soon as this high-CPU node was removed. In the other cluster we likewise took the CPU-spiking node 02 offline, and that cluster as a whole is now very stable.
My guesses at the cause: first, these two nodes have a hardware or OS problem; second, ES 2.3.4 handles heterogeneous machines within one cluster poorly (i.e., nodes with differing hardware configurations in the same cluster).
Full review of what actually happened
2020-06-04: the courier cluster ran into excessive CPU load; both node 37 and node 36 showed very high CPU while the other nodes sat idle.
Impact from 10:00 to 12:00 on 2020-06-04: parcels had already been deposited into lockers, but FC管家 could not find the records. No serious impact; writes to ES were just very slow and Kafka messages piled up, and once things recovered the affected backlog was written through automatically.
Around 13:16 on 2020-06-04, node 37 suddenly dropped back to a normal level, while node 36's CPU load stayed high.
Around 14:30 on 2020-06-04, node 36 instantly returned to a normal level.
Around 17:10 on 2020-06-04, node 36's CPU rose above 90% again; node 37 was slightly elevated but still healthy.
Impact from 17:10 to 18:30 on 2020-06-04: parcels had already been picked up, but FC管家 still showed them as not yet picked up. No serious impact; writes to ES were just very slow and Kafka messages piled up, and once things recovered the affected backlog was written through automatically.
Around 18:55 on 2020-06-04, node 36's CPU dropped instantly from its peak back to a normal level.
At 20:00 on 2020-06-04 we started the expansion, adding 6 physical machines; it was completed around 22:40 that night.
At 10:00 on 2020-06-05, everything was normal. Node 36 was slightly elevated but within a normal range, with CPU load around 20%, noticeably higher than the other nodes.
Around 10:00 on 2020-06-12, the post_consumer_data_02 node of the consumer cluster, which we call node 189, started to show rising CPU load.
Around 10:20 on 2020-06-12, customer service and the company-wide group reported problems with WeChat Pay Score orders. Cause: WeChat Pay Score queries use index data from the consumer cluster.
Around 10:30 on 2020-06-12, the open platform reported that many queries had become slow or even returned no results. R&D traced it to a sharp rise in query volume from a third-party (Huawei) interface, whose query traffic goes into the consumer cluster.
Around 10:45 on 2020-06-12, we applied an emergency circuit breaker to the highest-traffic query interface that this third party calls on our side.
Around 10:50 on 2020-06-12, node 189's CPU fell instantly from 99% to 8% (at 10:52) and business returned to normal; later problem reports only reflected a reporting lag, as the system had in fact already recovered.
At 11:00 on 2020-06-12, we discussed query-volume limits with the third party, but the cause of node 189's high CPU had still not been found.
2020-06-13 (Saturday): kept the third-party interface circuit-broken and rate-limited.
2020-06-14 (Sunday): kept the third-party interface circuit-broken and rate-limited.
Evening of 2020-06-15 (Monday): if we kept the rate limit in place, third parties such as PDD and HW would start complaining, so that evening we switched the third-party query interface from the consumer cluster to the courier cluster. Since the courier cluster had already been expanded on the night of 2020-06-04, we routed the traffic to this more powerful cluster and lifted the rate limit.
At 10:19 on the morning of 2020-06-16 (Tuesday), CPU on node 36 of the courier cluster started climbing while the other nodes sat idle. The customer-service system, the ticketing system, and FC管家 became laggy and slow when querying delivery information, or returned no results.
At 11:30 on 2020-06-16, we rate-limited the third-party query interface and node 36's CPU recovered instantly.
Around 13:04 on 2020-06-16, we relocated shards off node 37, moving two of its shards to other nodes; the effect was not very noticeable.
At 13:44 on 2020-06-16, we removed the traffic limit on the third-party query interface.
Around 15:37 on 2020-06-16, node 36's CPU started climbing again and fluctuated wildly. We discussed taking all virtual-machine nodes out of the courier ES cluster and removing the badly misbehaving node 36.
Around 23:50 on 2020-06-16, we started preparing the courier ES cluster migration: remove all virtual-machine nodes and remove problem node 36.
At 04:27 on 2020-06-17, the courier cluster was running entirely on physical machines, with 14 data nodes, 2 master nodes, and 2 client nodes; machine 36 had been removed from the courier ES cluster.
At 10:00 on 2020-06-17, the courier cluster was normal during peak hours. Node 37 is still somewhat weaker than the other nodes, and with 36 gone it is now the worst one, but it stays in a healthy, controllable state.
Around 22:55 on the evening of 2020-06-17, we took node 189 of the consumer cluster, the one with the earlier problem, offline.
From 2020-06-18 until now, the clusters have remained stable under monitoring. CPU load on node 37 of the courier cluster is still roughly twice that of all the other nodes, but it stays in a healthy, controllable state.
Conclusion:
Node 36 of the courier cluster and node 189 of the consumer cluster have a system- or machine-level problem. Node 37 has a similar issue, but it is milder compared with the worst ones.