
An incident caused by not following the official recommendations

A write-up of
the analysis, troubleshooting, and handling of an Elasticsearch cluster incident

Background

The Elasticsearch cluster involved in the incident runs on 7 machines:

  • 10.163.204.193
  • 10.163.204.194
  • 10.163.204.195
  • 10.163.220.73
  • 10.163.220.74
  • 10.163.220.220
  • 10.163.220.221

Machines 193, 194, and 195 have identical specs:

  • CPU: 32 cores
  • Memory: 128 GB
  • Disk: 3 × 4 TB

System disk mounted separately: 40 GB

Machines 73, 74, 220, and 221 have identical specs:

  • CPU: 32 cores
  • Memory: 128 GB
  • Disk: 10 TB

System disk mounted separately: 50 GB

All 7 machines use Alibaba Cloud ultra disks (高效云盘), see https://help.aliyun.com/zh/ecs/user-guide/disks-2,
which means the maximum throughput (read + write combined) is capped at 140 MB/s.
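
To see how close the data disks actually run to that 140 MB/s ceiling, per-device throughput can be sampled with iostat (a minimal sketch, assuming the sysstat package is installed; device names will vary):

iostat -x 1 3
# shows rkB/s + wkB/s per device plus %util; a data disk pinned near ~140000 kB/s
# with %util close to 100% means the cloud-disk cap, not the node itself, is the limit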

The problem

Because money is tight, the deployment across these machines is rather tangled: the 7 machines host one Kafka cluster, one ZooKeeper cluster, and two Elasticsearch clusters.
Kafka and ZooKeeper run on 193, 194, and 195;
each of the two Elasticsearch clusters has 7 nodes (i.e. every one of the 7 machines runs an instance from both Elasticsearch clusters);
and the 3 disks on 193, 194, and 195 are shared by Kafka, ZooKeeper, and both Elasticsearch clusters.

Before this incident there had already been one episode of Kafka write latency; analysis showed that the two Elasticsearch clusters and the Kafka cluster shared a single disk, saturating its IO and delaying Kafka writes.

So the 3 disks on 193, 194, and 195 were reassigned as follows (a sketch of the resulting layout follows the list):

  • one 4 TB disk dedicated to Kafka and ZooKeeper
  • the other two 4 TB disks shared by the two Elasticsearch clusters (one of the two clusters holds less than 10 GB of data, so it can be ignored)
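
A sketch of what the resulting layout looks like in elasticsearch.yml on 193/194/195 after the reassignment (the mount points below are illustrative, not the real paths):

# elasticsearch.yml on 193/194/195 (illustrative paths)
# /data1 is left entirely to Kafka/ZooKeeper; this ES instance only uses the other two disks
path.data:
  - /data2/es-business-log
  - /data3/es-business-log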

That wraps up the backstory; now for the problems that hit us this time:

  • At 5 PM on the 23rd, the Elasticsearch cluster went down
    • the cluster had no master
  • At 9 AM on the 25th, the Elasticsearch cluster started rebalancing shards again

Analysis

The problem on the 23rd was urgent, so the immediate action was to restart the whole cluster and get service back first.

The no-master error logs on 193 at the time:

{"type": "server", "timestamp": "2023-10-23T09:22:08,955Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "business-log", "node.name": "es-b-193", "message": "master not discovered or elected yet, an election requires at least 4 nodes with ids from [hYrrmhLHTx-QHoDmZ2wATg, jaNLhd1eT6SUYcLJkHpE1Q, 1VQFmt9jQ-6d7fCMjP-vnQ, uzGdH3VeRbOlcDIWeKdgIw, PhjlCea6TNKh4rdZPyPDkA, ZFQdNP4HSgCtPCnMfDdopw, 1WRaRyU-SuCPn8qbz3e4hg], have discovered [{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{10.163.204.194}{10.163.204.194:9301}{cdfhilmrstw}, {es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{10.163.220.220}{10.163.220.220:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{10.163.204.195}{10.163.204.195:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{10.163.220.74}{10.163.220.74:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}] which is a quorum; discovery will continue using [10.163.204.194:9301, 10.163.204.195:9301] from hosts providers and [{es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{10.163.220.220}{10.163.220.220:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}, {es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{10.163.220.74}{10.163.220.74:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{10.163.204.194}{10.163.204.194:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{10.163.204.195}{10.163.204.195:9301}{cdfhilmrstw}] from last-known cluster state; node term 43, last-accepted version 397615 in term 43", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-23T09:22:11,119Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "business-log", "node.name": "es-b-193", "message": "path: /_cat/nodes, params: {h=ip,name,heap.percent,heap.current,heap.max,ram.percent,ram.current,ram.max,node.role,master,cpu,load_1m,load_5m,load_15m,disk.used_percent,disk.used,disk.total}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" ,


org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master]
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179)
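
The first log line above means an election requires a quorum: at least 4 of the 7 master-eligible node IDs in the voting configuration. Once a master is back, the current master and voting configuration can be checked like this (a sketch, using the 7.x APIs):

GET _cat/master?v
GET _cluster/state?filter_path=metadata.cluster_coordination.last_committed_config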

After the restart, going through the logs, the main symptom was node 193 being repeatedly added and removed:

added {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397620, reason: ApplyCommitRequest{term=43, version=397620, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}

removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397621, reason: ApplyCommitRequest{term=43, version=397621, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}

which kept the shards on 193 rebalancing. Unfortunately the logs on 193 had been lost for some unknown reason; the earliest logs we could find were from after 5:21 PM on the 23rd. The log-based investigation stalled here.

Looking back at the per-node monitoring for the period leading up to the incident, CPU, memory, disk, and write volume all showed no abnormal fluctuations. The only anomaly we did find:

[Figure: node monitoring, showing an anomaly in the generic thread pool]

So what is the generic thread pool for?

=> It is used for generic operations (for example, background node discovery), and its thread pool type is scaling, i.e. it grows and shrinks dynamically.
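
The per-node state of the generic pool can be sampled with the cat thread-pool API (a sketch); a persistently high active/queue count there would corroborate the anomaly in the chart above:

GET /_cat/thread_pool/generic?v&h=node_name,name,active,queue,rejected,type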

So far this only confirmed that inter-node discovery was having trouble. But what exactly caused the cluster to go down?

Fast forward to this morning, around 9 AM on the 25th: we noticed the cluster had started rebalancing shards again. Via

GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason

we could see that the reason shards were being relocated was node-left. Checking the logs on the master node's machine:

root@jiankunking-es-02:~# docker logs -f 1f741600dbcc |grep node-left
{"type": "server", "timestamp": "2023-10-23T09:33:18,042Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 399161, delta: removed {{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:52:35,140Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 404805, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:53:24,564Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404807, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:53:30,879Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404810, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
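
Besides unassigned.reason from _cat/shards, the allocation explain API gives the detailed story for a single shard (a sketch; the index name and shard number are placeholders):

GET _cluster/allocation/explain
{
  "index": "some-log-index",
  "shard": 0,
  "primary": true
}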

In theory 193/194/195 differ from the other 4 machines only in having more disks attached, so they should perform better, not worse. Why were they the ones dropping off?

Going back over the node logs again, we found large numbers of timeouts on the inter-node connections.

Logs on the master node (194):

{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758574}{false}{false}{false}}] of size [21075] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758608}{false}{false}{false}}] of size [1992] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758614}{false}{false}{false}}] of size [5563] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758686}{false}{false}{false}}] of size [9103] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758737}{false}{false}{false}}] of size [12666] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA"  }

Logs on an ordinary node (193):

{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855747}{false}{false}{false}}] of size [3828] on [Netty4TcpChannel{localAddress=/10.163..204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855763}{false}{false}{false}}] of size [14563] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855795}{false}{false}{false}}] of size [12940] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{73855813}{false}{false}{false}}] of size [10721] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw"  }

Seeing this, could the network or the NIC be saturated?

First check the bandwidth:

root@jiankunking-es-02:~# ethtool eth0 | grep Speed
        Speed: Unknown!
root@jiankunking-es-02:~# 
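
ethtool cannot report the link speed on this virtualized NIC, so as a cross-check the per-interface throughput can also be sampled with sar (a sketch, assuming the sysstat package is installed):

sar -n DEV 1 5    # rxkB/s / txkB/s per interface, to compare against the NIC's rated limit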

Then check the real-time NIC traffic:

iftop -n

[Figure: iftop output]

We checked with the colleague responsible for the machines and got the following information:

  • The bottleneck should be judged against the NIC limit, not the link between two servers => in other words, the inter-server network bandwidth is far larger than a single machine's NIC, so the server NIC is the more likely bottleneck
  • The server NIC limit is: internal bandwidth 10.00 Gbit/s => so the upper bound on a server's total NIC traffic is 10/8 ≈ 1.25 GB/s

We also asked for the Alibaba Cloud monitoring charts.

[Figure: Alibaba Cloud NIC monitoring around the node-left events on the 25th]

The Alibaba Cloud monitoring shows the NIC was nowhere near full at that time (inbound + outbound reached only about 10% of the limit). Let's also look at the Alibaba Cloud monitoring for the evening of the 23rd:

[Figure: Alibaba Cloud NIC monitoring for the evening of the 23rd]

Note that on Alibaba Cloud, even though the NIC has a hard limit, short bursts above it are tolerated (not that it mattered here, since the NIC was never full).

So if the network isn't the problem, where is it?

Fine, let's pick a window in which a node was being repeatedly added/removed and capture packets on the master (193).
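
The capture itself was roughly along these lines (a sketch; eth0 matches the interface checked with ethtool above, 9301 is the ES transport port seen in the logs, and the output file name is arbitrary):

tcpdump -i eth0 port 9301 -w es-transport.pcap    # capture ES transport traffic for offline analysis in Wireshark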

[Figure: packet capture on the master]

The capture shows the following:

  • The network connection itself was healthy: connection established, data sent, connection closed
  • It was the master node that actively closed the connection

But why the master closed the connection remained unclear. At this point it felt like the problem was not the network but something inside Elasticsearch itself.

More Googling turned up this issue: https://github.com/elastic/elasticsearch/issues/67873

The various error messages were not exactly the same, but the symptoms matched.

[Figure: screenshot of the GitHub issue]

The issue points out that too many shards can have several adverse effects on a cluster.

We went back and looked at the cluster's shard monitoring:
[Figure: cluster shard count trend]

The shard count had been climbing steadily and the whole cluster had reached 60,000+. Maybe we should also try deleting some historical indices?
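
A sketch of checking the shard count and trimming historical indices (the index pattern business-log-2023.07.* is hypothetical, and deleting by wildcard assumes destructive wildcard operations are still allowed on this cluster):

GET _cluster/health?filter_path=active_shards,unassigned_shards
DELETE /business-log-2023.07.*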

At that point node 74 kept getting added and removed. I manually restarted node 74 and tried deleting the July and August indices; when the cluster had recovered to 70%+, another node dropped out.

We then also deleted the September indices plus those for October 1st through 6th, and this time the cluster recovered fully (by then the shard count had fallen from 60,000 to 40,000).
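
The recovery percentage mentioned above corresponds to active_shards_percent in the health output, which can be watched like this (a sketch):

GET _cat/health?v&h=epoch,status,active_shards_percent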

The problem was dealt with for the moment; after a period of observation no further node-left events appeared, but the root cause was never found. (For instance, after restarting the cluster on the 23rd the shard count was also close to 60,000, yet that restart succeeded, whereas on the 25th recovery kept failing.)

If you have references or best practices on this topic, feel free to leave a comment.

Summary

The biggest lesson I took from this incident is that the official recommendations really are worth consulting:
How many shards should I have in my Elasticsearch cluster?

Mainly because we need to keep costs down, as long as the cluster isn't falling over, our usual attitude toward the official recommendations is this:

[Image: our usual attitude toward official recommendations]

In fact, for shard size, just as for the recommended data volume per node, we never followed their advice. For example, why did this investigation never suspect that 30 TB of data was too much? Because we had previously been running with 40 TB+.
[Figure: official recommendations on shard size and data volume per node]
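
For reference, the sizing guidance current for 7.x suggested keeping to roughly 20 shards or fewer per GB of heap. With about 31 GB of heap per node (ml.max_jvm_size ≈ 33285996544 bytes in the logs above), that works out to roughly:

7 nodes × 31 GB heap × 20 shards/GB ≈ 4,340 shards

which is more than an order of magnitude below the 60,000+ shards this cluster was carrying.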

A few more points I'd like to share after this incident:

  • Different clusters should be isolated from each other at the resource level, i.e. don't share machines between them
  • Don't run everything right at its limits
    • For example, the disks under our ES cluster have always been the bottleneck; whenever shard rebalancing kicks in, disk IO is pegged, which then drags down write throughput and causes all sorts of other problems
  • Nodes with the same role in a cluster should have identical specs; in our ES cluster the disk sizes differ, so once a disk hits its watermark the load across the cluster becomes unbalanced (see the sketch after this list)
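
On that last point, the disk watermarks that trigger this rebalancing can be inspected like this (a sketch; filter_path just trims the response down to the relevant settings):

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk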

