
K8S Clusters: ETCD Cluster Monitoring

### Production ETCD cluster monitoring

Core metrics

  • etcd service liveness

    up{job=~"kubernetes-etcd.*"}==0

    Note: up==0 means the scrape target is down or unreachable.

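  • Whether an etcd member has lost its leader

    etcd_server_has_leader{job=~"kubernetes-etcd.*"}==0

    Note: this value should be 1 on every instance. Otherwise the member may have left the cluster. It is best to intervene before more than half of the members are in this state.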

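  • Leader changes in the past hour

    increase(etcd_server_leader_changes_seen_total{job=~"kubernetes-etcd.*"}[1h]) >3

    Note: this metric is a counter, so it only ever increases. Monitor how fast it grows: frequent leader changes suggest that the cluster is overloaded or that the network between members is unstable.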

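  • Failed proposals

    rate(etcd_server_proposals_failed_total{job=~"kubernetes-etcd.*"}[15m])!=0

    Note: this metric is also a counter. A client write can be thought of as a "proposal" that the etcd members in the cluster have to vote on. A non-zero value means some proposals were not committed. If this happens often, a leader election is probably failing or more than half of the members are offline.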

  • Ratio of failed HTTP requests over 5 minutes (tentative)

    sum by(method) (rate(etcd_http_failed_total{job=~"kubernetes-etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~"kubernetes-etcd.*"}[5m]))> 0.05

  • Leader changes in the past day

    changes(etcd_server_leader_changes_seen_total{job=~".*"}[1d])>1

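  • Disk persistence latency (backend commit)

    histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*"}[5m]))>0.5

    Note: etcd's durability guarantees rely on the WAL and on snapshots, and both depend entirely on disk I/O. A slow disk severely drags down etcd under heavy load, so SSDs are recommended over spinning disks in production. The 99th percentile of etcd_disk_backend_commit_duration_seconds_bucket is a measure of disk performance. If it is only a few milliseconds, etcd is in good health. This histogram covers backend commits. The latency of sequential WAL writes is reported separately by etcd_disk_wal_fsync_duration_seconds_bucket.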

  • Database size as a percentage of the backend quota

    (etcd_mvcc_db_total_size_in_bytes{}/etcd_server_quota_backend_bytes{}) * 100>80
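
The note on etcd_server_has_leader recommends stepping in before a majority of members is affected. As a rough sketch, a quorum-level rule could compare the number of reachable members with the majority size. The group name, alert name, `for` duration, and `severity` value below are illustrative assumptions, and the expression assumes every etcd member is scraped under a job matching `kubernetes-etcd.*`.

```yaml
groups:
  - name: etcd-quorum-sketch            # hypothetical group name
    rules:
      - alert: EtcdInsufficientMembers  # hypothetical alert name
        # Fires when fewer members are reachable than a majority requires.
        # For 3 members: (3 + 1) / 2 = 2, so it fires with 0 or 1 member up.
        expr: |
          sum(up{job=~"kubernetes-etcd.*"} == bool 1)
            < ((count(up{job=~"kubernetes-etcd.*"}) + 1) / 2)
        for: 3m                         # assumed hold time
        labels:
          severity: critical            # assumed label value
        annotations:
          description: "Fewer etcd members are reachable than quorum requires."
```

A single unreachable member already triggers the up==0 check first, so a rule like this would serve as an escalation for the case where quorum is actually lost.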

Prometheus scrape configuration (YAML)

  - job_name: 'kubernetes-etcd-19'
    scheme: https
    tls_config:
      cert_file: /usr/local/prometheus/ssl/kube-etcd-19.pem
      key_file: /usr/local/prometheus/ssl/kube-etcd-19-key.pem
      insecure_skip_verify: true
    scrape_interval: 120s
    static_configs:
    - targets: ['110.152.117.19:2379']
  - job_name: 'kubernetes-etcd-20'
    scheme: https
    tls_config:
      cert_file: /usr/local/prometheus/ssl/kube-etcd-20.pem
      key_file: /usr/local/prometheus/ssl/kube-etcd-20-key.pem
      insecure_skip_verify: true
    scrape_interval: 120s
    static_configs:
    - targets: ['110.152.117.20:2379']
  - job_name: 'kubernetes-etcd-21'
    scheme: https
    tls_config:
      cert_file: /usr/local/prometheus/ssl/kube-etcd-21.pem
      key_file: /usr/local/prometheus/ssl/kube-etcd-21-key.pem
      insecure_skip_verify: true
    scrape_interval: 120s
    static_configs:
    - targets: ['110.152.117.21:2379']
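
The three jobs above differ only in their certificate files and target address. If one client certificate is accepted by all three members, they could be folded into a single job, as in the sketch below. The job name `kubernetes-etcd` and the shared certificate paths are assumptions for illustration. The other settings are copied from the configuration above.

```yaml
scrape_configs:
  - job_name: 'kubernetes-etcd'   # assumed name; still matches job=~"kubernetes-etcd.*"
    scheme: https
    tls_config:
      # Hypothetical shared client certificate and key.
      cert_file: /usr/local/prometheus/ssl/kube-etcd.pem
      key_file: /usr/local/prometheus/ssl/kube-etcd-key.pem
      insecure_skip_verify: true  # as in the original jobs: server certificate is not verified
    scrape_interval: 120s
    static_configs:
      - targets:
          - '110.152.117.19:2379'
          - '110.152.117.20:2379'
          - '110.152.117.21:2379'
```

With this layout all members share one `job` label and are told apart by the `instance` label, which the alert descriptions already reference.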

Prometheus rules file

groups:
- name: Public Utilities Division ETCD cluster monitoring  # project name: use the company name
  rules:
  - alert: "ETCD service liveness monitoring"
    expr:  up{job=~"kubernetes-etcd.*"}==0
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD member is down or unreachable (resource: {{ $labels.instance }}); please handle it promptly!"

  - alert: "ETCD member without leader monitoring"
    expr:  etcd_server_has_leader{job=~"kubernetes-etcd.*"}==0
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD member has no leader and may have left the cluster (resource: {{ $labels.instance }}); please handle it promptly!"
  - alert: "ETCD leader change monitoring"
    expr:  increase(etcd_server_leader_changes_seen_total{job=~"kubernetes-etcd.*"}[1h]) >3
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD cluster is overloaded or its network connection is unstable (resource: {{ $labels.instance }}); please handle it promptly!"


  - alert: "ETCD failed proposals monitoring"
    expr:  rate(etcd_server_proposals_failed_total{job=~"kubernetes-etcd.*"}[15m])!=0
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD proposals are failing at a rate of {{ $value }} per second, which may indicate a failed leader election or loss of quorum (resource: {{ $labels.instance }}); please handle it promptly!"

  - alert: "ETCD daily leader change monitoring"
    expr:  changes(etcd_server_leader_changes_seen_total{job=~".*"}[1d])>1
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD cluster leader changed {{ $value }} times in the past day (resource: {{ $labels.instance }}); please handle it promptly!"
      
  - alert: "ETCD disk persistence latency"
    expr:  histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*"}[5m]))>0.5
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD p99 backend commit duration is {{ $value }} seconds (resource: {{ $labels.instance }}); please handle it promptly!"
      
  - alert: "ETCD database quota usage"
    expr:  (etcd_mvcc_db_total_size_in_bytes{}/etcd_server_quota_backend_bytes{}) * 100>80
    for: 30s
    labels:
      severity: "major"
      team: ops-gt-monitor
      alert_type: "ETCD alert"
      alert_host: "{{ $labels.service }}"
      alert_value: "{{ $value }}"
      alert_subject: "ETCD alert"
    annotations:
      summary: "ETCD cluster monitoring"
      description: "ETCD database size is at {{ $value }}% of the backend quota (resource: {{ $labels.instance }}); please handle it promptly!"
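
The latency rule above is built on etcd_disk_backend_commit_duration_seconds_bucket. etcd also exposes WAL fsync latency as a separate histogram, etcd_disk_wal_fsync_duration_seconds_bucket, so a companion rule might look like the sketch below. The group name, alert name, `for` duration, and label values are assumptions for illustration. The 0.5 second threshold is reused from the rule above and would need tuning for a given disk.

```yaml
groups:
  - name: etcd-wal-fsync-sketch          # hypothetical group name
    rules:
      - alert: "EtcdHighWalFsyncDuration"  # hypothetical alert name
        # p99 of WAL fsync latency over the last 5 minutes, per instance.
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~"kubernetes-etcd.*"}[5m])) > 0.5
        for: 10m                         # assumed hold time to ride out short spikes
        labels:
          severity: "major"              # assumed label value
          team: ops-gt-monitor
        annotations:
          summary: "ETCD cluster monitoring"
          description: "ETCD p99 WAL fsync duration is {{ $value }} seconds (resource: {{ $labels.instance }})."
```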

