当前位置: 首页 > article >正文

k8s 1.28.2 集群部署 Thanos 对接 MinIO 实现 Prometheus 数据长期存储

文章目录

    • @[toc]
    • 什么是 Thanos
    • Thanos 的主要功能
    • Thanos 的架构组件
    • Thanos 部署架构
      • Sidecar
      • Receive
      • 架构选择
    • 开始部署
      • 部署架构
      • 创建 namespace
      • node-exporter 部署
      • kube-state-metrics 部署
      • Prometheus + Thanos-Sidecar 部署
        • 固定节点创建 label
        • 生成 secret
          • MinIO 配置
          • etcd 证书
        • 启动 Prometheus + Thanos-Sidecar
      • Thanos-store-gateway 部署
      • Thanos-compact 部署
      • Thanos-query 部署
      • Thanos-query-globle 部署
      • Thanos-query-frontend 部署
      • Grafana 部署
      • 增加 Thanos 和 MinIO 监控
    • Grafana dashboard
      • coredns
      • etcd
      • Thanos
      • node-exporter
    • 最后

什么是 Thanos

  • Thanos 官网
  • Thanos quay.io 镜像仓库
  • Thanos Github

Thanos 是一个强大的 Prometheus 扩展解决方案,能够解决 Prometheus 在大规模环境下的存储、扩展性和高可用性问题。

它非常适合大规模集群监控需求,尤其是需要长期存储监控数据和全局查询。

Thanos 的主要功能

  • 全局查询(Global Query View)
    • 通过其 Querier 组件提供从多个 Prometheus 实例查询的能力,并能对跨多个数据源进行全局去重查询
    • 即使在大规模集群中运行多个 Prometheus 实例,用户也可以从一个接口统一查询所有的监控数据
  • 长期存储(Unlimited Retention)
    • Prometheus 默认只适用于短期数据存储,而 Thanos 提供了将监控数据推送到长期存储(如 Amazon S3、Google Cloud Storage、MinIO 等对象存储)的能力
  • Prometheus 集成(Prometheus Compatible)
    • Grafana 和其他支持 Prometheus 查询 API 的工具都可以通过 Thanos 查询 Prometheus 数据
  • 数据压缩与去重(Downsampling & Compaction)
    • Thanos 的 Compactor 组件会定期对存储在对象存储中的数据进行压缩、去重和优化,以减少存储开销并提高查询性能

Thanos 的架构组件

遵循 KISS 和 Unix 理念,Thanos 由一组组件组成,每个组件都扮演一个特定的角色

  • Sidecar
    • 与每个 Prometheus 实例一起部署,负责将数据推送到对象存储,并暴露出 Prometheus 的数据给 Querier
  • Store Gateway
    • 简称为 Store,专门用于从对象存储(如 AWS S3、Google Cloud Storage、MinIO 等)中检索历史监控数据的组件
  • Compactor
    • 负责对存储在对象存储中的数据进行压缩、去重和优化,提升查询性能并减少存储开销
  • Receiver
    • 专门用于接收和存储 Prometheus 实例通过 Remote Write 发送数据的组件(强烈建议使用 Prometheus v2.13.0+,因为它的远程读取功能得到了改进。)
  • Ruler/Rule
    • 类似 Prometheus 的 Alertmanager,它允许用户基于存储的数据执行告警和规则评估
  • Querier/Query
    • 一个用于全局查询的组件,能够从多个 Prometheus 实例和对象存储中提取数据,并提供统一的查询接口
  • Query Frontend
    • Query 的前端页面,通过查询分片缓存请求队列等机制,加速复杂查询,并提升查询在高负载环境下的响应速度

Thanos 部署架构

Sidecar

Sidecar 使用 Prometheus 的 reload 接口。确保 Prometheus 启用 --web.enable-lifecycle 参数

在这里插入图片描述

  • 优点
    • 轻量级:Sidecar 是一个轻量的代理,只需要运行在 Prometheus 实例旁边即可,无需对 Prometheus 进行大的改动。
    • 实时数据访问:Sidecar 允许 Thanos 直接访问 Prometheus 的实时监控数据,保证了最新监控信息的可查询性。
    • 长期存储集成:可以将 Prometheus 的数据定期上传到对象存储,解决了 Prometheus 原生不具备长期存储的缺陷。
  • 缺点
    • 依赖 Prometheus:Sidecar 必须依赖于运行的 Prometheus 实例,如果 Prometheus 实例宕机,Sidecar 也无法提供数据查询功能。
    • 水平扩展有限:Sidecar 并不设计用于大规模数据接收,它主要是作为 Prometheus 的配套组件,无法像 Receiver 那样水平扩展来处理大量的数据。

Receive

在这里插入图片描述

  • 优点

    • 大规模数据接收:Receiver 能够高效接收大量来自 Prometheus 实例的数据,适用于大规模部署。
    • 多租户支持:可以处理和隔离多个租户的数据,在需要监控多个独立环境时非常有用。
    • 水平扩展:通过数据分片和扩展 Receiver 实例,能够处理越来越多的数据接收任务。
    • 去重和高可用性:Receiver 能够通过去重机制,确保多实例高可用性,并避免重复数据存储。
  • 缺点

    • 无直接查询功能:Receiver 本身不具备查询功能,接收到的数据需要依赖其他 Thanos 组件(如 Querier 和 Store)进行查询和分析。

      实时性较低:相比直接从 Prometheus 实例查询数据,Receiver 可能在数据处理和查询时存在一定的延迟。

Sidecar 与 Receiver 的区别对比(抄自 ChatGPT)

特性Thanos SidecarThanos Receiver
主要功能集成 Prometheus 实例,提供实时数据访问和长期存储接收 Prometheus 实例的远程写入数据,并存储
数据源直接从 Prometheus 获取数据Prometheus 的 Remote Write 数据
数据存储方式定期上传 Prometheus 数据块到对象存储将接收到的数据存储在本地或对象存储中
水平扩展性无法扩展,只与单个 Prometheus 实例集成可以通过增加实例水平扩展
实时数据查询支持 Prometheus 实时数据查询无法直接查询数据
多租户支持不支持支持,适用于多租户环境
高可用性依赖 Prometheus 实例支持高可用部署和去重机制
适用场景与现有 Prometheus 实例集成,长期存储数据大规模、多租户环境的数据接收和存储

架构选择

  • 多集群 thanos 监控告警实践
  • 打造云原生大型分布式监控系统 (三): Thanos 部署与实践
  • 以下的建议取自这两个博客,具体的架构选择,也只能大家根据自己的实际情况验证和判断
  • Sidecar 与 Receiver 的最主要的区分就是最新数据的查询方式不同
    • Sidecar 最新数据直接读取 Promethues 数据目录
    • Receiver 的所有数据都在存储服务里面(S3 等存储服务)
  • Prometheus 集群不大,采集的服务不多的情况下,即使 Sidecar 和 全局查询的 Query 不在一个机房,只要都是国内的,查询延迟一般不会太高
  • Prometheus 集群很大,要采集的数据也非常多的情况下,尽可能还是选择 Sidecar 架构,因为数据一旦激增,Receiver 的压力会非常非常大,需要很大的资源,也需要很强大的存储性能
  • 除非主要目的是针对指标历史做分析使用,或者 Prometheus 有某些特殊场景无法持久化数据,这些以外的场景,建议使用 Sidecar

开始部署

采用 sidecar 模式部署

部署架构

考虑用 Prometheus 自带的 rule 做告警,这边没打算部署 Thanos-rule

k8s 集群 Ak8s 集群 B
Prometheus:v2.54.1Prometheus:v2.54.1
node-exporter:v1.8.2node-exporter:v1.8.2
kube-state-metrics:v2.11.0kube-state-metrics:v2.11.0
Thanos-sidecar:v0.36.1Thanos-sidecar:v0.36.1
Thanos-query:v0.36.1Thanos-query:v0.36.1
Thanos-store-gateway:v0.36.1Thanos-store-gateway:v0.36.1
Thanos-compact:v0.36.1
Thanos-query-globle:v0.36.1
Thanos-query-frontend:v0.36.1
Grafana

MinIO 部署可以看我之前的博客:k8s 1.28.2 集群部署 MinIO 分布式集群,先提前准备好 MinIO 集群

创建 namespace

以下所有的 k 命令都代表 kubectl 命令,部署这块只展示一个环境的,我这边是两套 k8s 集群,需要部署两套 Prometheus

k create ns monitor

node-exporter 部署

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 9100
    protocol: TCP
  selector:
    app.kubernetes.io/name: node-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - args:
        - --path.rootfs=/rootfs
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        image: docker.m.daocloud.io/prom/node-exporter:v1.8.2
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: http
        volumeMounts:
        - mountPath: /rootfs
          name: root
          readOnly: true
      hostIPC: true
      hostNetwork: true
      hostPID: true
      volumes:
      - hostPath:
          path: /
        name: root

kube-state-metrics 部署

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - serviceaccounts
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingressclasses
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterrolebindings
  - clusterroles
  - rolebindings
  - roles
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      automountServiceAccountToken: true
      containers:
      - image: docker.m.daocloud.io/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /livez
            port: http-metrics
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /readyz
            port: telemetry
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics-sa

Prometheus + Thanos-Sidecar 部署

固定节点创建 label
k label node 192.168.22.125 prometheus=true
生成 secret
MinIO 配置

因为包含 MinIO 的 access_key 和 secret_key,尽量别用 configmap 去明文读取,用 secret 读取,一会输出的内容,合并成一行后,需要放到下面的 secret 里面去替换掉

cat <<EOF | base64 -
type: S3
config:
  bucket: "prom-thanos-sidecar"
  endpoint: "minio.api.devops.icu"
  access_key: "gsl2dzAHviNzabSn0ikw"
  secret_key: "82zQ0UMDlOo3LxCQM9TqSygEYrMuxSSRYQdO1KXF"
  insecure: true
EOF
etcd 证书

我是 kubeadm 部署的 k8s 集群,我的证书路径是 /etc/kubernetes/pki/etcd,我直接把本地文件生成 secret

certs_dir=/etc/kubernetes/pki/etcd; \
k create secret generic etcd-pki -n monitoring \
--from-file=ca=${certs_dir}/ca.crt \
--from-file=cert=${certs_dir}/server.crt \
--from-file=key=${certs_dir}/server.key
启动 Prometheus + Thanos-Sidecar

Prometheus 的数据存储用的是本地 hostpath 的方式,由于 Thanos 需要读取 Prometheus 的数据,所以要保持用户一致,不然会因为权限问题,Thanos 没法读取数据,也没法将数据上传到 MinIO,具体的报错参考:ts=2024-10-21T06:09:16.284378709Z caller=sidecar.go:410 level=warn err="upload 01JAP2JAZ0AQT8BEYFY30A4VVD: hard link block: hard link file chunks/000001: link /etc/prometheus/data/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001 /etc/prometheus/data/thanos/upload/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001: operation not permitted" uploaded=0

  • Prometheus 参数简介
    • --storage.tsdb.min-block-duration=2h:最小2小时生成一次新的数据块
    • --storage.tsdb.max-block-duration=2h:最大2小时生成一次新的数据块
    • --storage.tsdb.retention.time=6h:Prometheus 本地数据保留时长,默认是15天,这个可以自己根据实际磁盘情况调整
    • --storage.tsdb.wal-compression:启用 WAL 日志压缩,减少 WAL 文件的大小,降低存储空间的需求
    • --storage.tsdb.no-lockfile:禁用锁文件,避免影响 Thanos 上传数据块到 MinIO
    • --web.enable-lifecycle:支持热更新 localhost:9090/-/reload 热加载配置文件
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-svc
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9090
    targetPort: 9090
  - name: grpc
    port: 10901
    targetPort: 10901
  selector:
    app: prometheus
  type: ClusterIP
---
apiVersion: v1
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      scrape_timeout: 10s
      external_labels:
        cluster: devops
        replica: $(POD_NAME)
    rule_files:
    - /etc/prometheus/rules/*.yml
    scrape_configs:
    - job_name: prometheus
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: prometheus
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9090
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kube-apiserver
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kubelet
      metrics_path: /metrics/cadvisor
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [instance]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: etcd
      kubernetes_sd_configs:
      - role: pod
      scheme: https
      tls_config:
        ca_file: /etc/prometheus/etcd-ssl/ca
        cert_file: /etc/prometheus/etcd-ssl/cert
        key_file: /etc/prometheus/etcd-ssl/key
        insecure_skip_verify: false
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_component]
        regex: etcd
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:2379
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: coredns
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        regex: kube-dns
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9153
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: node-exporter
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        action: replace
        target_label: ip
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kube-state-metrics
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;kube-state-metrics
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:8080
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
kind: ConfigMap
metadata:
  name: prometheus-cm
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: prometheus
  name: thanos-config
  namespace: monitoring
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: prometheus
                operator: In
                values:
                - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --config.file=/etc/prometheus/config/prometheus.yml
        - --storage.tsdb.path=/etc/prometheus/data
        - --storage.tsdb.min-block-duration=2h
        - --storage.tsdb.max-block-duration=2h
        - --storage.tsdb.retention.time=6h
        - --storage.tsdb.wal-compression
        - --storage.tsdb.no-lockfile
        - --web.enable-lifecycle
        command:
        - /bin/prometheus
        env:
        - name: TZ
          value: Asia/Shanghai
        image: quay.io/prometheus/prometheus:v2.54.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 60
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: http
          timeoutSeconds: 1
        name: prometheus
        ports:
        - containerPort: 9090
          name: http
        readinessProbe:
          failureThreshold: 60
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: http
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 1024Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
        - mountPath: /etc/prometheus/config
          name: prometheus-config
        - mountPath: /etc/prometheus/etcd-ssl
          name: etcd-ssl
      - args:
        - sidecar
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --tsdb.path=/etc/prometheus/data
        - --prometheus.url=http://localhost:9090
        - --objstore.config-file=/etc/thanos/config/thanos-sidecar.yml
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        name: thanos-sidecar
        ports:
        - containerPort: 10901
          name: grpc
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
        - mountPath: /etc/thanos/config/thanos-sidecar.yml
          name: thanos-config
          readOnly: true
          subPath: config
      imagePullSecrets:
      - name: harbor-secret
      initContainers:
      - command:
        - sh
        - -c
        - '[ -d /etc/prometheus/data/thanos ] || chown -R 65534:65534 /etc/prometheus/data'
        image: quay.io/prometheus/prometheus:v2.54.1
        imagePullPolicy: IfNotPresent
        name: init-dir
        securityContext:
          runAsUser: 0
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
      securityContext:
        runAsUser: 65534
      serviceAccount: prometheus-sa
      terminationGracePeriodSeconds: 0
      volumes:
      - hostPath:
          path: /approot/k8s_data/prometheus
          type: DirectoryOrCreate
        name: prometheus-home
      - configMap:
          name: prometheus-cm
        name: prometheus-config
      - name: thanos-config
        secret:
          secretName: thanos-config
      - name: etcd-ssl
        secret:
          secretName: etcd-pki

Thanos-store-gateway 部署

secret 里面涉及的内容,和 sidecar 里面的是一样的,记得替换成自己的

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-sa
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-objstore-config
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-store-gateway
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-store-gateway
  serviceName: thanos-store-gateway-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store-gateway
    spec:
      containers:
      - args:
        - store
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/store
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --no-cache-index-header
        - --objstore.config-file=/etc/thanos/objstore.yaml
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-store-gateway
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - mountPath: /etc/thanos/objstore.yaml
          name: objstore-config
          readOnly: true
          subPath: config
        - mountPath: /var/thanos/store
          name: data
          readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-store-gateway-sa
      volumes:
      - name: objstore-config
        secret:
          secretName: thanos-objstore-config
      - emptyDir:
          sizeLimit: 100Mi
        name: data

Thanos-compact 部署

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-sa
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-objstore-config
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-store-gateway
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-store-gateway
  serviceName: thanos-store-gateway-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store-gateway
    spec:
      containers:
      - args:
        - store
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/store
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --no-cache-index-header
        - --objstore.config-file=/etc/thanos/objstore.yaml
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-store-gateway
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - mountPath: /etc/thanos/objstore.yaml
          name: objstore-config
          readOnly: true
          subPath: config
        - mountPath: /var/thanos/store
          name: data
          readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-store-gateway-sa
      volumes:
      - name: objstore-config
        secret:
          secretName: thanos-objstore-config
      - emptyDir:
          sizeLimit: 100Mi
        name: data
root@dream:/approot/chen2ha/kubetpl 13:58:08 # cat output/thanos-compact.yaml
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-compact
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-compact
  serviceName: thanos-compact-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-compact
    spec:
      containers:
      - args:
        - compact
        - --wait
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/compact
        - --http-address=0.0.0.0:10902
        - --objstore.config-file=/etc/thanos/objstore.yaml
        - --compact.enable-vertical-compaction
        - --deduplication.replica-label=replica
        - --deduplication.func=penalty
        - --delete-delay=1d
        - --retention.resolution-raw=7d
        - --retention.resolution-5m=15d
        - --retention.resolution-1h=30d
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-compact
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - mountPath: /etc/thanos/objstore.yaml
          name: objstore-config
          readOnly: true
          subPath: config
        - mountPath: /var/thanos/compact
          name: data
          readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-compact-sa
      volumes:
      - name: objstore-config
        secret:
          secretName: thanos-objstore-config
      - emptyDir:
          sizeLimit: 100Mi
        name: data

Thanos-query 部署

  • --query.replica-label 参数指定依据哪个标签做数据的去重,在 Prometheus 的 external_labels 里面配置的
  • 给 Thanos-query 的 gRPC 端口配一个独立的 svc ,通过 nodeport 的方式暴露端口,再由一个全局的 Thanos-query 来注册各个集群的 Thanos-query,最终通过 Thanos-query-frontend 来查询
    • 当然,如果资源足够,也完全可以每个集群再多部署一个 Thanos-query 来当作全局查询,内外查询做一个分流
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-np-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    nodePort: 31901
    port: 10901
    targetPort: grpc
  selector:
    app.kubernetes.io/name: thanos-query
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query
    spec:
      containers:
      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-store-gateway-headless.monitoring.svc.cluster.local
        - --endpoint=dnssrv+_grpc._tcp.prometheus-svc.monitoring.svc.cluster.local
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-sa

Thanos-query-globle 部署

--endpoint 我是两个集群各挑了两个节点

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-globle
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-globle
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-globle
    spec:
      containers:
      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=192.168.22.112:31901
        - --endpoint=192.168.22.113:31901
        - --endpoint=192.168.22.122:31901
        - --endpoint=192.168.22.123:31901
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-globle
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-globle-sa

Thanos-query-frontend 部署

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-svc
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-frontend
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-frontend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-frontend
    spec:
      containers:
      - args:
        - query-frontend
        - --log.level=info
        - --log.format=logfmt
        - --http-address=0.0.0.0:10902
        - --query-frontend.downstream-url=http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-frontend
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-frontend-sa
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-frontend
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: thanos.devops.icu
    http:
      paths:
      - backend:
          service:
            name: thanos-query-frontend-svc
            port:
              number: 10902
        path: /
        pathType: Prefix

Grafana 部署

这边采用了 nfs 针对 dashboard 的 json 文件做了持久化,有修改或者增加就比较方便,直接上传到 nfs 就可以了

---
apiVersion: v1
data:
  grafana.ini: |
    provisioning = /etc/grafana/provisioning
kind: ConfigMap
metadata:
  name: grafana-cm
  namespace: monitoring
---
apiVersion: v1
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
kind: ConfigMap
metadata:
  name: grafana-datasource
  namespace: monitoring
---
apiVersion: v1
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: 'a unique provider name'
      orgId: 1
      folder: ''
      folderUid: ''
      type: file
      disableDeletion: false
      editable: true
      updateIntervalSeconds: 10
      allowUiUpdates: true
      options:
        # <string, required> path to dashboard files on disk. Required
        path: /etc/grafana/provisioning/dashboards/views
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana-svc
  namespace: monitoring
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: http-grafana
  selector:
    app.kubernetes.io/name: grafana
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  template:
    metadata:
      labels:
        app.kubernetes.io/name: grafana
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: docker.m.daocloud.io/grafana/grafana:11.3.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        name: grafana
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 250m
            memory: 750Mi
        volumeMounts:
        - mountPath: /etc/grafana/grafana.ini
          name: grafana-config
          subPath: grafana.ini
        - mountPath: /etc/grafana/provisioning/datasources/prometheus.yaml
          name: grafana-datasource
          subPath: prometheus.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/grafana-dashboard.yaml
          name: grafana-dashboard
          subPath: dashboards.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/views
          name: grafana
          subPathExpr: $(POD_NAME)
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      volumes:
      - configMap:
          name: grafana-cm
        name: grafana-config
      - configMap:
          name: grafana-datasource
        name: grafana-datasource
      - configMap:
          name: grafana-dashboard
        name: grafana-dashboard
  volumeClaimTemplates:
  - metadata:
      name: grafana
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: nfs-client
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: grafana.devops.icu
    http:
      paths:
      - backend:
          service:
            name: grafana-svc
            port:
              number: 3000
        path: /
        pathType: Prefix

增加 Thanos 和 MinIO 监控

Prometheus 采集 MinIO 指标需要鉴权,需要通过 mc 命令配置 JWT 认证,可以查看官方文档:mc admin prometheus generate

或者 MinIO 配置 MINIO_PROMETHEUS_AUTH_TYPE=public 参数,需要重启 MinIO 生效,使 Prometheus 可以直接访问 metrics api

    - job_name: minio
      metrics_path: /minio/v2/metrics/cluster
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: storage;minio-svc
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9000
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-query
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-query-svc
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-store-gateway
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-store-gateway-headless
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-compact
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-compact-headless
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

Grafana dashboard

记录几个我这边配置的 dashboard id,因为我这边是双 k8s 集群,所以要加上 cluster 这个变量,大部分都需要自己再细调一下

coredns

14981

在这里插入图片描述

etcd

用的官方给的模板:grafana.json

在这里插入图片描述

Thanos

12937

在这里插入图片描述

node-exporter

12633 或者 21902

在这里插入图片描述

16098

在这里插入图片描述

最后

yaml 和 dashboard 的 json 文件可以从 gitee 自取:https://gitee.com/chen2ha/yaml_for_kubernetes/tree/master/thanos


http://www.kler.cn/a/372955.html

相关文章:

  • Java集合常见面试题总结(5)
  • es 常用命令(已亲测)
  • JavaFx -- chapter05(多用户服务器)
  • 软件行业似乎要消失了
  • Clace和sqlite-fs:使用SQLite替代文件系统
  • 详细指南:解决Garmin 手表无法与电脑连接的问题
  • GO语言微服务 服务注册与服务发现平台 - Nacos go sdk
  • 通过route访问Openshift上的HTTP request报错504 Gateway Time-out【已解决】
  • C#读取.ini配置文件
  • 手工方式屏蔽某一个网站
  • 利用摄像机实时接入分析平台LiteAIServer视频智能分析软件进行视频监控:过亮过暗检测算法详解
  • AHT20 HAL库驱动
  • 人工智能:开启未来之门
  • 如何分析算法的执行效率和资源消耗
  • 将本地某个commit 提交另一个分支上
  • Unity BesHttp插件修改Error log的格式
  • 数字信封原理解析:安全高效,一次一密!
  • 基于Hadoop和Hive的健康保险数据分析
  • 现代Web酒店客房管理:基于Spring Boot的实现
  • Linux scp命令语法
  • 00 硬件、嵌入式硬件知识-目录篇
  • R语言机器学习算法实战系列(十五)随机森林生存预后模型+SHAP值 (Random Survival Forest + SHAP)
  • AI虚拟主播实时互动模块的搭建与开发!
  • XSS小游戏【1-13关】
  • HTML入门教程22:HTML文件路径
  • 物联网监控数据采集,传输和存储方案:使用 GreptimeDB 和 YoMo