Deploying dcgm-exporter in a Kubernetes Cluster to Collect GPU Metrics
Overall steps:
- Deploy the dcgm-exporter DaemonSet and Service, and make sure the Service has the correct labels and port.
- Create a ServiceMonitor that selects the dcgm-exporter Service and specifies the port.
- Check the Prometheus targets page to confirm that dcgm-exporter is discovered and scraped correctly.
- If needed, adjust Prometheus RBAC or network policies so Prometheus can reach the exporter.
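Before starting, it is worth confirming that the GPU nodes actually carry the nvidia.com/gpu.present=true label that the DaemonSet below relies on as a nodeSelector. That label is normally applied by NVIDIA GPU Feature Discovery or the GPU Operator; if your cluster labels GPU nodes differently, adjust the nodeSelector accordingly. A quick check:
kubectl get nodes -l nvidia.com/gpu.present=true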
1. Deploy dcgm-exporter
Create a dcgm-exporter.yaml file containing the DaemonSet and Service:
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring   # assumes the same namespace as kube-prometheus
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # run only on nodes that have GPUs
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.3.4-3.6.12-ubuntu22.04
          ports:
            - containerPort: 9400
          resources:
            limits:
              nvidia.com/gpu: 1   # allocate one GPU
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter   # the ServiceMonitor selects the Service by this label
spec:
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
Apply the configuration:
kubectl apply -f dcgm-exporter.yaml
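Before creating the ServiceMonitor, you can optionally confirm that the exporter itself serves metrics by port-forwarding the Service and fetching the metrics endpoint directly (a simple sanity check against the port defined above):
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400
# in another terminal
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL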
2. Create the ServiceMonitor resource
Create a dcgm-service-monitor.yaml file that defines how the metrics are scraped:
# dcgm-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring   # must be in a namespace watched by the Prometheus Operator
  labels:
    release: kube-prometheus   # adjust to match the labels of your actual kube-prometheus deployment
spec:
  jobLabel: dcgm-exporter
  endpoints:
    - port: metrics   # matches the port name in the Service
      interval: 15s
      path: /metrics   # metrics path
  selector:
    matchLabels:
      app: dcgm-exporter   # matches the Service labels
  namespaceSelector:
    matchNames:
      - monitoring   # namespace where the Service lives
Apply the configuration:
kubectl apply -f dcgm-service-monitor.yaml
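Note that the release: kube-prometheus label above must match the serviceMonitorSelector of your Prometheus custom resource, otherwise the Operator silently ignores the ServiceMonitor. You can print the selector your deployment actually uses (assuming a single Prometheus resource in the monitoring namespace):
kubectl -n monitoring get prometheus -o jsonpath='{.items[*].spec.serviceMonitorSelector}'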
3. Verify the configuration
Check the Pod status:
kubectl get pods -n monitoring -l app=dcgm-exporter
Check the Service and Endpoints:
kubectl get svc,ep -n monitoring dcgm-exporter
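If the Endpoints object is empty, the Service selector does not match the Pod labels. Printing only the backing Pod IPs makes an empty result easy to spot:
kubectl -n monitoring get endpoints dcgm-exporter -o jsonpath='{.subsets[*].addresses[*].ip}'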
Access the Prometheus UI:
1. Port-forward the Prometheus service:
kubectl --namespace monitoring port-forward svc/prometheus-operated 9090
2. Open http://localhost:9090/targets in a browser and confirm the dcgm-exporter target status is UP.
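If the target does not show up at all, the Prometheus Operator logs usually explain why (label mismatch, RBAC, or an unwatched namespace). The deployment name below assumes a standard kube-prometheus install and may differ in your cluster:
kubectl -n monitoring logs deploy/prometheus-operator --tail=50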
Query GPU metrics:
In the Prometheus UI, run the query DCGM_FI_DEV_GPU_UTIL to verify the metric exists.
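Beyond the raw utilization gauge, a few example queries, assuming the default dcgm-exporter label set (gpu, Hostname, modelName), which can vary with the exporter configuration:
# average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
# framebuffer memory used per GPU (MiB)
DCGM_FI_DEV_FB_USED
# GPU temperature (°C)
DCGM_FI_DEV_GPU_TEMP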