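HAMi + prometheus-k8s + grafana: monitoring vGPU virtualization
I spent more than half a month in Changsha going over project metrics with the client, so this blog has not been updated for a while.
Now that I'm back, I'm continuing to work out how to monitor HAMi vGPU virtualization in Grafana. After all, the contract requires us to demonstrate GPU resource limits and compute sharing, as well as monitoring of shared accelerator-card resources.
First, why HAMi? One important reason is that a colleague introduced me to the tool's author, so I can take many questions directly to the author.
HAMi is an open-source virtualization project from China for GPUs and Chinese-made accelerator cards (for supported models and their features, see the project site: https://github.com/Project-HAMi/HAMi/). It virtualizes GPUs and accelerator cards for containers running on Kubernetes. Formerly named "k8s-vGPU-scheduler" and originally open-sourced by our company, HAMi is gaining popularity both in China and internationally. It is middleware for managing heterogeneous devices in Kubernetes: it can manage different device types (GPU, NPU, and so on), share devices between Pods, and make better scheduling decisions based on device topology and scheduling policy. To keep things simple, this article describes just one workable approach: Prometheus scrapes the metrics and serves as the data source, and Grafana displays them.
This article assumes that a Kubernetes cluster and HAMi are already deployed. All components mentioned below are installed inside the Kubernetes cluster. Their versions are as follows: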
Component | Version | Notes |
---|---|---|
Kubernetes cluster | v1.23.1 | AMD64 servers |
HAMi | According to the author, HAMi's release process is not yet mature, so for now the value of the scheduler.kubeScheduler.imageTag install parameter is treated as its version; this value must match the Kubernetes version | Project: https://github.com/Project-HAMi/HAMi/ |
kube-prometheus stack | prom/prometheus:v2.27.1 | For installing the monitoring stack, see the CSDN blog post "实现prometheus+grafana的监控部署" |
dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 | |
HAMi is installed with Helm by default. Add the Helm repository:
helm repo add hami-charts https://project-hami.github.io/HAMi/
Check the Kubernetes version and install HAMi (the server version here is 1.23.1):
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system
Verify that HAMi installed successfully:
kubectl get pods -n kube-system
If hami-device-plugin and hami-scheduler are both in the Running state, the installation succeeded.
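Since the imageTag has to line up with the cluster version, a small check can catch a mismatch early. The following is a minimal sketch, not part of HAMi; the function name is hypothetical, and it assumes you have already read the cluster version and the kube-scheduler image reference (for example from kubectl output) into strings.

```python
def scheduler_tag_matches(cluster_version: str, scheduler_image: str) -> bool:
    """Return True if the image tag equals the cluster version, e.g. v1.23.1."""
    # Only look at the part after the last '/', so a registry port is not mistaken for a tag.
    name = scheduler_image.rsplit("/", 1)[-1]
    if ":" not in name:
        return False  # no explicit tag, so a match cannot be confirmed
    tag = name.rsplit(":", 1)[1]
    # Tolerate a missing leading "v" on either side.
    return tag.lstrip("v") == cluster_version.lstrip("v")


if __name__ == "__main__":
    image = "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.23.1"
    print(scheduler_tag_matches("v1.23.1", image))
```

If the check fails, reinstalling with the matching --set scheduler.kubeScheduler.imageTag value is the first thing to try.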
Render the Helm installation into hami-install.yaml:
helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml
The rendered manifests, which can be deployed in this form:
---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: hami-device-plugin
namespace: "kube-system"
labels:
app.kubernetes.io/component: "hami-device-plugin"
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: hami-scheduler
namespace: "kube-system"
labels:
app.kubernetes.io/component: "hami-scheduler"
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: hami-device-plugin
labels:
app.kubernetes.io/component: hami-device-plugin
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
data:
config.json: |
{
"nodeconfig": [
{
"name": "m5-cloudinfra-online02",
"devicememoryscaling": 1.8,
"devicesplitcount": 10,
"migstrategy":"none",
"filterdevices": {
"uuid": [],
"index": []
}
}
]
}
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: hami-scheduler
labels:
app.kubernetes.io/component: hami-scheduler
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
data:
config.json: |
{
"kind": "Policy",
"apiVersion": "v1",
"extenders": [
{
"urlPrefix": "https://127.0.0.1:443",
"filterVerb": "filter",
"bindVerb": "bind",
"enableHttps": true,
"weight": 1,
"nodeCacheCapable": true,
"httpTimeout": 30000000000,
"tlsConfig": {
"insecure": true
},
"managedResources": [
{
"name": "nvidia.com/gpu",
"ignoredByScheduler": true
},
{
"name": "nvidia.com/gpumem",
"ignoredByScheduler": true
},
{
"name": "nvidia.com/gpucores",
"ignoredByScheduler": true
},
{
"name": "nvidia.com/gpumem-percentage",
"ignoredByScheduler": true
},
{
"name": "nvidia.com/priority",
"ignoredByScheduler": true
},
{
"name": "cambricon.com/vmlu",
"ignoredByScheduler": true
},
{
"name": "hygon.com/dcunum",
"ignoredByScheduler": true
},
{
"name": "hygon.com/dcumem",
"ignoredByScheduler": true
},
{
"name": "hygon.com/dcucores",
"ignoredByScheduler": true
},
{
"name": "iluvatar.ai/vgpu",
"ignoredByScheduler": true
}
],
"ignoreable": false
}
]
}
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: hami-scheduler-newversion
labels:
app.kubernetes.io/component: hami-scheduler
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
data:
config.yaml: |
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: false
profiles:
- schedulerName: hami-scheduler
extenders:
- urlPrefix: "https://127.0.0.1:443"
filterVerb: filter
bindVerb: bind
nodeCacheCapable: true
weight: 1
httpTimeout: 30s
enableHTTPS: true
tlsConfig:
insecure: true
managedResources:
- name: nvidia.com/gpu
ignoredByScheduler: true
- name: nvidia.com/gpumem
ignoredByScheduler: true
- name: nvidia.com/gpucores
ignoredByScheduler: true
- name: nvidia.com/gpumem-percentage
ignoredByScheduler: true
- name: nvidia.com/priority
ignoredByScheduler: true
- name: cambricon.com/vmlu
ignoredByScheduler: true
- name: hygon.com/dcunum
ignoredByScheduler: true
- name: hygon.com/dcumem
ignoredByScheduler: true
- name: hygon.com/dcucores
ignoredByScheduler: true
- name: iluvatar.ai/vgpu
ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: hami-scheduler-device
labels:
app.kubernetes.io/component: hami-scheduler
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
data:
device-config.yaml: |-
nvidia:
resourceCountName: nvidia.com/gpu
resourceMemoryName: nvidia.com/gpumem
resourceMemoryPercentageName: nvidia.com/gpumem-percentage
resourceCoreName: nvidia.com/gpucores
resourcePriorityName: nvidia.com/priority
overwriteEnv: false
defaultMemory: 0
defaultCores: 0
defaultGPUNum: 1
deviceSplitCount: 10
deviceMemoryScaling: 1
deviceCoreScaling: 1
cambricon:
resourceCountName: cambricon.com/vmlu
resourceMemoryName: cambricon.com/mlu.smlu.vmemory
resourceCoreName: cambricon.com/mlu.smlu.vcore
hygon:
resourceCountName: hygon.com/dcunum
resourceMemoryName: hygon.com/dcumem
resourceCoreName: hygon.com/dcucores
metax:
resourceCountName: "metax-tech.com/gpu"
mthreads:
resourceCountName: "mthreads.com/vgpu"
resourceMemoryName: "mthreads.com/sgpu-memory"
resourceCoreName: "mthreads.com/sgpu-core"
iluvatar:
resourceCountName: iluvatar.ai/vgpu
resourceMemoryName: iluvatar.ai/vcuda-memory
resourceCoreName: iluvatar.ai/vcuda-core
vnpus:
- chipName: 910B
commonWord: Ascend910A
resourceName: huawei.com/Ascend910A
resourceMemoryName: huawei.com/Ascend910A-memory
memoryAllocatable: 32768
memoryCapacity: 32768
aiCore: 30
templates:
- name: vir02
memory: 2184
aiCore: 2
- name: vir04
memory: 4369
aiCore: 4
- name: vir08
memory: 8738
aiCore: 8
- name: vir16
memory: 17476
aiCore: 16
- chipName: 910B3
commonWord: Ascend910B
resourceName: huawei.com/Ascend910B
resourceMemoryName: huawei.com/Ascend910B-memory
memoryAllocatable: 65536
memoryCapacity: 65536
aiCore: 20
aiCPU: 7
templates:
- name: vir05_1c_16g
memory: 16384
aiCore: 5
aiCPU: 1
- name: vir10_3c_32g
memory: 32768
aiCore: 10
aiCPU: 3
- chipName: 310P3
commonWord: Ascend310P
resourceName: huawei.com/Ascend310P
resourceMemoryName: huawei.com/Ascend310P-memory
memoryAllocatable: 21527
memoryCapacity: 24576
aiCore: 8
aiCPU: 7
templates:
- name: vir01
memory: 3072
aiCore: 1
aiCPU: 1
- name: vir02
memory: 6144
aiCore: 2
aiCPU: 2
- name: vir04
memory: 12288
aiCore: 4
aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: hami-device-plugin-monitor
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- create
- watch
- list
- update
- patch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- update
- list
- patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: hami-device-plugin
labels:
app.kubernetes.io/component: "hami-device-plugin"
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
#name: cluster-admin
name: hami-device-plugin-monitor
subjects:
- kind: ServiceAccount
name: hami-device-plugin
namespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: hami-scheduler
labels:
app.kubernetes.io/component: "hami-scheduler"
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
name: hami-scheduler
namespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
name: hami-device-plugin-monitor
labels:
app.kubernetes.io/component: hami-device-plugin
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
spec:
externalTrafficPolicy: Local
selector:
app.kubernetes.io/component: hami-device-plugin
type: NodePort
ports:
- name: monitorport
port: 31992
targetPort: 9394
nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
name: hami-scheduler
labels:
app.kubernetes.io/component: hami-scheduler
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
spec:
type: NodePort
ports:
- name: http
port: 443
targetPort: 443
nodePort: 31998
protocol: TCP
- name: monitor
port: 31993
targetPort: 9395
nodePort: 31993
protocol: TCP
selector:
app.kubernetes.io/component: hami-scheduler
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: hami-device-plugin
labels:
app.kubernetes.io/component: hami-device-plugin
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
spec:
selector:
matchLabels:
app.kubernetes.io/component: hami-device-plugin
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
template:
metadata:
labels:
app.kubernetes.io/component: hami-device-plugin
hami.io/webhook: ignore
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
spec:
imagePullSecrets:
[]
serviceAccountName: hami-device-plugin
priorityClassName: system-node-critical
hostPID: true
hostNetwork: true
containers:
- name: device-plugin
image: projecthami/hami:latest
imagePullPolicy: "IfNotPresent"
lifecycle:
postStart:
exec:
command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
command:
- nvidia-device-plugin
- --config-file=/device-config.yaml
- --mig-strategy=none
- --disable-core-limit=false
- -v=false
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
- name: HOOK_PATH
value: /usr/local
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
add: ["SYS_ADMIN"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: lib
mountPath: /usr/local/vgpu
- name: usrbin
mountPath: /usrbin
- name: deviceconfig
mountPath: /config
- name: hosttmp
mountPath: /tmp
- name: device-config
mountPath: /device-config.yaml
subPath: device-config.yaml
- name: vgpu-monitor
image: projecthami/hami:latest
imagePullPolicy: "IfNotPresent"
command: ["vGPUmonitor"]
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
add: ["SYS_ADMIN"]
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
- name: HOOK_PATH
value: /usr/local/vgpu
volumeMounts:
- name: ctrs
mountPath: /usr/local/vgpu/containers
- name: dockers
mountPath: /run/docker
- name: containerds
mountPath: /run/containerd
- name: sysinfo
mountPath: /sysinfo
- name: hostvar
mountPath: /hostvar
volumes:
- name: ctrs
hostPath:
path: /usr/local/vgpu/containers
- name: hosttmp
hostPath:
path: /tmp
- name: dockers
hostPath:
path: /run/docker
- name: containerds
hostPath:
path: /run/containerd
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: lib
hostPath:
path: /usr/local/vgpu
- name: usrbin
hostPath:
path: /usr/bin
- name: sysinfo
hostPath:
path: /sys
- name: hostvar
hostPath:
path: /var
- name: deviceconfig
configMap:
name: hami-device-plugin
- name: device-config
configMap:
name: hami-scheduler-device
nodeSelector:
gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: hami-scheduler
labels:
app.kubernetes.io/component: hami-scheduler
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: hami-scheduler
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
template:
metadata:
labels:
app.kubernetes.io/component: hami-scheduler
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
hami.io/webhook: ignore
spec:
imagePullSecrets:
[]
serviceAccountName: hami-scheduler
priorityClassName: system-node-critical
containers:
- name: kube-scheduler
image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
imagePullPolicy: "IfNotPresent"
command:
- kube-scheduler
- --config=/config/config.yaml
- -v=4
- --leader-elect=true
- --leader-elect-resource-name=hami-scheduler
- --leader-elect-resource-namespace=kube-system
volumeMounts:
- name: scheduler-config
mountPath: /config
- name: vgpu-scheduler-extender
image: projecthami/hami:latest
imagePullPolicy: "IfNotPresent"
env:
command:
- scheduler
- --http_bind=0.0.0.0:443
- --cert_file=/tls/tls.crt
- --key_file=/tls/tls.key
- --scheduler-name=hami-scheduler
- --metrics-bind-address=:9395
- --node-scheduler-policy=binpack
- --gpu-scheduler-policy=spread
- --device-config-file=/device-config.yaml
- --debug
- -v=4
ports:
- name: http
containerPort: 443
protocol: TCP
volumeMounts:
- name: tls-config
mountPath: /tls
- name: device-config
mountPath: /device-config.yaml
subPath: device-config.yaml
volumes:
- name: tls-config
secret:
secretName: hami-scheduler-tls
- name: scheduler-config
configMap:
name: hami-scheduler-newversion
- name: device-config
configMap:
name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: hami-webhook
webhooks:
- admissionReviewVersions:
- v1beta1
clientConfig:
service:
name: hami-scheduler
namespace: kube-system
path: /webhook
port: 443
failurePolicy: Ignore
matchPolicy: Equivalent
name: vgpu.hami.io
namespaceSelector:
matchExpressions:
- key: hami.io/webhook
operator: NotIn
values:
- ignore
objectSelector:
matchExpressions:
- key: hami.io/webhook
operator: NotIn
values:
- ignore
reinvocationPolicy: Never
rules:
- apiGroups:
- ""
apiVersions:
- v1
operations:
- CREATE
resources:
- pods
scope: '*'
sideEffects: None
timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: hami-admission
annotations:
"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: hami-admission
annotations:
"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
rules:
- apiGroups:
- admissionregistration.k8s.io
resources:
#- validatingwebhookconfigurations
- mutatingwebhookconfigurations
verbs:
- get
- update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: hami-admission
annotations:
"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: hami-admission
subjects:
- kind: ServiceAccount
name: hami-admission
namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: hami-admission
annotations:
"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
rules:
- apiGroups:
- ""
resources:
- secrets
verbs:
- get
- create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: hami-admission
annotations:
"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: hami-admission
subjects:
- kind: ServiceAccount
name: hami-admission
namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: hami-admission-create
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
spec:
template:
metadata:
name: hami-admission-create
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
hami.io/webhook: ignore
spec:
imagePullSecrets:
[]
containers:
- name: create
image: liangjw/kube-webhook-certgen:v1.1.1
imagePullPolicy: IfNotPresent
args:
- create
- --cert-name=tls.crt
- --key-name=tls.key
- --host=hami-scheduler.kube-system.svc,127.0.0.1
- --namespace=kube-system
- --secret-name=hami-scheduler-tls
restartPolicy: OnFailure
serviceAccountName: hami-admission
securityContext:
runAsNonRoot: true
runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: hami-admission-patch
annotations:
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
spec:
template:
metadata:
name: hami-admission-patch
labels:
helm.sh/chart: hami-2.4.0
app.kubernetes.io/name: hami
app.kubernetes.io/instance: hami
app.kubernetes.io/version: "2.4.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/component: admission-webhook
hami.io/webhook: ignore
spec:
imagePullSecrets:
[]
containers:
- name: patch
image: liangjw/kube-webhook-certgen:v1.1.1
imagePullPolicy: IfNotPresent
args:
- patch
- --webhook-name=hami-webhook
- --namespace=kube-system
- --patch-validating=false
- --secret-name=hami-scheduler-tls
restartPolicy: OnFailure
serviceAccountName: hami-admission
securityContext:
runAsNonRoot: true
runAsUser: 2000
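The rendered file is long, so a quick inventory helps before applying it. Below is a minimal sketch that counts top-level resource kinds in a multi-document manifest. It assumes the file is normally indented YAML, so that only top-level kind: keys start at column 0; the SAMPLE text is a shortened stand-in, not the full hami-install.yaml.

```python
import re
from collections import Counter


def count_kinds(manifest_text: str) -> Counter:
    """Count top-level 'kind:' entries in a multi-document YAML manifest."""
    # Nested keys such as roleRef.kind are indented, so anchoring at column 0 skips them.
    return Counter(re.findall(r"^kind:\s*(\S+)", manifest_text, flags=re.MULTILINE))


# Shortened stand-in for hami-install.yaml (illustrative only).
SAMPLE = """\
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
roleRef:
  kind: ClusterRole
  name: cluster-admin
---
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
"""

if __name__ == "__main__":
    for kind, count in sorted(count_kinds(SAMPLE).items()):
        print(kind, count)
```

To run it against the real file, read hami-install.yaml into a string and pass that to count_kinds.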
Deploy dcgm-exporter:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.6.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
  ports:
  - name: "metrics"
    port: 9400
dcgm-exporter is now installed.
Download the panel JSON file from the hami-vgpu-dashboard page on Grafana Labs.
After you import it, Grafana creates a dashboard named "hami-vgpu-dashboard", but some panels on it, such as vGPUCorePercentage, have no data yet.
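ServiceMonitor is a custom resource of the Prometheus Operator, used mainly to monitor Services in Kubernetes. It does two things:
1. Automatic discovery
ServiceMonitor lets Prometheus discover and monitor Services in Kubernetes automatically. By defining a ServiceMonitor, you tell Prometheus which Service endpoints to scrape.
2. Scrape configuration
A ServiceMonitor also sets the scrape parameters, for example:
- Scrape interval: how often Prometheus scrapes the data (for example, every 30 seconds).
- Timeout: how long a scrape request may take before it times out.
- Label selector: the labels of the Services to monitor, so that Prometheus scrapes only the relevant Services.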
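To make the label selector concrete, here is a minimal sketch of matchLabels semantics: a Service is selected only if every key listed in the selector is present on it with the same value. The function name is hypothetical; the label sets are copied from the hami-device-plugin-monitor and dcgm-exporter Services shown earlier.

```python
def selects(match_labels: dict, service_labels: dict) -> bool:
    """matchLabels semantics: every listed key must be present with an equal value."""
    return all(service_labels.get(key) == value for key, value in match_labels.items())


# Labels copied from the hami-device-plugin-monitor Service in hami-install.yaml.
DEVICE_PLUGIN_SERVICE = {
    "app.kubernetes.io/component": "hami-device-plugin",
    "app.kubernetes.io/name": "hami",
    "app.kubernetes.io/instance": "hami",
}
# Labels copied from the dcgm-exporter Service.
DCGM_SERVICE = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "3.6.1",
}

SELECTOR = {"app.kubernetes.io/component": "hami-device-plugin"}

if __name__ == "__main__":
    print(selects(SELECTOR, DEVICE_PLUGIN_SERVICE))  # selected
    print(selects(SELECTOR, DCGM_SERVICE))           # not selected
```

Extra labels on the Service do not matter; only the keys named in the selector are compared.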
In addition to dcgm-exporter, HAMi needs two ServiceMonitors, one for the device plugin and one for the scheduler:
hami-device-plugin-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - path: /metrics
    port: monitorport
    interval: "15s"
    honorLabels: false
    relabelings:
    - sourceLabels: [__meta_kubernetes_endpoints_name]
      regex: hami-.*
      replacement: $1
      action: keep
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: ${1}
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_host_ip]
      regex: (.*)
      targetLabel: ip
      replacement: $1
      action: replace
hami-scheduler-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - path: /metrics
    port: monitor
    interval: "15s"
    honorLabels: false
    relabelings:
    - sourceLabels: [__meta_kubernetes_endpoints_name]
      regex: hami-.*
      replacement: $1
      action: keep
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: ${1}
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_host_ip]
      regex: (.*)
      targetLabel: ip
      replacement: $1
      action: replace
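The relabelings in both ServiceMonitors do the same three things: keep only targets whose endpoints name matches hami-.*, then copy the pod's node name and host IP into node_name and ip. The sketch below is a simplified simulation of how Prometheus applies keep and replace rules (fully anchored regex, source labels joined with ";"). It is not Prometheus code, and the target values are sample data.

```python
import re
from typing import Optional


def expand(template: str, match: "re.Match") -> str:
    """Expand Prometheus-style $1 / ${1} references using the match groups."""
    python_template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", template)
    return match.expand(python_template)


def relabel(labels: dict, rules: list) -> Optional[dict]:
    """Apply keep/replace rules in order; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["sourceLabels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)  # regex is fully anchored
        action = rule.get("action", "replace")
        if action == "keep" and match is None:
            return None  # target is dropped
        if action == "replace" and match is not None:
            labels[rule["targetLabel"]] = expand(rule.get("replacement", "$1"), match)
    return labels


# The three rules from the ServiceMonitors above (replacement is ignored by "keep").
RULES = [
    {"sourceLabels": ["__meta_kubernetes_endpoints_name"], "regex": "hami-.*", "action": "keep"},
    {"sourceLabels": ["__meta_kubernetes_pod_node_name"], "regex": "(.*)",
     "targetLabel": "node_name", "replacement": "${1}", "action": "replace"},
    {"sourceLabels": ["__meta_kubernetes_pod_host_ip"], "regex": "(.*)",
     "targetLabel": "ip", "replacement": "$1", "action": "replace"},
]

if __name__ == "__main__":
    target = {  # sample discovered target
        "__meta_kubernetes_endpoints_name": "hami-scheduler",
        "__meta_kubernetes_pod_node_name": "node126",
        "__meta_kubernetes_pod_host_ip": "192.168.110.126",
    }
    print(relabel(target, RULES))
```

A target whose endpoints name does not start with hami- returns None here, which corresponds to Prometheus dropping it before any scrape happens.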
Confirm that the ServiceMonitors were created.
Start a GPU pod as a test:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/gpumem: 1000
        nvidia.com/gpucores: 10
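If the pod stays in the Pending state, check the node. If the node reports 0 GPUs, the following setup is needed.
docker:
1: Download and install the NVIDIA-DOCKER2 package
2: Add the following to /etc/docker/daemon.json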
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
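If /etc/docker/daemon.json already has other settings, the NVIDIA runtime entries should be merged in rather than overwriting the file. The following is a minimal sketch of such a merge; the function name and the existing "registry-mirrors" entry are hypothetical, and reading and writing the actual file is left out.

```python
import json

# The settings this article adds to daemon.json.
NVIDIA_RUNTIME = {
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}
    },
}


def add_nvidia_runtime(existing: dict) -> dict:
    """Return a copy of the existing config with the NVIDIA runtime merged in."""
    merged = dict(existing)
    merged["default-runtime"] = NVIDIA_RUNTIME["default-runtime"]
    runtimes = dict(merged.get("runtimes", {}))  # keep any other runtimes already defined
    runtimes["nvidia"] = NVIDIA_RUNTIME["runtimes"]["nvidia"]
    merged["runtimes"] = runtimes
    return merged


if __name__ == "__main__":
    current = {"registry-mirrors": ["https://mirror.example.com"]}  # hypothetical existing settings
    print(json.dumps(add_nvidia_runtime(current), indent=2))
```

Serializing with json.dumps also guarantees valid JSON. Docker has to be restarted for a changed daemon.json to take effect.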
k8s:
1: Pull the k8s-device-plugin image
2: Write nvidia-device-plugin.yml to create the device plugin pod
Create it with this yml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
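Once the GPU pod is running, exec into it and check: if the GPU memory shown matches the configured limit, the limit has taken effect.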
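This check can also be scripted. The sketch below compares the memory total reported inside the pod with the nvidia.com/gpumem limit. It assumes the output was captured with nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits (one value in MiB per visible GPU) and that the gpumem limit uses the same unit; the sample strings are illustrative, not captured from a real pod.

```python
from typing import List


def memory_totals_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per non-empty line of the query output."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def limit_applied(nvidia_smi_output: str, gpumem_limit_mib: int) -> bool:
    """True if every GPU visible in the pod reports exactly the configured limit."""
    totals = memory_totals_mib(nvidia_smi_output)
    return bool(totals) and all(total == gpumem_limit_mib for total in totals)


if __name__ == "__main__":
    sample_output = "1000\n"  # illustrative output for a pod limited to gpumem: 1000
    print(limit_applied(sample_output, 1000))
```

If the pod instead reports the full physical memory of the card, the limit was not applied and the HAMi device plugin setup is worth rechecking.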
Open {scheduler node ip}:31993/metrics.
The output ends with these two lines:
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
The same deviceuuid appears for both pods, which shows that one GPU is shared by different pods.
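The same conclusion can be read out of the metrics mechanically. Here is a minimal sketch that parses vGPUPodsDeviceAllocated lines and groups pod names by deviceuuid. The parser is a simplified one written for this example (it does not handle escaped quotes in label values), and the sample text is the two lines from the scheduler's /metrics output.

```python
import re
from collections import defaultdict

SAMPLE = """\
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
"""

LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")   # metric{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value"


def pods_by_device(metrics_text: str, metric: str = "vGPUPodsDeviceAllocated") -> dict:
    """Map each deviceuuid to the list of pods that have memory allocated on it."""
    shared = defaultdict(list)
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if match is None or match.group(1) != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group(2)))
        shared[labels["deviceuuid"]].append(labels["podname"])
    return dict(shared)


if __name__ == "__main__":
    for uuid, pods in pods_by_device(SAMPLE).items():
        print(uuid, pods)
```

Any deviceuuid that maps to more than one pod is a GPU being shared.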
Exec into a pod of the hami-device-plugin DaemonSet and run nvidia-smi -L to list all the GPUs on the machine:
root@node126:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
root@node126:/#
The two ServiceMonitors created earlier scrape the /metrics endpoints of the Services labeled app.kubernetes.io/component: hami-scheduler and app.kubernetes.io/component: hami-device-plugin.
Once the gpu-pod is running, check the hami-vgpu-metrics-dashboard.