OpenStack: operation notes on passing an NVIDIA T4 GPU accelerator through on a compute node with a Hygon C86 CPU
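Table of contents
- Preface
- I. Check the state of the GPU accelerator on the physical host
- First machine
- Second machine
- Output from an x86 machine, for comparison
- II. Configure passthrough (prerequisite: IOMMU enabled in the BIOS)
- 1. Modify the kernel parameters
- 2. Add the PCI configuration to nova-compute, then simply restart it
- 3. Add the PCI configuration on the control nodes
- III. Confirm that the accelerator has been loaded into the database
- IV. Create a dedicated flavor for the accelerator and set its metadata
- V. Create a GPU instance with the dedicated flavor and check the passthrough result
- Summary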
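Preface
Two Xinchuang (domestic IT innovation) servers arrived, and I tried adding them to the compute cluster for colleagues to test on.
Both machines have a Hygon C86 7285 32-core Processor.
Passthrough failed on the first machine;
it succeeded on the second.
I. Check the state of the GPU accelerator on the physical host
Running lspci -v is enough for this check.
First machine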
63:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
Subsystem: NVIDIA Corporation Device 12a2
Physical Slot: 29
Flags: fast devsel, IRQ 255, NUMA node 3
Memory at <unassigned> (64-bit, prefetchable) [disabled]
Memory at <unassigned> (64-bit, prefetchable) [disabled]
Expansion ROM at <unassigned> [disabled]
Capabilities: [60] Power Management version 3
Capabilities: [68] #00 [0080]
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Capabilities: [bb0] #15
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: vfio-pci
Kernel modules: nouveau
Both memory regions are unassigned and disabled, so I gave up on this machine.
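The check above can be scripted. The following is a minimal sketch, assuming the `lspci -v` output format shown here; the function name `check_bars` is made up for illustration. It counts only `Memory at <unassigned>` lines, because the `Expansion ROM at <unassigned>` line also appears on the machine where passthrough worked.

```shell
#!/usr/bin/env bash
# Count unassigned memory BARs in `lspci -v` output read from stdin.
# Prints "ok" when every "Memory at" line has an address,
# otherwise prints "unassigned:<count>".
check_bars() {
    local bad
    # grep -c exits non-zero when the count is 0, hence the `|| true`.
    bad=$(grep -c 'Memory at <unassigned>' || true)
    if [ "$bad" -eq 0 ]; then
        echo "ok"
    else
        echo "unassigned:$bad"
    fi
}

# Usage on a real host (63:00.0 is the address from the output above):
#   lspci -v -s 63:00.0 | check_bars
```

On the first machine this would report two unassigned regions, which matches the decision to stop there.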
Second machine
71:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
Subsystem: NVIDIA Corporation Device 12a2
Flags: bus master, fast devsel, latency 0, IRQ 747, NUMA node 7
Memory at d9000000 (32-bit, non-prefetchable) [size=16M]
Memory at 16fd0000000 (64-bit, prefetchable) [size=256M]
Memory at 17000000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at <unassigned> [disabled]
Capabilities: [60] Power Management version 3
Capabilities: [68] #00 [0080]
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Capabilities: [bb0] #15
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: vfio-pci
Kernel modules: nouveau
Output from an x86 machine, for comparison
d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Subsystem: NVIDIA Corporation Device 12a2
Flags: bus master, fast devsel, latency 0, IRQ 372, NUMA node 1
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at 39ffc0000000 (64-bit, prefetchable) [size=256M]
Memory at 39fff0000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] #00 [0080]
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Capabilities: [bb0] #15
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: vfio-pci
Kernel modules: nouveau
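II. Configure passthrough (prerequisite: IOMMU enabled in the BIOS)
1. Modify the kernel parameters
Add the amd_iommu=on iommu=pt parameters to GRUB_CMDLINE_LINUX in /etc/default/grub: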
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto amd_iommu=on iommu=pt rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
Regenerate the GRUB configuration file, then reboot so that the new kernel parameters take effect:
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
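Once the host has rebooted, it may be worth confirming that the parameters actually reached the kernel. The sketch below assumes the two parameters used in this post; the helper name `has_iommu_params` is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 when a kernel command line (passed as $1) contains both
# amd_iommu=on and iommu=pt as whole, space-separated words.
has_iommu_params() {
    local cmdline=" $1 "
    case "$cmdline" in
        *" amd_iommu=on "*) ;;
        *) return 1 ;;
    esac
    case "$cmdline" in
        *" iommu=pt "*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Usage after the reboot:
#   has_iommu_params "$(cat /proc/cmdline)" && echo "IOMMU parameters active"
#   ls /sys/kernel/iommu_groups | wc -l   # non-zero once the IOMMU is enabled
```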
2. Add the PCI configuration to nova-compute, then simply restart it
/etc/kolla/nova-compute/nova.conf
[pci]
passthrough_whitelist = {"vendor_id":"10de","product_id":"1eb8"}
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}
Note: if you are not sure whether your card is type-PF or type-PCI, it is best to leave device_type out; passthrough fails if the value is wrong.
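One way to make an educated guess at the device type is to look at the device's sysfs directory. This is only a heuristic sketch: Nova takes its classification from libvirt, not from this script, and the function name `guess_device_type` is made up here. The idea is that a device exposing the SR-IOV capability (as the T4 does in the lspci output above) has a `sriov_totalvfs` file, while a virtual function has a `physfn` link.

```shell
#!/usr/bin/env bash
# Guess the Nova device_type for a PCI device from its sysfs directory.
# $1 is the device directory, e.g. /sys/bus/pci/devices/0000:71:00.0
# (the 0000 PCI domain is an assumption).
guess_device_type() {
    local dev="$1"
    if [ -e "$dev/physfn" ]; then
        echo "type-VF"      # device is a virtual function of some PF
    elif [ -e "$dev/sriov_totalvfs" ]; then
        echo "type-PF"      # device exposes the SR-IOV capability
    else
        echo "type-PCI"     # plain PCI device
    fi
}

# Usage on the compute node:
#   guess_device_type /sys/bus/pci/devices/0000:71:00.0
```

The value actually recorded by Nova can be read from the dev_type column of the nova.pci_devices table once nova-compute has reported the device.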
3. Add the PCI configuration on the control nodes
/etc/kolla/nova-api/nova.conf
scheduler_default_filters= AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter
[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}
/etc/kolla/nova-conductor/nova.conf
scheduler_default_filters= AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter
[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}
/etc/kolla/nova-scheduler/nova.conf
[filter_scheduler]
enabled_filters=AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter
[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}
Restart the corresponding services:
docker restart nova_api nova_conductor nova_scheduler
III. Confirm that the accelerator has been loaded into the database
Looking at the nova.pci_devices table, I found the record for this card.
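A query along the following lines could be used for that lookup. It is a sketch under assumptions: the helper name `pci_devices_query` is hypothetical, the column names are those I expect in the pci_devices table and may differ between releases, and the container name and credentials in the usage comment depend on the deployment.

```shell
#!/usr/bin/env bash
# Build a query listing the non-deleted rows in nova.pci_devices for one
# vendor/product pair. $1 = vendor_id, $2 = product_id (hex, without 0x).
pci_devices_query() {
    printf "SELECT compute_node_id, address, dev_type, status, instance_uuid FROM pci_devices WHERE vendor_id='%s' AND product_id='%s' AND deleted=0;\n" "$1" "$2"
}

# Usage on a control node (container name and login are assumptions):
#   docker exec -it mariadb mysql -u root -p nova -e "$(pci_devices_query 10de 1eb8)"
```

A row with status available and no instance_uuid would indicate a card that is free to be scheduled.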
IV. Create a dedicated flavor for the accelerator and set its metadata
Metadata (key, then value):
pci_passthrough:alias
Tesla T4:1
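The same metadata can be set from the command line. In this sketch the helper `pci_alias_property` is made up for illustration, and the flavor name and sizes in the usage comment are hypothetical; the alias name must match the one defined in nova.conf.

```shell
#!/usr/bin/env bash
# Format the flavor property that requests PCI devices by alias.
# $1 = alias name as defined in nova.conf, $2 = number of devices.
pci_alias_property() {
    printf 'pci_passthrough:alias=%s:%s\n' "$1" "$2"
}

# Usage (flavor name and sizes are hypothetical):
#   openstack flavor create --vcpus 8 --ram 16384 --disk 100 gpu.t4
#   openstack flavor set --property "$(pci_alias_property 'Tesla T4' 1)" gpu.t4
```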
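V. Create a GPU instance with the dedicated flavor and check the passthrough result
Create a test instance.
Check the passthrough result inside the instance.
Passthrough succeeded, and the instance was handed over to colleagues.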
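The in-guest check can be as simple as looking for the card's vendor:product ID in the PCI device list. A minimal sketch, assuming lspci is installed in the guest; the function name `has_pci_id` is hypothetical and its default ID is the 10de:1eb8 pair used in the configuration above.

```shell
#!/usr/bin/env bash
# Read `lspci -nn` output on stdin and succeed if a device with the
# given vendor:product ID (default 10de:1eb8) is listed.
has_pci_id() {
    local id="${1:-10de:1eb8}"
    # -F treats the bracketed ID as a fixed string, -i ignores hex case.
    grep -qiF "[$id]"
}

# Usage inside the instance:
#   lspci -nn | has_pci_id && echo "T4 visible in guest"
```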
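Summary
The passthrough procedure on a Xinchuang server is the same as on an x86 server; only the GRUB kernel parameters need to change.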