IREE vs. TensorRT Performance Comparison
- I. Background
- II. Procedure
- 1. Create the container
- 2. Install dependencies
- 3. Query device information
- 4. Generate the ONNX model
- 5. Benchmark the model with trtexec
- 6. Run resnet50 with IREE [PyTorch AOT]
- 7. Run resnet50 with IREE [ONNX import]
I. Background
- Goal: compare the performance of IREE and TensorRT on the same GPU with the same model.
- Result: TensorRT 1912.23 FPS vs. IREE 87.23 FPS (both fp16, single stream; a back-of-envelope check follows below).
- IREE is not using the Tensor Cores (the ncu metrics in step 7 show sm__pipe_tensor_cycles_active at 0%); the root cause is still unknown.
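How large is that gap in hardware terms? A rough sketch, assuming the commonly cited ~8.2 GFLOPs per ResNet-50 inference at 224x224 (the 3090 peak figures are likewise approximations, roughly 36-71 dense fp16 Tensor Core TFLOPS depending on accumulation precision):
cat> gap_check.py<<-'EOF'
# Back-of-envelope check; the per-inference FLOP count is an assumption
# (~4.1 GMACs x 2), not something measured in this document.
FLOPS_PER_INFER = 8.2e9
for name, fps in [("TensorRT", 1912.23), ("IREE", 87.23)]:
    print("{}: {:.2f} TFLOPS achieved".format(name, FLOPS_PER_INFER * fps / 1e12))
# TensorRT: ~15.7 TFLOPS -- plausible only with Tensor Cores engaged.
# IREE:     ~0.72 TFLOPS -- in line with plain CUDA-core arithmetic.
EOF
python gap_check.py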
II. Procedure
1. Create the container
mkdir iree_trt_bench
cd iree_trt_bench
docker stop iree_trt_bench
docker rm iree_trt_bench
docker run --gpus all --shm-size=32g -id -e NVIDIA_VISIBLE_DEVICES=all --privileged \
-v $PWD:/home -w /home \
--name iree_trt_bench --hostname=iree_trt_bench nvcr.io/nvidia/pytorch:23.07-py3 /bin/bash
docker exec -ti iree_trt_bench bash
2. Install dependencies
pip uninstall -y torch  # replace the container's own torch build with the PyPI one
python -m pip install torch -i https://pypi.tuna.tsinghua.edu.cn/simple/
python -m pip install iree-turbine -i https://pypi.tuna.tsinghua.edu.cn/simple/
python -m pip install iree-base-compiler[onnx] iree-base-runtime -i https://pypi.tuna.tsinghua.edu.cn/simple/
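A quick sanity check that the reinstalled stack imports cleanly (a minimal sketch; the script name is arbitrary):
cat> check_env.py<<-'EOF'
import importlib.metadata as md
import torch

# torch from PyPI replaced the container's own build above.
print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
for pkg in ("iree-turbine", "iree-base-compiler", "iree-base-runtime"):
    print(pkg, md.version(pkg))
EOF
python check_env.py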
3. Query device information
iree-run-module --list_drivers
iree-run-module --list_devices
Output
cuda: NVIDIA CUDA HAL driver (via dylib)
hip: HIP HAL driver (via dylib)
local-sync: Local execution using a lightweight inline synchronous queue
local-task: Local execution using the IREE multithreading task system
vulkan: Vulkan 1.x (dynamic)
cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0
local-sync://
local-task://
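The same listing is available from Python, which avoids hard-coding the cuda://GPU-... URI used later (a sketch, assuming the query_available_drivers/get_driver helpers exported by recent iree.runtime releases):
cat> list_devices.py<<-'EOF'
import iree.runtime as rt

# Equivalent of --list_drivers.
print(rt.query_available_drivers())
# Equivalent of --list_devices for CUDA: each entry carries the device path
# (the GPU UUID that goes into the cuda:// URI).
driver = rt.get_driver("cuda")
for device_info in driver.query_available_devices():
    print(device_info)
EOF
python list_devices.py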
4. Generate the ONNX model
cat> resnet50.py<<-'EOF'
import torch
import torchvision.models as models

# Build an fp16 ResNet-50 and a matching dummy input.
input_tensor = torch.ones((1, 3, 224, 224), dtype=torch.half)
model = models.resnet50(pretrained=False).half()
model.eval()
# Sanity-check a forward pass before exporting.
with torch.no_grad():
    output = model(input_tensor)
input_names = ["input"]
output_names = ["output"]
torch.onnx.export(model, input_tensor, "resnet50.onnx",
                  verbose=False, input_names=input_names,
                  output_names=output_names, opset_version=17,
                  export_params=True)
EOF
python resnet50.py
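Before benchmarking, it is worth verifying the export (a sketch; the onnx package is pulled in by iree-base-compiler[onnx]):
cat> check_onnx.py<<-'EOF'
import onnx

model = onnx.load("resnet50.onnx")
onnx.checker.check_model(model)  # raises if the graph is structurally invalid
print("opsets:", [(op.domain, op.version) for op in model.opset_import])
print("inputs:", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
EOF
python check_onnx.py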
5. Benchmark the model with trtexec
# int8
trtexec --onnx=resnet50.onnx \
--int8 --useCudaGraph --streams=3 --threads --useSpinWait --useManagedMemory \
--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# fp16
trtexec --onnx=resnet50.onnx \
--fp16 --useCudaGraph --streams=3 --threads --useSpinWait --useManagedMemory \
--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# fp16, single stream
trtexec --onnx=resnet50.onnx \
--fp16 --streams=1 --useSpinWait --noDataTransfers=0 \
--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# profiling: build an engine, then capture an nsys trace
trtexec --onnx=resnet50.onnx \
--fp16 --streams=1 --useSpinWait --noDataTransfers=0 --saveEngine="resnet50.engine" --skipInference \
--duration=30 --device=0 --memPoolSize=workspace:15360MiB
nsys profile --stats=true -o cuda_profiling_report.nsys-rep -f true -t cuda,nvtx trtexec --loadEngine="resnet50.engine" \
--fp16 --streams=1 --useSpinWait --noDataTransfers=0 \
--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000
Output (the three Throughput lines correspond, in order, to the int8, fp16, and fp16-single-stream runs above)
[01/08/2025-09:36:29] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:37:47] [I] Throughput: 4507.64 qps
[01/08/2025-09:38:18] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:39:31] [I] Throughput: 3253.95 qps
[01/08/2025-09:39:50] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:41:03] [I] Throughput: 1912.23 qps
[5/7] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
31.0 4910127944 436475 11249.5 10463.0 7073 20544 3779.4 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warpsize2x2x1…
11.4 1804272642 1578025 1143.4 1120.0 1056 1632 55.1 void genericReformat::copyVectorizedKernel<double, __half, __half, (bool)1, (bool)0, (int)1>(unsign…
10.9 1724732099 235025 7338.5 6368.0 6175 11232 1515.0 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warpsize2x2x1…
9.2 1464470288 268600 5452.2 5153.0 4160 7936 759.1 sm80_xmma_fprop_implicit_gemm_indexed_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warps…
5.1 802538557 134300 5975.7 5568.0 5087 9120 1124.2 sm80_xmma_fprop_implicit_gemm_indexed_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warps…
4.8 762171900 100725 7566.9 6816.0 6304 10464 1247.3 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize2x2x1…
4.2 666389805 100725 6615.9 6560.0 6272 7904 177.1 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize2x1x2…
4.2 659694863 100725 6549.5 6432.0 6208 7968 235.9 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x128x32_stage4_warpsize2x2…
3.3 524938823 100725 5211.6 5056.0 4928 6560 249.7 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x128x32_stage5_warpsize2x2x…
2.2 351567495 33575 10471.1 10464.0 9984 11936 173.8 sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x64_sta…
2.2 346470431 67150 5159.6 5216.0 4800 6240 213.7 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x32_stage4_warpsize4x1x…
1.6 250187681 33575 7451.6 7456.0 7296 7968 54.4 sm50_xmma_cublas_gemvx_f16f16_f32_f32_tn_n_int32_unit_n_launch_param16x32x32_strided_unit_stride_ex…
1.4 229378733 33575 6831.8 6816.0 6433 7616 118.5 trt_ampere_h16816gemm_64x64_ldg8_relu_stages_64x5_nn_v1
1.4 225713543 33575 6722.7 6720.0 6528 7808 100.7 sm80_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x128x32_stage5_warpsize2x2x1_tensor16x8x16_execute_…
1.4 225337960 33575 6711.5 6719.0 6528 7872 103.8 sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x64x32_stage5_warpsize2x2x…
1.2 192754632 33575 5741.0 5728.0 5600 6625 80.4 sm80_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize256x64x32_stage3_warpsize4x1x1_tensor16x8x16_execute_…
1.1 167363383 33575 4984.8 4992.0 4832 5856 78.8 trt_ampere_h16816gemm_64x64_ldg8_relu_stages_64x5_tn_v1
1.0 163191373 33575 4860.5 4832.0 4736 5792 78.0 trt_ampere_h16816gemm_128x64_ldg8_relu_stages_32x6_tn_v1
0.8 128487407 33575 3826.9 3809.0 3617 4607 68.2 sm50_xmma_pooling_max_nhwc_FP16FP32_WINDOWSIZE_3_PROPAGATE_NAN_2D_execute_kernel_trt
0.7 106704463 33575 3178.1 3168.0 3040 3968 53.9 void genericReformat::copyPackedKernel<float, __half, (bool)1, (bool)1, genericReformat::ArrayN<(in…
0.6 100768429 33575 3001.3 3008.0 2912 3488 50.7 sm50_xmma_pooling_fw_4d_FP16FP32NHWC_Average_FastDiv_CAlign4_execute_kernel_trt
0.2 39201047 33575 1167.6 1153.0 1119 1375 21.8 void genericReformat::copyVectorizedKernel<double, __half, float, (bool)1, (bool)0, (int)1>(unsigne…
[6/7] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- -------- ----------- ------------------
99.9 9934338 2 4967169.0 4967169.0 63104 9871234 6935395.2 [CUDA memcpy HtoD]
0.1 9023 5 1804.6 1632.0 672 4320 1483.4 [CUDA memset]
0.0 3839 3 1279.7 1279.0 1248 1312 32.0 [CUDA memcpy DtoD]
0.0 1344 1 1344.0 1344.0 1344 1344 0.0 [CUDA memcpy DtoH]
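Two things stand out in this trace. First, the kernel names (sm80_xmma_*_f16f16_*_nhwc_*, trt_ampere_h16816gemm_*) are Ampere Tensor Core implicit-GEMM/HMMA kernels operating on NHWC layouts. Second, the instance counts are multiples of 33575, i.e. 33575 inferences were captured, which allows a consistency check against the measured throughput (a sketch; the 31.0% / 4.91 s figures are read off the table above):
cat> trt_check.py<<-'EOF'
# Top kernel: 4.910 s at 31.0% of all kernel time => ~15.8 s total GPU time.
total_kernel_s = 4.910127944 / 0.310
per_infer_ms = total_kernel_s / 33575 * 1e3
print("~{:.2f} ms GPU time per inference (~{:.0f} FPS ceiling vs 1912.23 qps measured)"
      .format(per_infer_ms, 1e3 / per_infer_ms))
EOF
python trt_check.py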
6. Run resnet50 with IREE [PyTorch AOT]
cat> iree_forward_resnet50.py<<-'EOF'
import time

import iree.runtime as rt
import iree.turbine.aot as aot
import torch
import torchvision.models as models

input_tensor = torch.ones((1, 3, 224, 224), dtype=torch.half)
model = models.resnet50(pretrained=False).half()
model.eval()
# Export to MLIR and compile in-process for the CUDA backend.
export_output = aot.export(model, input_tensor)
export_output.save_mlir("resnet50.mlir")
compiled_binary = export_output.compile(save_to=None, target_backends="cuda")
# Device URI taken from `iree-run-module --list_devices` above.
config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
    rt.VmModule.copy_buffer(config.vm_instance, compiled_binary.map_memory()),
    config)
# warm up
for i in range(3):
    y = vmm.main(input_tensor)
# benchmark (each call returns host data, so the loop runs synchronously)
t0 = time.time()
for i in range(1000):
    y = vmm.main(input_tensor)
t1 = time.time()
print("{:.2f} FPS".format(1000 / (t1 - t0)))
EOF
python iree_forward_resnet50.py
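If you want to keep the compiled artifact instead of holding it only in memory, the same export can be written to disk; a sketch reusing the save_to parameter from the script above (the output name resnet50_aot.vmfb is arbitrary):
cat> save_vmfb.py<<-'EOF'
import iree.turbine.aot as aot
import torch
import torchvision.models as models

model = models.resnet50(pretrained=False).half().eval()
example = torch.ones((1, 3, 224, 224), dtype=torch.half)
# Same export as above, but persisted for later runs.
aot.export(model, example).compile(save_to="resnet50_aot.vmfb",
                                   target_backends="cuda")
EOF
python save_vmfb.py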
nsys profile --stats=true -o cuda_profiling_report.nsys-rep -f true -t cuda,nvtx python iree_forward_resnet50.py
Output
88.79 FPS
[5/7] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- -------------------------------------------------------------------------
16.6 1676559094 2006 835772.2 833919.5 825087 946527 12354.4 main_async_dispatch_69_conv_2d_nchw_fchw_1x512x7x7x512x3x3_f16xf16xf32
11.2 1129917546 5015 225307.6 223968.0 220224 282368 5630.3 main_async_dispatch_43_conv_2d_nchw_fchw_1x256x14x14x256x3x3_f16xf16xf32
8.4 852021677 1003 849473.3 867104.0 437792 992511 69058.9 main_async_dispatch_0_slow_memcpy
8.0 813216409 1003 810784.1 809056.0 800383 917631 11968.5 main_async_dispatch_63_conv_2d_nchw_fchw_1x512x7x7x512x3x3_f16xf16xf32
7.8 792092582 5015 157944.7 157632.0 154784 418303 4396.8 main_async_dispatch_42_matmul_like_256x14x14x1024_f16xf16xf32
5.4 544241759 2006 271307.0 270784.0 267328 307744 4031.4 main_async_dispatch_68_matmul_like_512x7x7x2048_f16xf16xf32
5.3 537730923 5015 107224.5 107008.0 105536 121600 1571.5 main_async_dispatch_45_matmul_like_1024x14x14x256_f16xf16xf32
3.1 311051603 3009 103373.7 103168.0 101535 117409 1559.3 main_async_dispatch_25_conv_2d_nchw_fchw_1x128x28x28x128x3x3_f16xf16xf32
2.7 277745062 1003 276914.3 276415.0 272480 313439 4107.9 main_async_dispatch_62_matmul_like_512x14x14x1024_f16xf16xf32
2.7 274588058 3009 91255.6 91104.0 76992 104000 1469.5 main_async_dispatch_27_matmul_like_512x28x28x128_f16xf16xf32
2.5 253636478 1003 252877.8 257440.0 219104 297728 9518.0 main_async_dispatch_37_conv_2d_nchw_fchw_1x256x14x14x256x3x3_f16xf16xf32
2.0 203783616 1003 203174.1 202720.0 200800 229600 2944.9 main_async_dispatch_71_matmul_like_2048x7x7x512_f16xf16xf32
2.0 203007098 1003 202399.9 200288.0 193056 240479 5555.6 main_async_dispatch_18_matmul_like_128x56x56x256_f16xf16xf32
2.0 197697763 1003 197106.4 196672.0 194400 224128 2932.8 main_async_dispatch_66_conv_2d_nchw_fchw_1x2048x7x7x1024x1x1_f16xf16xf32
1.7 170180476 2006 84835.7 84544.0 79135 99168 2180.3 main_async_dispatch_13_matmul_like_256x56x56x64_f16xf16xf32
1.6 158539120 1003 158064.9 157727.0 156224 179104 2349.6 main_async_dispatch_36_matmul_like_256x28x28x512_f16xf16xf32
1.5 150716523 1003 150265.7 149504.0 146367 398271 8291.8 main_async_dispatch_22_conv_2d_nchw_fchw_1x512x28x28x256x1x1_f16xf16xf32
1.5 150515646 1003 150065.4 149759.0 148288 169761 2202.5 main_async_dispatch_24_matmul_like_128x28x28x512_f16xf16xf32
1.5 150248820 1003 149799.4 149472.0 147968 169600 2211.9 main_async_dispatch_28_matmul_like_128x28x28x512_f16xf16xf32
1.5 149922213 1003 149473.8 149151.0 147583 169216 2209.9 main_async_dispatch_32_matmul_like_128x28x28x512_f16xf16xf32
1.5 148435384 2006 73995.7 73888.0 72224 84544 1155.5 main_async_dispatch_65_matmul_like_2048x49x512_f16xf16xf32
1.4 143103376 3009 47558.4 47424.0 46560 54559 788.6 main_async_dispatch_6_conv_2d_nchw_fchw_1x64x56x56x64x3x3_f16xf16xf32
1.2 126157349 1003 125780.0 125567.0 120128 144992 2611.2 main_async_dispatch_9_matmul_like_256x56x56x64_f16xf16xf32
1.0 96413797 1003 96125.4 96000.0 93504 111968 1687.2 main_async_dispatch_1_conv_2d_nchw_fchw_1x64x112x112x3x7x7_f16xf16xf32
0.9 91943170 1003 91668.2 91488.0 90016 105024 1403.8 main_async_dispatch_19_conv_2d_nchw_fchw_1x128x28x28x128x3x3_f16xf16xf32
0.8 84790670 1003 84537.1 84384.0 82656 96384 1336.5 main_async_dispatch_40_conv_2d_nchw_fchw_1x1024x14x14x512x1x1_f16xf16xf32
0.7 66988042 1003 66787.7 66528.0 65312 81184 1286.4 main_async_dispatch_10_matmul_like_64x56x56x256_f16xf16xf32
0.7 66982463 1003 66782.1 66528.0 65440 76480 1184.4 main_async_dispatch_14_matmul_like_64x56x56x256_f16xf16xf32
0.6 59671134 1003 59492.7 59360.0 58272 68544 990.6 main_async_dispatch_39_matmul_like_1024x196x256_f16xf16xf32
0.5 55640924 1003 55474.5 55360.0 54560 63488 894.0 main_async_dispatch_21_matmul_like_512x784x128_f16xf16xf32
0.5 46757807 1003 46618.0 46528.0 45759 53984 735.9 main_async_dispatch_8_matmul_like_256x3136x64_f16xf16xf32
0.4 37019023 1003 36908.3 36832.0 36000 41760 609.8 main_async_dispatch_77_matmul_1x1000x2048_f32xf16xf32
0.2 18740177 1003 18684.1 18720.0 17888 21696 435.7 main_async_dispatch_5_matmul_like_64x56x56x64_f16xf16xf32
0.2 17308891 6018 2876.2 2848.0 2720 3521 84.1 main_async_dispatch_38_elementwise_256x196_f32xf16xf16xf16
0.1 10054807 1003 10024.7 10016.0 9888 11392 151.4 main_async_dispatch_3_slow_memcpy
0.1 8109569 4012 2021.3 2016.0 1888 2400 50.9 main_async_dispatch_20_elementwise_128x784_f32xf16xf16xf16
0.1 7231109 3009 2403.2 2400.0 2272 2848 66.4 main_async_dispatch_7_elementwise_64x3136_f32xf16xf16xf16
0.1 7166236 1003 7144.8 7136.0 6976 8064 109.6 main_async_dispatch_76_generic_2048x49_f32xf16xf16xf16xf16
0.1 6366543 3009 2115.8 2112.0 1984 2528 69.9 main_async_dispatch_64_elementwise_512x49_f32xf16xf16xf16
0.1 5611206 1003 5594.4 5569.0 5440 6528 110.6 main_async_dispatch_67_elementwise_2048x49_f32xf16xf16xf32xf16xf16xf16
0.1 5175737 1003 5160.3 5152.0 5088 5920 84.0 main_async_dispatch_2_elementwise_64x12544_f32xf16xf16xf16
0.1 5088449 1003 5073.2 5056.0 4992 5792 79.9 main_async_dispatch_4_pooling_nchw_max_1x64x56x56x3x3_f16
0.0 3634256 1003 3623.4 3616.0 3456 5920 108.3 main_async_dispatch_23_elementwise_512x784_f32xf16xf16xf32xf16xf16xf16
0.0 3502200 1003 3491.7 3488.0 3360 4096 71.7 main_async_dispatch_41_elementwise_1024x196_f32xf16xf16xf32xf16xf16xf16
[6/7] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- -------- ----------- ---------------------------------
49.4 56373046 9511 5927.1 3615.0 1630 29697 5832.3 [CUDA Unified Memory memcpy HtoD]
27.0 30828177 8024 3842.0 2623.0 863 20896 3176.0 [CUDA Unified Memory memcpy DtoH]
19.8 22612324 18054 1252.5 1184.0 1119 2048 164.9 [CUDA memset]
3.7 4241628 1 4241628.0 4241628.0 4241628 4241628 0.0 [CUDA memcpy HtoD]
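The same consistency check as for TensorRT: the top dispatch (1.677 s) is 16.6% of all kernel time, and the instance counts are multiples of 1003, matching the 3 warm-up + 1000 timed iterations. Note also that every convolution dispatch is conv_2d_nchw_fchw ... f16xf16xf32, i.e. generic NCHW codegen rather than the NHWC Tensor Core kernels TensorRT selected. A sketch of the arithmetic:
cat> iree_check.py<<-'EOF'
# Top dispatch: 1.677 s at 16.6% of all kernel time => ~10.1 s total GPU time.
total_kernel_s = 1.676559094 / 0.166
per_infer_ms = total_kernel_s / 1003 * 1e3
print("~{:.1f} ms GPU time per inference (~{:.0f} FPS ceiling vs 88.79 FPS measured)"
      .format(per_infer_ms, 1e3 / per_infer_ms))
EOF
python iree_check.py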
7. Run resnet50 with IREE [ONNX import]
# import the ONNX model
iree-import-onnx resnet50.onnx -o resnet50.mlir
# compile the MLIR
iree-compile resnet50.mlir --iree-hal-target-backends=cuda --iree-cuda-target=sm_86 \
--iree-codegen-llvmgpu-use-mma-sync \
--iree-codegen-llvmgpu-use-wmma \
--iree-llvmgpu-enable-prefetch \
--iree-codegen-llvmgpu-vectorize-pipeline \
-o resnet50.vmfb
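The same compilation can also be driven from Python (a sketch using iree.compiler.compile_file; extra_args are forwarded to the underlying compiler, so the flags above can be passed through unchanged):
cat> compile_vmfb.py<<-'EOF'
import iree.compiler as ireec

# Compile the ONNX-imported MLIR; mirrors the iree-compile invocation above.
vmfb = ireec.compile_file(
    "resnet50.mlir",
    target_backends=["cuda"],
    extra_args=["--iree-cuda-target=sm_86"],
)
with open("resnet50.vmfb", "wb") as f:
    f.write(vmfb)
EOF
python compile_vmfb.py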
# run the benchmark script
cat> iree_forward.py<<-'EOF'
import time

import iree.runtime as rt
import torch

input_tensor = torch.ones((1, 3, 224, 224), dtype=torch.half)
# Device URI taken from `iree-run-module --list_devices` above.
config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
    rt.VmModule.copy_buffer(config.vm_instance, open("resnet50.vmfb", 'rb').read()),
    config)
print(vmm._vm_module.function_names)
# warm up
for i in range(3):
    y = vmm.main_graph(input_tensor)
# benchmark
t0 = time.time()
for i in range(1000):
    y = vmm.main_graph(input_tensor)
t1 = time.time()
print("{:.2f} FPS".format(1000 / (t1 - t0)))
EOF
python iree_forward.py
# ncu profiling (the script below runs only the warm-up loop to keep the capture short)
cat> iree_forward.py<<-'EOF'
import iree.runtime as rt
import torch

input_tensor = torch.ones((1, 3, 224, 224), dtype=torch.half)
config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
    rt.VmModule.copy_buffer(config.vm_instance, open("resnet50.vmfb", 'rb').read()),
    config)
print(vmm._vm_module.function_names)
# warm up only: three iterations still hit every dispatch
for i in range(3):
    y = vmm.main_graph(input_tensor)
EOF
ncu --clock-control=none --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
sm__cycles_elapsed.avg.per_second,\
sm__cycles_elapsed python iree_forward.py
Output
87.23 FPS
main_graph_async_dispatch_78_matmul_1x1000x2048_f16xf16xf32 (1000, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
------------------------------------------------------------------------------------ ------------- ------------
Metric Name Metric Unit Metric Value
------------------------------------------------------------------------------------ ------------- ------------
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 73.87
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 13.92
sm__cycles_elapsed.avg cycle 61517.49
sm__cycles_elapsed.max cycle 61774
sm__cycles_elapsed.min cycle 61365
sm__cycles_elapsed.sum cycle 5044434
sm__cycles_elapsed.avg.per_second cycle/nsecond 1.95
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second (!) n/a
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active % 0
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed % 0
------------------------------------------------------------------------------------ ------------- ------------