
IREE vs. TensorRT Performance Comparison

article 2025/3/1 18:40:39


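I. Background
II. Steps
- 1. Create the container
- 2. Install dependencies
- 3. Get device information
- 4. Generate the ONNX model
- 5. Benchmark the model with trtexec
- 6. Run resnet50 with IREE [PyTorch AOT path]
- 7. Run resnet50 with IREE [ONNX import path]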

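I. Background

The goal is to compare the performance of IREE and TensorRT on the same GPU (an NVIDIA GeForce RTX 3090) with the same model (fp16 resnet50, batch size 1).
Results: TensorRT (trtexec, fp16, single stream): 1912.23 FPS; IREE (ONNX import path): 87.23 FPS.
In the ncu profile, IREE did not use the Tensor Cores; the cause is still unknown.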
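
To put the two headline numbers in perspective, the short sketch below converts them into a speedup ratio and a per-frame latency. The variable and function names are illustrative only; the values are the ones quoted in this section.

```python
# Headline numbers quoted in the Background section.
trt_fps = 1912.23   # trtexec, fp16, single stream
iree_fps = 87.23    # IREE, ONNX import path


def latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


speedup = trt_fps / iree_fps
print(f"TensorRT is about {speedup:.1f}x faster")
print(f"TensorRT: {latency_ms(trt_fps):.3f} ms/frame")
print(f"IREE:     {latency_ms(iree_fps):.3f} ms/frame")
```

The ratio comes out at roughly 22x, which is the gap the rest of the article tries to explain.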

II. Steps

1. Create the container

mkdir iree_trt_bench
cd iree_trt_bench
docker stop iree_trt_bench
docker rm iree_trt_bench
docker run --gpus all --shm-size=32g -id -e NVIDIA_VISIBLE_DEVICES=all --privileged \
        -v $PWD:/home -w /home \
        --name iree_trt_bench --hostname=iree_trt_bench nvcr.io/nvidia/pytorch:23.07-py3 /bin/bash
docker exec -ti iree_trt_bench bash

2. Install dependencies

pip uninstall -y torch
python -m pip install torch -i https://pypi.tuna.tsinghua.edu.cn/simple/
python -m pip install iree-turbine -i https://pypi.tuna.tsinghua.edu.cn/simple/
python -m pip install iree-base-compiler[onnx] iree-base-runtime -i https://pypi.tuna.tsinghua.edu.cn/simple/
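
Since the container's torch is replaced and three IREE packages are added, it can help to record which versions actually ended up installed before benchmarking. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones from the pip commands above.

```python
from importlib import metadata

# Distribution names as installed in the step above.
PACKAGES = ["torch", "iree-turbine", "iree-base-compiler", "iree-base-runtime"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```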

3. Get device information

iree-run-module --list_drivers
iree-run-module --list_devices

Output

	cuda: NVIDIA CUDA HAL driver (via dylib)
	 hip: HIP HAL driver (via dylib)
local-sync: Local execution using a lightweight inline synchronous queue
local-task: Local execution using the IREE multithreading task system
  vulkan: Vulkan 1.x (dynamic)

cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0
local-sync://
local-task://
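
The `cuda://GPU-...` URI is specific to this machine, and the scripts in steps 6 and 7 hard-code it. As a sketch of one way to avoid that, the helper below picks the first URI for a given driver out of the listing text. The function name `first_device_uri` is hypothetical, and it assumes the listing prints one URI per line as shown above.

```python
def first_device_uri(listing: str, driver: str = "cuda"):
    """Return the first `<driver>://...` line in device-listing text, or None."""
    prefix = driver + "://"
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line
    return None


# Sample text copied from the output above.
sample = """\
cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0
local-sync://
local-task://
"""
print(first_device_uri(sample))
```

In practice the listing text could be captured with `subprocess.run(["iree-run-module", "--list_devices"], capture_output=True, text=True).stdout` and the result passed to `rt.Config(...)`.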

4. Generate the ONNX model

cat> resnet50.py<<-'EOF'
import torch
import torchvision.models as models

# Build the model and a dummy input in half precision (fp16)
input_tensor = torch.ones((1,3,224,224),dtype=torch.half)
model = models.resnet50(weights=None).half()
model.eval()

# Sanity-check the forward pass before exporting
with torch.no_grad():
    output = model(input_tensor)

input_names = ["input"]
output_names = ["output"]
torch.onnx.export(model, input_tensor, "resnet50.onnx",
					verbose=False, input_names=input_names,
					output_names=output_names,opset_version=17,export_params=True)
EOF
python resnet50.py

5. Benchmark the model with trtexec

# int8 (no calibration cache is supplied, so this run measures speed only, not accuracy)
trtexec --onnx=resnet50.onnx \
					--int8 --useCudaGraph --streams=3 --threads --useSpinWait --useManagedMemory \
					--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# fp16					
trtexec --onnx=resnet50.onnx \
					--fp16 --useCudaGraph --streams=3 --threads --useSpinWait --useManagedMemory \
					--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# fp16, single stream
trtexec --onnx=resnet50.onnx \
					--fp16 --streams=1 --useSpinWait  --noDataTransfers=0 \
					--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000 2>&1 | egrep "Throughput|Selected Device"
# profiling
trtexec --onnx=resnet50.onnx \
					--fp16 --streams=1 --useSpinWait  --noDataTransfers=0 --saveEngine="resnet50.engine" --skipInference \
					--duration=30 --device=0 --memPoolSize=workspace:15360MiB				
					
nsys profile --stats=true -o cuda_profing_report.nsys-rep -f true -t cuda,nvtx trtexec --loadEngine="resnet50.engine" \
					--fp16 --streams=1 --useSpinWait  --noDataTransfers=0 \
					--duration=30 --device=0 --memPoolSize=workspace:15360MiB --iterations=1000

Output

[01/08/2025-09:36:29] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:37:47] [I] Throughput: 4507.64 qps

[01/08/2025-09:38:18] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:39:31] [I] Throughput: 3253.95 qps

[01/08/2025-09:39:50] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:41:03] [I] Throughput: 1912.23 qps
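
The three throughput figures above can be compared side by side with a few lines of Python. This is only a sketch: the labels assume the log blocks appear in the same order as the first three trtexec commands (int8, fp16 with 3 streams, fp16 with 1 stream), and `parse_throughput` is a hypothetical helper name.

```python
import re

# Log lines copied from the output above.
LOG = """\
[01/08/2025-09:36:29] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:37:47] [I] Throughput: 4507.64 qps
[01/08/2025-09:38:18] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:39:31] [I] Throughput: 3253.95 qps
[01/08/2025-09:39:50] [I] Selected Device: NVIDIA GeForce RTX 3090
[01/08/2025-09:41:03] [I] Throughput: 1912.23 qps
"""

# Assumed order: same as the order of the commands in this step.
LABELS = ["int8, 3 streams", "fp16, 3 streams", "fp16, 1 stream"]


def parse_throughput(log: str):
    """Extract every 'Throughput: <x> qps' value from trtexec log text."""
    return [float(m) for m in re.findall(r"Throughput:\s*([\d.]+)\s*qps", log)]


results = dict(zip(LABELS, parse_throughput(LOG)))
baseline = results["fp16, 1 stream"]
for label, qps in results.items():
    print(f"{label:16s} {qps:8.2f} qps  ({qps / baseline:.2f}x of fp16, 1 stream)")
```

The fp16 single-stream figure is the one used as the TensorRT number in the Background section, since it is the closest match to the batch-1, one-call-at-a-time loop used for IREE.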

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     31.0       4910127944     436475   11249.5   10463.0      7073     20544       3779.4  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warpsize2x2x1…
     11.4       1804272642    1578025    1143.4    1120.0      1056      1632         55.1  void genericReformat::copyVectorizedKernel<double, __half, __half, (bool)1, (bool)0, (int)1>(unsign…
     10.9       1724732099     235025    7338.5    6368.0      6175     11232       1515.0  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warpsize2x2x1…
      9.2       1464470288     268600    5452.2    5153.0      4160      7936        759.1  sm80_xmma_fprop_implicit_gemm_indexed_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warps…
      5.1        802538557     134300    5975.7    5568.0      5087      9120       1124.2  sm80_xmma_fprop_implicit_gemm_indexed_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage5_warps…
      4.8        762171900     100725    7566.9    6816.0      6304     10464       1247.3  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize2x2x1…
      4.2        666389805     100725    6615.9    6560.0      6272      7904        177.1  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize2x1x2…
      4.2        659694863     100725    6549.5    6432.0      6208      7968        235.9  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x128x32_stage4_warpsize2x2…
      3.3        524938823     100725    5211.6    5056.0      4928      6560        249.7  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x128x32_stage5_warpsize2x2x…
      2.2        351567495      33575   10471.1   10464.0      9984     11936        173.8  sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x64_sta…
      2.2        346470431      67150    5159.6    5216.0      4800      6240        213.7  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x32_stage4_warpsize4x1x…
      1.6        250187681      33575    7451.6    7456.0      7296      7968         54.4  sm50_xmma_cublas_gemvx_f16f16_f32_f32_tn_n_int32_unit_n_launch_param16x32x32_strided_unit_stride_ex…
      1.4        229378733      33575    6831.8    6816.0      6433      7616        118.5  trt_ampere_h16816gemm_64x64_ldg8_relu_stages_64x5_nn_v1
      1.4        225713543      33575    6722.7    6720.0      6528      7808        100.7  sm80_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize64x128x32_stage5_warpsize2x2x1_tensor16x8x16_execute_…
      1.4        225337960      33575    6711.5    6719.0      6528      7872        103.8  sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x64x32_stage5_warpsize2x2x…
      1.2        192754632      33575    5741.0    5728.0      5600      6625         80.4  sm80_xmma_gemm_f16f16_f16f16_f16_tn_n_tilesize256x64x32_stage3_warpsize4x1x1_tensor16x8x16_execute_…
      1.1        167363383      33575    4984.8    4992.0      4832      5856         78.8  trt_ampere_h16816gemm_64x64_ldg8_relu_stages_64x5_tn_v1
      1.0        163191373      33575    4860.5    4832.0      4736      5792         78.0  trt_ampere_h16816gemm_128x64_ldg8_relu_stages_32x6_tn_v1
      0.8        128487407      33575    3826.9    3809.0      3617      4607         68.2  sm50_xmma_pooling_max_nhwc_FP16FP32_WINDOWSIZE_3_PROPAGATE_NAN_2D_execute_kernel_trt
      0.7        106704463      33575    3178.1    3168.0      3040      3968         53.9  void genericReformat::copyPackedKernel<float, __half, (bool)1, (bool)1, genericReformat::ArrayN<(in…
      0.6        100768429      33575    3001.3    3008.0      2912      3488         50.7  sm50_xmma_pooling_fw_4d_FP16FP32NHWC_Average_FastDiv_CAlign4_execute_kernel_trt
      0.2         39201047      33575    1167.6    1153.0      1119      1375         21.8  void genericReformat::copyVectorizedKernel<double, __half, float, (bool)1, (bool)0, (int)1>(unsigne…

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ------------------
     99.9          9934338      2  4967169.0  4967169.0     63104   9871234    6935395.2  [CUDA memcpy HtoD]
      0.1             9023      5     1804.6     1632.0       672      4320       1483.4  [CUDA memset]
      0.0             3839      3     1279.7     1279.0      1248      1312         32.0  [CUDA memcpy DtoD]
      0.0             1344      1     1344.0     1344.0      1344      1344          0.0  [CUDA memcpy DtoH]

6. Run resnet50 with IREE [PyTorch AOT path]

cat> iree_forward_resnet50.py<<-'EOF'  
import iree.turbine.aot as aot
import torch
import torchvision.models as models
import iree.runtime as rt
import time

input_tensor = torch.ones((1,3,224,224),dtype=torch.half)
model = models.resnet50(pretrained=False).half()
model.eval() 

export_output = aot.export(model, input_tensor)
export_output.save_mlir("resnet50.mlir")
compiled_binary = export_output.compile(save_to=None,target_backends="cuda")

config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
	rt.VmModule.copy_buffer(config.vm_instance, compiled_binary.map_memory()),
	config)
	
# warm up
for i in range(3):
    y = vmm.main(input_tensor)

# benchmark
t0=time.time()
for i in range(1000):
    y = vmm.main(input_tensor)
t1=time.time()
print("{:.2f} FPS".format(1000/(t1-t0)))

EOF
python iree_forward_resnet50.py
nsys profile --stats=true -o cuda_profing_report.nsys-rep -f true -t cuda,nvtx python iree_forward_resnet50.py

Output

88.79 FPS
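
The FPS figure above comes from a plain wall-clock loop: 3 untimed warm-up calls, then 1000 timed calls. The same pattern can be factored into a small helper, sketched below with `time.perf_counter()`, a monotonic clock intended for interval timing. The helper name `measure_fps` is hypothetical. Note that the figure includes everything each call does, including copying the host input to the device.

```python
import time


def measure_fps(fn, warmup: int = 3, iters: int = 1000) -> float:
    """Call fn() `warmup` times untimed, then `iters` times timed; return calls/second."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed


if __name__ == "__main__":
    # Stand-in workload; in the script above this would be
    # `lambda: vmm.main(input_tensor)`.
    fps = measure_fps(lambda: sum(range(1000)), warmup=3, iters=200)
    print(f"{fps:.2f} FPS")
```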

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                    Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -------------------------------------------------------------------------
     16.6       1676559094       2006  835772.2  833919.5    825087    946527      12354.4  main_async_dispatch_69_conv_2d_nchw_fchw_1x512x7x7x512x3x3_f16xf16xf32
     11.2       1129917546       5015  225307.6  223968.0    220224    282368       5630.3  main_async_dispatch_43_conv_2d_nchw_fchw_1x256x14x14x256x3x3_f16xf16xf32
      8.4        852021677       1003  849473.3  867104.0    437792    992511      69058.9  main_async_dispatch_0_slow_memcpy
      8.0        813216409       1003  810784.1  809056.0    800383    917631      11968.5  main_async_dispatch_63_conv_2d_nchw_fchw_1x512x7x7x512x3x3_f16xf16xf32
      7.8        792092582       5015  157944.7  157632.0    154784    418303       4396.8  main_async_dispatch_42_matmul_like_256x14x14x1024_f16xf16xf32
      5.4        544241759       2006  271307.0  270784.0    267328    307744       4031.4  main_async_dispatch_68_matmul_like_512x7x7x2048_f16xf16xf32
      5.3        537730923       5015  107224.5  107008.0    105536    121600       1571.5  main_async_dispatch_45_matmul_like_1024x14x14x256_f16xf16xf32
      3.1        311051603       3009  103373.7  103168.0    101535    117409       1559.3  main_async_dispatch_25_conv_2d_nchw_fchw_1x128x28x28x128x3x3_f16xf16xf32
      2.7        277745062       1003  276914.3  276415.0    272480    313439       4107.9  main_async_dispatch_62_matmul_like_512x14x14x1024_f16xf16xf32
      2.7        274588058       3009   91255.6   91104.0     76992    104000       1469.5  main_async_dispatch_27_matmul_like_512x28x28x128_f16xf16xf32
      2.5        253636478       1003  252877.8  257440.0    219104    297728       9518.0  main_async_dispatch_37_conv_2d_nchw_fchw_1x256x14x14x256x3x3_f16xf16xf32
      2.0        203783616       1003  203174.1  202720.0    200800    229600       2944.9  main_async_dispatch_71_matmul_like_2048x7x7x512_f16xf16xf32
      2.0        203007098       1003  202399.9  200288.0    193056    240479       5555.6  main_async_dispatch_18_matmul_like_128x56x56x256_f16xf16xf32
      2.0        197697763       1003  197106.4  196672.0    194400    224128       2932.8  main_async_dispatch_66_conv_2d_nchw_fchw_1x2048x7x7x1024x1x1_f16xf16xf32
      1.7        170180476       2006   84835.7   84544.0     79135     99168       2180.3  main_async_dispatch_13_matmul_like_256x56x56x64_f16xf16xf32
      1.6        158539120       1003  158064.9  157727.0    156224    179104       2349.6  main_async_dispatch_36_matmul_like_256x28x28x512_f16xf16xf32
      1.5        150716523       1003  150265.7  149504.0    146367    398271       8291.8  main_async_dispatch_22_conv_2d_nchw_fchw_1x512x28x28x256x1x1_f16xf16xf32
      1.5        150515646       1003  150065.4  149759.0    148288    169761       2202.5  main_async_dispatch_24_matmul_like_128x28x28x512_f16xf16xf32
      1.5        150248820       1003  149799.4  149472.0    147968    169600       2211.9  main_async_dispatch_28_matmul_like_128x28x28x512_f16xf16xf32
      1.5        149922213       1003  149473.8  149151.0    147583    169216       2209.9  main_async_dispatch_32_matmul_like_128x28x28x512_f16xf16xf32
      1.5        148435384       2006   73995.7   73888.0     72224     84544       1155.5  main_async_dispatch_65_matmul_like_2048x49x512_f16xf16xf32
      1.4        143103376       3009   47558.4   47424.0     46560     54559        788.6  main_async_dispatch_6_conv_2d_nchw_fchw_1x64x56x56x64x3x3_f16xf16xf32
      1.2        126157349       1003  125780.0  125567.0    120128    144992       2611.2  main_async_dispatch_9_matmul_like_256x56x56x64_f16xf16xf32
      1.0         96413797       1003   96125.4   96000.0     93504    111968       1687.2  main_async_dispatch_1_conv_2d_nchw_fchw_1x64x112x112x3x7x7_f16xf16xf32
      0.9         91943170       1003   91668.2   91488.0     90016    105024       1403.8  main_async_dispatch_19_conv_2d_nchw_fchw_1x128x28x28x128x3x3_f16xf16xf32
      0.8         84790670       1003   84537.1   84384.0     82656     96384       1336.5  main_async_dispatch_40_conv_2d_nchw_fchw_1x1024x14x14x512x1x1_f16xf16xf32
      0.7         66988042       1003   66787.7   66528.0     65312     81184       1286.4  main_async_dispatch_10_matmul_like_64x56x56x256_f16xf16xf32
      0.7         66982463       1003   66782.1   66528.0     65440     76480       1184.4  main_async_dispatch_14_matmul_like_64x56x56x256_f16xf16xf32
      0.6         59671134       1003   59492.7   59360.0     58272     68544        990.6  main_async_dispatch_39_matmul_like_1024x196x256_f16xf16xf32
      0.5         55640924       1003   55474.5   55360.0     54560     63488        894.0  main_async_dispatch_21_matmul_like_512x784x128_f16xf16xf32
      0.5         46757807       1003   46618.0   46528.0     45759     53984        735.9  main_async_dispatch_8_matmul_like_256x3136x64_f16xf16xf32
      0.4         37019023       1003   36908.3   36832.0     36000     41760        609.8  main_async_dispatch_77_matmul_1x1000x2048_f32xf16xf32
      0.2         18740177       1003   18684.1   18720.0     17888     21696        435.7  main_async_dispatch_5_matmul_like_64x56x56x64_f16xf16xf32
      0.2         17308891       6018    2876.2    2848.0      2720      3521         84.1  main_async_dispatch_38_elementwise_256x196_f32xf16xf16xf16
      0.1         10054807       1003   10024.7   10016.0      9888     11392        151.4  main_async_dispatch_3_slow_memcpy
      0.1          8109569       4012    2021.3    2016.0      1888      2400         50.9  main_async_dispatch_20_elementwise_128x784_f32xf16xf16xf16
      0.1          7231109       3009    2403.2    2400.0      2272      2848         66.4  main_async_dispatch_7_elementwise_64x3136_f32xf16xf16xf16
      0.1          7166236       1003    7144.8    7136.0      6976      8064        109.6  main_async_dispatch_76_generic_2048x49_f32xf16xf16xf16xf16
      0.1          6366543       3009    2115.8    2112.0      1984      2528         69.9  main_async_dispatch_64_elementwise_512x49_f32xf16xf16xf16
      0.1          5611206       1003    5594.4    5569.0      5440      6528        110.6  main_async_dispatch_67_elementwise_2048x49_f32xf16xf16xf32xf16xf16xf16
      0.1          5175737       1003    5160.3    5152.0      5088      5920         84.0  main_async_dispatch_2_elementwise_64x12544_f32xf16xf16xf16
      0.1          5088449       1003    5073.2    5056.0      4992      5792         79.9  main_async_dispatch_4_pooling_nchw_max_1x64x56x56x3x3_f16
      0.0          3634256       1003    3623.4    3616.0      3456      5920        108.3  main_async_dispatch_23_elementwise_512x784_f32xf16xf16xf32xf16xf16xf16
      0.0          3502200       1003    3491.7    3488.0      3360      4096         71.7  main_async_dispatch_41_elementwise_1024x196_f32xf16xf16xf32xf16xf16xf16

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)              Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ---------------------------------
     49.4         56373046   9511     5927.1     3615.0      1630     29697       5832.3  [CUDA Unified Memory memcpy HtoD]
     27.0         30828177   8024     3842.0     2623.0       863     20896       3176.0  [CUDA Unified Memory memcpy DtoH]
     19.8         22612324  18054     1252.5     1184.0      1119      2048        164.9  [CUDA memset]
      3.7          4241628      1  4241628.0  4241628.0   4241628   4241628          0.0  [CUDA memcpy HtoD]
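
The two `cuda_gpu_kern_sum` tables allow a rough estimate of how much GPU kernel time one inference costs in each framework. The sketch below is an approximation under stated assumptions: the total kernel time is reconstructed from the first row of each table (total time divided by its rounded percentage), and the number of inferences is taken from the instance count of the max-pooling kernel, which is assumed to run once per forward pass (33575 for TensorRT, 1003 for IREE, i.e. 3 warm-up plus 1000 timed calls). The function name is hypothetical.

```python
def kernel_ms_per_inference(top_total_ns: int, top_pct: float, inferences: int) -> float:
    """Estimate GPU kernel time per inference from one row of cuda_gpu_kern_sum.

    top_total_ns / (top_pct / 100) approximates the summed time of all kernels.
    """
    all_kernels_ns = top_total_ns / (top_pct / 100.0)
    return all_kernels_ns / inferences / 1e6


# First row of each table; inference counts from the max-pooling kernel.
trt_ms = kernel_ms_per_inference(4910127944, 31.0, 33575)
iree_ms = kernel_ms_per_inference(1676559094, 16.6, 1003)

print(f"TensorRT: {trt_ms:.3f} ms of GPU kernels per inference")
print(f"IREE:     {iree_ms:.3f} ms of GPU kernels per inference")
print(f"ratio:    {iree_ms / trt_ms:.1f}x")
```

Under these assumptions IREE spends roughly 10 ms in GPU kernels per inference against roughly 0.47 ms for TensorRT. That is close to the measured end-to-end gap, which suggests the difference lies mostly in the generated kernels rather than in host-side overhead.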

7. Run resnet50 with IREE [ONNX import path]

# Import the ONNX model
iree-import-onnx resnet50.onnx -o resnet50.mlir
# Compile the MLIR
iree-compile resnet50.mlir --iree-hal-target-backends=cuda --iree-cuda-target=sm_86 \
	--iree-codegen-llvmgpu-use-mma-sync \
	--iree-codegen-llvmgpu-use-wmma \
	--iree-llvmgpu-enable-prefetch \
	--iree-codegen-llvmgpu-vectorize-pipeline \
	-o resnet50.vmfb
	
# Run the test program
cat> iree_forward.py<<-'EOF'  
import torch
import iree.runtime as rt
import time

input_tensor = torch.ones((1,3,224,224),dtype=torch.half)

config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
	rt.VmModule.copy_buffer(config.vm_instance, open("resnet50.vmfb",'rb').read()),
	config)
print(vmm._vm_module.function_names)
# warm up
for i in range(3):
    y = vmm.main_graph(input_tensor)
# benchmark
t0=time.time()
for i in range(1000):
    y = vmm.main_graph(input_tensor)
t1=time.time()
print("{:.2f} FPS".format(1000/(t1-t0)))
EOF
python iree_forward.py

# ncu profiling
cat> iree_forward.py<<-'EOF'  
import torch
import iree.runtime as rt
input_tensor = torch.ones((1,3,224,224),dtype=torch.half)
config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
	rt.VmModule.copy_buffer(config.vm_instance, open("resnet50.vmfb",'rb').read()),
	config)
print(vmm._vm_module.function_names)
# warm up
for i in range(3):
    y = vmm.main_graph(input_tensor)
EOF
ncu --clock-control=none --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
sm__cycles_elapsed.avg.per_second,\
sm__cycles_elapsed python iree_forward.py

Output

87.23 FPS

main_graph_async_dispatch_78_matmul_1x1000x2048_f16xf16xf32 (1000, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
    Section: Command line profiler metrics
    ------------------------------------------------------------------------------------ ------------- ------------
    Metric Name                                                                            Metric Unit Metric Value
    ------------------------------------------------------------------------------------ ------------- ------------
    gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed                                 %        73.87
    gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed                                           %        13.92
    sm__cycles_elapsed.avg                                                                       cycle     61517.49
    sm__cycles_elapsed.max                                                                       cycle        61774
    sm__cycles_elapsed.min                                                                       cycle        61365
    sm__cycles_elapsed.sum                                                                       cycle      5044434
    sm__cycles_elapsed.avg.per_second                                                    cycle/nsecond         1.95
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed                    (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second                                       (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed                    (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum                                                  (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed                    (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained                                   (!) n/a
    sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second                                       (!) n/a
    sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active                                   %            0
    sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed                                  %            0
    ------------------------------------------------------------------------------------ ------------- ------------
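
The ncu excerpt above covers a single kernel, the final 1x1000x2048 matmul. To run the same check over every kernel in a longer report, the metric rows can be parsed programmatically. The sketch below works on a few rows copied from the table; `parse_metrics` is a hypothetical helper that assumes each row is `name, optional unit, value` separated by whitespace, with `n/a` meaning the metric was not available.

```python
# A subset of the rows shown above.
NCU_ROWS = """\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed   %   73.87
sm__cycles_elapsed.avg   cycle   61517.49
sm__cycles_elapsed.avg.per_second   cycle/nsecond   1.95
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum   (!) n/a
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active   %   0
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed   %   0
"""


def parse_metrics(text: str) -> dict:
    """Map metric name -> float value, or None when the value is 'n/a'."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        # Metric names contain "__"; skip separators, headers and kernel lines.
        if len(parts) < 2 or "__" not in parts[0]:
            continue
        name, last = parts[0], parts[-1]
        if last == "n/a":
            metrics[name] = None
            continue
        try:
            metrics[name] = float(last.replace(",", ""))
        except ValueError:
            continue  # not a numeric metric row
    return metrics


m = parse_metrics(NCU_ROWS)
tensor_active = m["sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active"]
# cycles / (cycles per ns) = ns; divide by 1000 for microseconds.
duration_us = m["sm__cycles_elapsed.avg"] / m["sm__cycles_elapsed.avg.per_second"] / 1000.0

print("tensor pipe used:", tensor_active > 0)
print(f"kernel duration: {duration_us:.1f} us")
```

For this kernel the tensor pipe shows 0% activity and the duration works out to about 31.5 us. Applying the same check to the convolution dispatches, which dominate the nsys table in step 6, would show whether they also bypass the Tensor Cores.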