Rocprofiler测试
Rocprofiler测试
- 一.参考链接
- 二.测试过程
- 1.登录服务器
- 2.使用smi获取列表
- 3.使用rocminfo获取Agent信息
- 4.准备测试用例
- 5.The hardware counters are called the basic counters
- 6.The derived metrics are defined on top of the basic counters using mathematical expression
- 7.Profing
Rocprofiler测试
一.参考链接
- Compatibility matrix
- AMD Radeon Pro VII
- Radeon™ PRO VII Specifications
- 6.2.0 Supported GPUs
- Performance model&相关名词解释
二.测试过程
1.登录服务器
.TODO
2.使用smi获取列表
rocm-smi
输出
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Socket) (Mem, Compute, ID)
========================================================================================================================
0 1 0x66a1, 3820 35.0°C 20.0W N/A, N/A, 0 860Mhz 350Mhz 9.41% auto 190.0W 0% 0%
1 2 0x66a1, 22570 38.0°C 17.0W N/A, N/A, 0 860Mhz 350Mhz 9.41% auto 190.0W 0% 0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================
3.使用rocminfo获取Agent信息
在 ROCm(Radeon Open Compute)平台中,Agent 通常指的是计算设备或处理单元,这些可以是 CPU 或 GPU。每个 Agent 可以执行计算任务并具有自己的计算资源,如计算核心、内存等。在 ROCm 的程序模型中,Agent 是负责执行特定任务的实体,当你使用 ROCm 进行并行计算时,任务通常会分配给不同的 Agent 来处理。Agent 是 ROCm 的异构计算环境中进行任务调度和管理的基本单元之一
rocminfo
输出
*******
Agent 2
*******
Name: gfx906
Uuid: GPU-021860c17348c2f7
Marketing Name: AMD Radeon (TM) Pro VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26273(0x66a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 1792
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
4.准备测试用例
tee ROCmMatrixTranspose.cpp<<-'EOF'
#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
// roctx header file
#include <roctracer/roctx.h>
#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1
// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
out[y * width + x] = in[x * width + y];
}
// CPU implementation of matrix transpose
void matrixTransposeCPUReference(float* output, float* input, const unsigned int width) {
for (unsigned int j = 0; j < width; j++) {
for (unsigned int i = 0; i < width; i++) {
output[i * width + j] = input[j * width + i];
}
}
}
int main() {
float* Matrix;
float* TransposeMatrix;
float* cpuTransposeMatrix;
float* gpuMatrix;
float* gpuTransposeMatrix;
hipDeviceProp_t devProp;
hipGetDeviceProperties(&devProp, 0);
std::cout << "Device name " << devProp.name << std::endl;
int i;
int errors;
Matrix = (float*)malloc(NUM * sizeof(float));
TransposeMatrix = (float*)malloc(NUM * sizeof(float));
cpuTransposeMatrix = (float*)malloc(NUM * sizeof(float));
// initialize the input data
for (i = 0; i < NUM; i++) {
Matrix[i] = (float)i * 10.0f;
}
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
uint32_t iterations = 1;
while (iterations-- > 0) {
std::cout << "## Iteration (" << iterations << ") #################" << std::endl;
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
roctxMark("ROCTX-MARK: before hipLaunchKernel");
roctxRangePush("ROCTX-RANGE: hipLaunchKernel");
roctx_range_id_t roctx_id = roctxRangeStartA("roctx_range with id");
// Lauching kernel from host
hipLaunchKernelGGL(
matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);
roctxRangeStop(roctx_id);
roctxMark("ROCTX-MARK: after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("ROCTX-RANGE: hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
// CPU MatrixTranspose computation
matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);
// verify the results
errors = 0;
double eps = 1.0E-6;
for (i = 0; i < NUM; i++) {
if (std::abs(TransposeMatrix[i] - cpuTransposeMatrix[i]) > eps) {
errors++;
}
}
if (errors != 0) {
printf("FAILED: %d errors\n", errors);
} else {
printf("PASSED!\n");
}
}
// free the resources on device side
hipFree(gpuMatrix);
hipFree(gpuTransposeMatrix);
// free the resources on host side
free(Matrix);
free(TransposeMatrix);
free(cpuTransposeMatrix);
return errors;
}
EOF
/opt/rocm/bin/hipcc -c ROCmMatrixTranspose.cpp -o ROCmMatrixTranspose.cpp.o
/opt/rocm/bin/hipcc ROCmMatrixTranspose.cpp.o -o ROCmMatrixTranspose \
/opt/rocm/lib/libamd_comgr.so.2.8.60200 /usr/lib/x86_64-linux-gnu/libnuma.so /opt/rocm/lib/libroctx64.so
./ROCmMatrixTranspose
5.The hardware counters are called the basic counters
rocprof --list-basic | grep -A 2 "gpu-agent2"
输出
gpu-agent2 : TCC_EA1_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
block TCC has 4 counters
gpu-agent2 : TCC_EA1_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
block TCC has 4 counters
gpu-agent2 : TCC_EA1_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
block TCC has 4 counters
gpu-agent2 : TCC_EA1_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
block TCC has 4 counters
gpu-agent2 : TCC_EA1_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
block TCC has 4 counters
gpu-agent2 : GRBM_COUNT : Tie High - Count Number of Clocks
block GRBM has 2 counters
gpu-agent2 : GRBM_GUI_ACTIVE : The GUI is Active
block GRBM has 2 counters
gpu-agent2 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)
block SQ has 8 counters
gpu-agent2 : SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic). Units in quad-cycles(4 cycles)
block SQ has 8 counters
gpu-agent2 : SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)
block SQ has 8 counters
gpu-agent2 : SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)
block SQ has 8 counters
gpu-agent2 : SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)
block SQ has 8 counters
gpu-agent2 : TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.
block TA has 2 counters
gpu-agent2 : TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.
block TA has 2 counters
gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.
block TA has 2 counters
gpu-agent2 : TCC_HIT[0-15] : Number of cache hits.
block TCC has 4 counters
gpu-agent2 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
block TCC has 4 counters
gpu-agent2 : TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
block TCC has 4 counters
gpu-agent2 : TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
block TCC has 4 counters
gpu-agent2 : TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
block TCC has 4 counters
gpu-agent2 : TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
block TCC has 4 counters
gpu-agent2 : TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
block TCC has 4 counters
gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.
block TCP has 4 counters
6.The derived metrics are defined on top of the basic counters using mathematical expression
rocprof --list-derived | grep -A 2 "gpu-agent2"
输出
gpu-agent2 : TCC_EA1_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC EA1s.
TCC_EA1_RDREQ_32B_sum = sum(TCC_EA1_RDREQ_32B,16)
gpu-agent2 : TCC_EA1_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC EA1s.
TCC_EA1_RDREQ_sum = sum(TCC_EA1_RDREQ,16)
gpu-agent2 : TCC_EA1_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC EA1s.
TCC_EA1_WRREQ_sum = sum(TCC_EA1_WRREQ,16)
gpu-agent2 : TCC_EA1_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC EA1s.
TCC_EA1_WRREQ_64B_sum = sum(TCC_EA1_WRREQ_64B,16)
gpu-agent2 : TCC_WRREQ1_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
TCC_WRREQ1_STALL_max = max(TCC_EA1_WRREQ_STALL,16)
gpu-agent2 : RDATA1_SIZE : The total kilobytes fetched from the video memory. This is measured on EA1s.
RDATA1_SIZE = (TCC_EA1_RDREQ_32B_sum*32+(TCC_EA1_RDREQ_sum-TCC_EA1_RDREQ_32B_sum)*64)
gpu-agent2 : WDATA1_SIZE : The total kilobytes written to the video memory. This is measured on EA1s.
WDATA1_SIZE = ((TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)*32+TCC_EA1_WRREQ_64B_sum*64)
gpu-agent2 : FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
FETCH_SIZE = (TCC_EA_RDREQ_32B_sum*32+(TCC_EA_RDREQ_sum-TCC_EA_RDREQ_32B_sum)*64+RDATA1_SIZE)/1024
gpu-agent2 : WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WRITE_SIZE = ((TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)*32+TCC_EA_WRREQ_64B_sum*64+WDATA1_SIZE)/1024
gpu-agent2 : WRITE_REQ_32B : The total number of 32-byte effective memory writes.
WRITE_REQ_32B = (TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)+(TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)+(TCC_EA_WRREQ_64B_sum+TCC_EA1_WRREQ_64B_sum)*2
gpu-agent2 : TA_BUSY_avr : TA block is busy. Average over TA instances.
TA_BUSY_avr = avr(TA_TA_BUSY,16)
gpu-agent2 : TA_BUSY_max : TA block is busy. Max over TA instances.
TA_BUSY_max = max(TA_TA_BUSY,16)
gpu-agent2 : TA_BUSY_min : TA block is busy. Min over TA instances.
TA_BUSY_min = min(TA_TA_BUSY,16)
gpu-agent2 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)
gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)
gpu-agent2 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
gpu-agent2 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
TCC_MISS_sum = sum(TCC_MISS,16)
gpu-agent2 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)
gpu-agent2 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)
gpu-agent2 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)
gpu-agent2 : TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.
TCC_EA_WRREQ_64B_sum = sum(TCC_EA_WRREQ_64B,16)
gpu-agent2 : TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
TCC_WRREQ_STALL_max = max(TCC_EA_WRREQ_STALL,16)
gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_sum : Total number of TCP stalls TA data interface.
TCP_TCP_TA_DATA_STALL_CYCLES_sum = sum(TCP_TCP_TA_DATA_STALL_CYCLES,16)
gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_max : Maximum number of TCP stalls TA data interface.
TCP_TCP_TA_DATA_STALL_CYCLES_max = max(TCP_TCP_TA_DATA_STALL_CYCLES,16)
gpu-agent2 : VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
VFetchInsts = (SQ_INSTS_VMEM_RD-TA_FLAT_READ_WAVEFRONTS_sum)/SQ_WAVES
gpu-agent2 : VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
VWriteInsts = (SQ_INSTS_VMEM_WR-TA_FLAT_WRITE_WAVEFRONTS_sum)/SQ_WAVES
gpu-agent2 : FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
FlatVMemInsts = (SQ_INSTS_FLAT-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVES
gpu-agent2 : LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS.
LDSInsts = (SQ_INSTS_LDS-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVES
gpu-agent2 : FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).
FlatLDSInsts = SQ_INSTS_FLAT_LDS_ONLY/SQ_WAVES
gpu-agent2 : VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUUtilization = 100*SQ_THREAD_CYCLES_VALU/(SQ_ACTIVE_INST_VALU*MAX_WAVE_SIZE)
gpu-agent2 : VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
VALUBusy = 100*SQ_ACTIVE_INST_VALU*4/SIMD_NUM/GRBM_GUI_ACTIVE
gpu-agent2 : SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy = 100*SQ_INST_CYCLES_SALU*4/SIMD_NUM/GRBM_GUI_ACTIVE
gpu-agent2 : FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
FetchSize = FETCH_SIZE
gpu-agent2 : WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize = WRITE_SIZE
gpu-agent2 : MemWrites32B : The total number of effective 32B write transactions to the memory
MemWrites32B = WRITE_REQ_32B
gpu-agent2 : L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheHit = 100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16))
gpu-agent2 : MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalled = 100*max(TCP_TCP_TA_DATA_STALL_CYCLES,16)/GRBM_GUI_ACTIVE/SE_NUM
gpu-agent2 : WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
WriteUnitStalled = 100*TCC_WRREQ_STALL_max/GRBM_GUI_ACTIVE
gpu-agent2 : LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
LDSBankConflict = 100*SQ_LDS_BANK_CONFLICT/GRBM_GUI_ACTIVE/CU_NUM
gpu-agent2 : GPUBusy : The percentage of time GPU was busy.
GPUBusy = 100*GRBM_GUI_ACTIVE/GRBM_COUNT
gpu-agent2 : Wavefronts : Total wavefronts.
Wavefronts = SQ_WAVES
gpu-agent2 : VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).
VALUInsts = SQ_INSTS_VALU/SQ_WAVES
gpu-agent2 : SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).
SALUInsts = SQ_INSTS_SALU/SQ_WAVES
gpu-agent2 : SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
SFetchInsts = SQ_INSTS_SMEM/SQ_WAVES
gpu-agent2 : GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
GDSInsts = SQ_INSTS_GDS/SQ_WAVES
gpu-agent2 : MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusy = 100*max(TA_TA_BUSY,16)/GRBM_GUI_ACTIVE/SE_NUM
gpu-agent2 : ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
ALUStalledByLDS = 100*SQ_WAIT_INST_LDS*4/SQ_WAVES/GRBM_GUI_ACTIVE
7.Profing
tee input.txt<<-'EOF'
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0 : 1
gpu: 0
kernel:matrixTranspose
EOF
rocprof -i input.txt ./ROCmMatrixTranspose
cat /root/input.csv
rocprofv2 -i input.txt ./ROCmMatrixTranspose
rocprofv2 --hsa-trace ./ROCmMatrixTranspose
输出
RPL: on '240920_102257' from '/opt/rocm-6.2.0' in '/root'
RPL: profiling '"./ROCmMatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_240920_102257_47892'
RPL: result dir '/tmp/rpl_data_240920_102257_47892/input0_results_240920_102257'
ROCProfiler: input from "/tmp/rpl_data_240920_102257_47892/input0.xml"
gpu_index = 0
kernel = matrixTranspose
range = 0:1
4 metrics
Wavefronts, VALUInsts, SALUInsts, SFetchInsts
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!
ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240920_102257_47892/input0_results_240920_102257
File '/root/input.csv' is generating
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,SALUInsts,SFetchInsts
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,48178,48178,1048576,16,0,0,8,0,16,64,0x0,0x742031870880,65536.0000000000,14.0000000000,4.0000000000,3.0000000000
ROCProfilerV2: Collecting the following counters:
- Wavefronts
- VALUInsts
- SALUInsts
- SFetchInsts
Enabling Counter Collection
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!
Dispatch_ID(0), GPU_ID(1), Queue_ID(1), Process_ID(48209), Thread_ID(48209), Grid_Size(1048576), Workgroup_Size(16), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("matrixTranspose(float*, float*, int) (.kd)"), Begin_Timestamp(951172884265490), End_Timestamp(951172884454463), Correlation_ID(0), SALUInsts(4.000000), SFetchInsts(3.000000), VALUInsts(14.000000), Wavefronts(65536.000000)