当前位置: 首页 > article >正文

Rocprofiler测试

Rocprofiler测试

  • 一.参考链接
  • 二.测试过程
    • 1.登录服务器
    • 2.使用smi获取列表
    • 3.使用rocminfo获取Agent信息
    • 4.准备测试用例
    • 5.The hardware counters are called the basic counters
    • 6.The derived metrics are defined on top of the basic counters using mathematical expression
    • 7.Profing

Rocprofiler测试

一.参考链接

  • Compatibility matrix
  • AMD Radeon Pro VII
  • Radeon™ PRO VII Specifications
  • 6.2.0 Supported GPUs
  • Performance model&相关名词解释

二.测试过程

1.登录服务器

.TODO

2.使用smi获取列表

rocm-smi

输出

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
========================================================================================================================
0       1     0x66a1,   3820   35.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
1       2     0x66a1,   22570  38.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

3.使用rocminfo获取Agent信息

在 ROCm(Radeon Open Compute)平台中,Agent 通常指的是计算设备或处理单元,这些可以是 CPU 或 GPU。每个 Agent 可以执行计算任务并具有自己的计算资源,如计算核心、内存等。在 ROCm 的程序模型中,Agent 是负责执行特定任务的实体,当你使用 ROCm 进行并行计算时,任务通常会分配给不同的 Agent 来处理。Agent 是 ROCm 的异构计算环境中进行任务调度和管理的基本单元之一

rocminfo

输出

*******
Agent 2
*******
  Name:                    gfx906
  Uuid:                    GPU-021860c17348c2f7
  Marketing Name:          AMD Radeon (TM) Pro VII
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 26273(0x66a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1700
  BDFID:                   1792
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 472
  SDMA engine uCode::      145
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******

4.准备测试用例

tee ROCmMatrixTranspose.cpp<<-'EOF'
#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
// roctx header file
#include <roctracer/roctx.h>

#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1

// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
  int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
  int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
  out[y * width + x] = in[x * width + y];
}

// CPU implementation of matrix transpose
void matrixTransposeCPUReference(float* output, float* input, const unsigned int width) {
  for (unsigned int j = 0; j < width; j++) {
    for (unsigned int i = 0; i < width; i++) {
      output[i * width + j] = input[j * width + i];
    }
  }
}

int main() {
  float* Matrix;
  float* TransposeMatrix;
  float* cpuTransposeMatrix;

  float* gpuMatrix;
  float* gpuTransposeMatrix;

  hipDeviceProp_t devProp;
  hipGetDeviceProperties(&devProp, 0);

  std::cout << "Device name " << devProp.name << std::endl;

  int i;
  int errors;

  Matrix = (float*)malloc(NUM * sizeof(float));
  TransposeMatrix = (float*)malloc(NUM * sizeof(float));
  cpuTransposeMatrix = (float*)malloc(NUM * sizeof(float));

  // initialize the input data
  for (i = 0; i < NUM; i++) {
    Matrix[i] = (float)i * 10.0f;
  }

  // allocate the memory on the device side
  hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
  hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));

  uint32_t iterations = 1;
  while (iterations-- > 0) {
    std::cout << "## Iteration (" << iterations << ") #################" << std::endl;

    // Memory transfer from host to device
    hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);

    roctxMark("ROCTX-MARK: before hipLaunchKernel");
    roctxRangePush("ROCTX-RANGE: hipLaunchKernel");

    roctx_range_id_t roctx_id = roctxRangeStartA("roctx_range with id");

    // Lauching kernel from host
    hipLaunchKernelGGL(
        matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
        dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);

    roctxRangeStop(roctx_id);
    roctxMark("ROCTX-MARK: after hipLaunchKernel");

    // Memory transfer from device to host
    roctxRangePush("ROCTX-RANGE: hipMemcpy");

    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

    roctxRangePop();  // for "hipMemcpy"
    roctxRangePop();  // for "hipLaunchKernel"

    // CPU MatrixTranspose computation
    matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);

    // verify the results
    errors = 0;
    double eps = 1.0E-6;
    for (i = 0; i < NUM; i++) {
      if (std::abs(TransposeMatrix[i] - cpuTransposeMatrix[i]) > eps) {
        errors++;
      }
    }
    if (errors != 0) {
      printf("FAILED: %d errors\n", errors);
    } else {
      printf("PASSED!\n");
    }
  }

  // free the resources on device side
  hipFree(gpuMatrix);
  hipFree(gpuTransposeMatrix);

  // free the resources on host side
  free(Matrix);
  free(TransposeMatrix);
  free(cpuTransposeMatrix);

  return errors;
}

EOF

/opt/rocm/bin/hipcc -c ROCmMatrixTranspose.cpp -o ROCmMatrixTranspose.cpp.o
/opt/rocm/bin/hipcc ROCmMatrixTranspose.cpp.o -o ROCmMatrixTranspose \
    /opt/rocm/lib/libamd_comgr.so.2.8.60200 /usr/lib/x86_64-linux-gnu/libnuma.so /opt/rocm/lib/libroctx64.so	
./ROCmMatrixTranspose

5.The hardware counters are called the basic counters

rocprof --list-basic | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA1_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA1_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA1_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
      block TCC has 4 counters

  gpu-agent2 : TCC_EA1_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
      block TCC has 4 counters

  gpu-agent2 : GRBM_COUNT : Tie High - Count Number of Clocks
      block GRBM has 2 counters

  gpu-agent2 : GRBM_GUI_ACTIVE : The GUI is Active
      block GRBM has 2 counters

  gpu-agent2 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)
      block SQ has 8 counters

  gpu-agent2 : SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic). Units in quad-cycles(4 cycles)
      block SQ has 8 counters

  gpu-agent2 : SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent2 : SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)
      block SQ has 8 counters

  gpu-agent2 : SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)
      block SQ has 8 counters

  gpu-agent2 : TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.
      block TA has 2 counters

  gpu-agent2 : TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.
      block TA has 2 counters

  gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.
      block TA has 2 counters

  gpu-agent2 : TCC_HIT[0-15] : Number of cache hits.
      block TCC has 4 counters

  gpu-agent2 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
      block TCC has 4 counters

  gpu-agent2 : TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
      block TCC has 4 counters

  gpu-agent2 : TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
      block TCC has 4 counters

  gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.
      block TCP has 4 counters

6.The derived metrics are defined on top of the basic counters using mathematical expression

rocprof --list-derived | grep -A 2  "gpu-agent2"

输出

  gpu-agent2 : TCC_EA1_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC EA1s.
      TCC_EA1_RDREQ_32B_sum = sum(TCC_EA1_RDREQ_32B,16)

  gpu-agent2 : TCC_EA1_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC EA1s.
      TCC_EA1_RDREQ_sum = sum(TCC_EA1_RDREQ,16)

  gpu-agent2 : TCC_EA1_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC EA1s.
      TCC_EA1_WRREQ_sum = sum(TCC_EA1_WRREQ,16)

  gpu-agent2 : TCC_EA1_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC EA1s.
      TCC_EA1_WRREQ_64B_sum = sum(TCC_EA1_WRREQ_64B,16)

  gpu-agent2 : TCC_WRREQ1_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
      TCC_WRREQ1_STALL_max = max(TCC_EA1_WRREQ_STALL,16)

  gpu-agent2 : RDATA1_SIZE : The total kilobytes fetched from the video memory. This is measured on EA1s.
      RDATA1_SIZE = (TCC_EA1_RDREQ_32B_sum*32+(TCC_EA1_RDREQ_sum-TCC_EA1_RDREQ_32B_sum)*64)

  gpu-agent2 : WDATA1_SIZE : The total kilobytes written to the video memory. This is measured on EA1s.
      WDATA1_SIZE = ((TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)*32+TCC_EA1_WRREQ_64B_sum*64)

  gpu-agent2 : FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
      FETCH_SIZE = (TCC_EA_RDREQ_32B_sum*32+(TCC_EA_RDREQ_sum-TCC_EA_RDREQ_32B_sum)*64+RDATA1_SIZE)/1024

  gpu-agent2 : WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
      WRITE_SIZE = ((TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)*32+TCC_EA_WRREQ_64B_sum*64+WDATA1_SIZE)/1024

  gpu-agent2 : WRITE_REQ_32B : The total number of 32-byte effective memory writes.
      WRITE_REQ_32B = (TCC_EA_WRREQ_sum-TCC_EA_WRREQ_64B_sum)+(TCC_EA1_WRREQ_sum-TCC_EA1_WRREQ_64B_sum)+(TCC_EA_WRREQ_64B_sum+TCC_EA1_WRREQ_64B_sum)*2

  gpu-agent2 : TA_BUSY_avr : TA block is busy. Average over TA instances.
      TA_BUSY_avr = avr(TA_TA_BUSY,16)

  gpu-agent2 : TA_BUSY_max : TA block is busy. Max over TA instances.
      TA_BUSY_max = max(TA_TA_BUSY,16)

  gpu-agent2 : TA_BUSY_min : TA block is busy. Min over TA instances.
      TA_BUSY_min = min(TA_TA_BUSY,16)

  gpu-agent2 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
      TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)

  gpu-agent2 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
      TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)

  gpu-agent2 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
      TCC_HIT_sum = sum(TCC_HIT,16)

  gpu-agent2 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
      TCC_MISS_sum = sum(TCC_MISS,16)

  gpu-agent2 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
      TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)

  gpu-agent2 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
      TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)

  gpu-agent2 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
      TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)

  gpu-agent2 : TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.
      TCC_EA_WRREQ_64B_sum = sum(TCC_EA_WRREQ_64B,16)

  gpu-agent2 : TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
      TCC_WRREQ_STALL_max = max(TCC_EA_WRREQ_STALL,16)

  gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_sum : Total number of TCP stalls TA data interface.
      TCP_TCP_TA_DATA_STALL_CYCLES_sum = sum(TCP_TCP_TA_DATA_STALL_CYCLES,16)

  gpu-agent2 : TCP_TCP_TA_DATA_STALL_CYCLES_max : Maximum number of TCP stalls TA data interface.
      TCP_TCP_TA_DATA_STALL_CYCLES_max = max(TCP_TCP_TA_DATA_STALL_CYCLES,16)

  gpu-agent2 : VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
      VFetchInsts = (SQ_INSTS_VMEM_RD-TA_FLAT_READ_WAVEFRONTS_sum)/SQ_WAVES

  gpu-agent2 : VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
      VWriteInsts = (SQ_INSTS_VMEM_WR-TA_FLAT_WRITE_WAVEFRONTS_sum)/SQ_WAVES

  gpu-agent2 : FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
      FlatVMemInsts = (SQ_INSTS_FLAT-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVES

  gpu-agent2 : LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control).  Excludes FLAT instructions that read from or write to LDS.
      LDSInsts = (SQ_INSTS_LDS-SQ_INSTS_FLAT_LDS_ONLY)/SQ_WAVES

  gpu-agent2 : FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).
      FlatLDSInsts = SQ_INSTS_FLAT_LDS_ONLY/SQ_WAVES

  gpu-agent2 : VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
      VALUUtilization = 100*SQ_THREAD_CYCLES_VALU/(SQ_ACTIVE_INST_VALU*MAX_WAVE_SIZE)

  gpu-agent2 : VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
      VALUBusy = 100*SQ_ACTIVE_INST_VALU*4/SIMD_NUM/GRBM_GUI_ACTIVE

  gpu-agent2 : SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
      SALUBusy = 100*SQ_INST_CYCLES_SALU*4/SIMD_NUM/GRBM_GUI_ACTIVE

  gpu-agent2 : FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
      FetchSize = FETCH_SIZE

  gpu-agent2 : WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
      WriteSize = WRITE_SIZE

  gpu-agent2 : MemWrites32B : The total number of effective 32B write transactions to the memory
      MemWrites32B = WRITE_REQ_32B

  gpu-agent2 : L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
      L2CacheHit = 100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16))

  gpu-agent2 : MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
      MemUnitStalled = 100*max(TCP_TCP_TA_DATA_STALL_CYCLES,16)/GRBM_GUI_ACTIVE/SE_NUM

  gpu-agent2 : WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
      WriteUnitStalled = 100*TCC_WRREQ_STALL_max/GRBM_GUI_ACTIVE

  gpu-agent2 : LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
      LDSBankConflict = 100*SQ_LDS_BANK_CONFLICT/GRBM_GUI_ACTIVE/CU_NUM

  gpu-agent2 : GPUBusy : The percentage of time GPU was busy.
      GPUBusy = 100*GRBM_GUI_ACTIVE/GRBM_COUNT

  gpu-agent2 : Wavefronts : Total wavefronts.
      Wavefronts = SQ_WAVES

  gpu-agent2 : VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).
      VALUInsts = SQ_INSTS_VALU/SQ_WAVES

  gpu-agent2 : SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).
      SALUInsts = SQ_INSTS_SALU/SQ_WAVES

  gpu-agent2 : SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
      SFetchInsts = SQ_INSTS_SMEM/SQ_WAVES

  gpu-agent2 : GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
      GDSInsts = SQ_INSTS_GDS/SQ_WAVES

  gpu-agent2 : MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
      MemUnitBusy = 100*max(TA_TA_BUSY,16)/GRBM_GUI_ACTIVE/SE_NUM

  gpu-agent2 : ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
      ALUStalledByLDS = 100*SQ_WAIT_INST_LDS*4/SQ_WAVES/GRBM_GUI_ACTIVE

7.Profing

tee input.txt<<-'EOF'
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0 : 1
gpu: 0
kernel:matrixTranspose
EOF

rocprof -i input.txt ./ROCmMatrixTranspose
cat /root/input.csv
rocprofv2 -i input.txt ./ROCmMatrixTranspose
rocprofv2 --hsa-trace ./ROCmMatrixTranspose

输出

RPL: on '240920_102257' from '/opt/rocm-6.2.0' in '/root'
RPL: profiling '"./ROCmMatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_240920_102257_47892'

RPL: result dir '/tmp/rpl_data_240920_102257_47892/input0_results_240920_102257'
ROCProfiler: input from "/tmp/rpl_data_240920_102257_47892/input0.xml"
  gpu_index = 0
  kernel = matrixTranspose
  range = 0:1
  4 metrics
    Wavefronts, VALUInsts, SALUInsts, SFetchInsts
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!

ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240920_102257_47892/input0_results_240920_102257
File '/root/input.csv' is generating
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,SALUInsts,SFetchInsts
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,48178,48178,1048576,16,0,0,8,0,16,64,0x0,0x742031870880,65536.0000000000,14.0000000000,4.0000000000,3.0000000000

ROCProfilerV2: Collecting the following counters:
- Wavefronts
- VALUInsts
- SALUInsts
- SFetchInsts
Enabling Counter Collection
Device name AMD Radeon (TM) Pro VII
## Iteration (0) #################
PASSED!
Dispatch_ID(0), GPU_ID(1), Queue_ID(1), Process_ID(48209), Thread_ID(48209), Grid_Size(1048576), Workgroup_Size(16), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("matrixTranspose(float*, float*, int) (.kd)"), Begin_Timestamp(951172884265490), End_Timestamp(951172884454463), Correlation_ID(0), SALUInsts(4.000000), SFetchInsts(3.000000), VALUInsts(14.000000), Wavefronts(65536.000000)

http://www.kler.cn/a/313101.html

相关文章:

  • HTMLCSS: 实现可爱的冰墩墩
  • 吾店云介绍 – 中国人的WordPress独立站和商城系统平台
  • uniapp分享功能
  • 工业通信协议对比:OPC-UA、Modbus、MQTT、HTTP
  • GitHub 上的开源项目推荐
  • 2024最新版JavaScript逆向爬虫教程-------基础篇之Chrome开发者工具学习
  • 漏洞挖掘 | Selenium Grid 中的 SSRF
  • F28335中断系统
  • React学习笔记(三)——React 组件通讯
  • VUE-CLI配置全局SCSS变量
  • OpenCV_距离变换的图像分割和Watershed算法详解
  • openCV3.0 C++ 学习笔记补充(自用 代码+注释)---持续更新 三(61-)
  • Hexo博客私有部署Twikoo评论系统并迁移评论记录(自定义邮件回复模板)
  • Pandas中df常用方法介绍
  • MATLAB画图,曲线图如何绘制美观,曲线图10种美化方法
  • Python 异常控制详解:try-except 的应用与多种异常处理策略
  • QEMU 运行Win11 成功的例子
  • OpenCVHaar级联器实现人脸捕捉和微笑检测
  • 煤矿智慧矿井数据集 (1.煤矿采掘工作面智能分析数据集2.煤矿井下钻场智能分析数据集 )
  • en造数据结构与算法C# 群组行为优化 和 头鸟控制
  • Flink 中 Checkpoint 的底层原理和机制
  • Java客户端SpringDataRedis(RedisTemplate使用)
  • Neko一个在Docker环境下的虚拟浏览器
  • 大数据-142 - ClickHouse 集群 副本和分片 Distributed 附带案例演示
  • Day69补 前后端分离思想
  • JAVA毕业设计176—基于Java+Springboot+vue3的交通旅游订票管理系统(源代码+数据库)