
OrangePi AI Pro Operator Development (2): Direct Kernel Launch of the Add Operator

1. Preparing the Code

The code here is cloned from the official repository; for the setup, see the previous article, [OrangePi AI Pro Operator Development (1)](https://blog.csdn.net/weixin_44130162/article/details/145488713?spm=1011.2415.3001.5331).

git clone https://gitee.com/ascend/samples
cd samples/operator/ascendc/tutorials/AddCustomSample

2. Running the Code

After entering the directory, the top-level structure is as follows. The sample provides both a framework-based invocation of the Add operator and a direct kernel invocation; this article walks through the kernel-launch path first.

AddCustomSample
|--FrameworkLaunch  
|--KernelLaunch  
|--README.md

Enter the target directory and execute the run.sh script:

cd KernelLaunch/AddKernelInvocationNeo
bash run.sh -r npu -v Ascend310B4

The build output is as follows:

Current compile soc version is Ascend310B4
[INFO]: /usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/custom/ is empty
[INFO]: /usr/local/Ascend/ascend-toolkit/latest/opp/op_proto/custom/ is empty
[INFO]: / is empty
[INFO]: / is empty
[INFO]: /usr/local/Ascend/ascend-toolkit/latest/opp/framework/custom/ is empty
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build
[  2%] Creating directories for 'ascendc_kernels_npu_precompile'
[  5%] No download step for 'ascendc_kernels_npu_precompile'
[  8%] No update step for 'ascendc_kernels_npu_precompile'
[ 11%] No patch step for 'ascendc_kernels_npu_precompile'
[ 13%] Performing configure step for 'ascendc_kernels_npu_precompile'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    DYNAMIC_MODE
    INCLUDE_DIR


-- Build files have been written to: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_precompile-prefix/src/ascendc_kernels_npu_precompile-build
[ 16%] Performing build step for 'ascendc_kernels_npu_precompile'
[100%] Building CXX object CMakeFiles/precompile_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Built target precompile_obj
[100%] Built target check_src_template
[ 19%] No install step for 'ascendc_kernels_npu_precompile'
[ 22%] Completed 'ascendc_kernels_npu_precompile'
[ 22%] Built target ascendc_kernels_npu_precompile
[ 25%] Creating directories for 'ascendc_kernels_npu_preprocess'
[ 27%] No download step for 'ascendc_kernels_npu_preprocess'
[ 30%] No update step for 'ascendc_kernels_npu_preprocess'
[ 33%] No patch step for 'ascendc_kernels_npu_preprocess'
[ 36%] Performing configure step for 'ascendc_kernels_npu_preprocess'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_preprocess-prefix/src/ascendc_kernels_npu_preprocess-build
[ 38%] Performing build step for 'ascendc_kernels_npu_preprocess'
[100%] Building CXX object CMakeFiles/preprocess_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Building CXX object CMakeFiles/m200_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Built target preprocess_obj
[100%] Built target m200_obj
/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin/ld.lld -m aicorelinux -Ttext=0 /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_preprocess-prefix/src/ascendc_kernels_npu_preprocess-build/CMakeFiles/m200_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o -o /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_preprocess-prefix/src/ascendc_kernels_npu_preprocess-build/CMakeFiles/m200_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Built target merge_m200_obj_text
[100%] Built target _host_cpp
[ 41%] No install step for 'ascendc_kernels_npu_preprocess'
[ 44%] Completed 'ascendc_kernels_npu_preprocess'
[ 44%] Built target ascendc_kernels_npu_preprocess
[ 50%] Creating directories for 'ascendc_kernels_npu_host'
[ 50%] Creating directories for 'ascendc_kernels_npu_device'
[ 52%] No download step for 'ascendc_kernels_npu_device'
[ 55%] No download step for 'ascendc_kernels_npu_host'
[ 61%] No update step for 'ascendc_kernels_npu_device'
[ 61%] No update step for 'ascendc_kernels_npu_host'
[ 66%] No patch step for 'ascendc_kernels_npu_device'
[ 66%] No patch step for 'ascendc_kernels_npu_host'
[ 72%] Performing configure step for 'ascendc_kernels_npu_device'
[ 72%] Performing configure step for 'ascendc_kernels_npu_host'
-- The C compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    ASCEND_PYTHON_EXECUTABLE


-- Build files have been written to: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_host-prefix/src/ascendc_kernels_npu_host-build
-- Generating done
-- Build files have been written to: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_device-prefix/src/ascendc_kernels_npu_device-build
[ 75%] Performing build step for 'ascendc_kernels_npu_host'
[ 77%] Performing build step for 'ascendc_kernels_npu_device'
[100%] Building CXX object CMakeFiles/host_bisheng_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Building CXX object CMakeFiles/device_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/auto_gen/ascendc_kernels_npu/auto_gen_add_custom.cpp.o
[100%] Built target host_bisheng_obj
[ 80%] Performing install step for 'ascendc_kernels_npu_host'
Consolidate compiler generated dependencies of target host_bisheng_obj
[100%] Building CXX object CMakeFiles/host_bisheng_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[100%] Built target host_bisheng_obj
Install the project...
-- Install configuration: "Debug"
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_host_dir/./objects-Debug/host_bisheng_obj/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o
[ 83%] Completed 'ascendc_kernels_npu_host'
[ 83%] Built target ascendc_kernels_npu_host
[100%] Built target device_obj
/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin/ld.lld  -m aicorelinux   -Ttext=0 /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_device-prefix/src/ascendc_kernels_npu_device-build/CMakeFiles/device_obj.dir/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/auto_gen/ascendc_kernels_npu/auto_gen_add_custom.cpp.o -static -o /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_merge_obj_dir/device.o
[100%] Built target merge_device_obj
[ 86%] No install step for 'ascendc_kernels_npu_device'
[ 88%] Completed 'ascendc_kernels_npu_device'
[ 88%] Built target ascendc_kernels_npu_device
[ 88%] Built target ascendc_kernels_npu_merge_obj
[ 91%] Building CXX object CMakeFiles/ascendc_kernels_npu_host_stub_obj.dir/auto_gen/ascendc_kernels_npu/host_stub.cpp.o
[ 91%] Built target ascendc_kernels_npu_host_stub_obj
[ 94%] Linking CXX shared library lib/libascendc_kernels_npu.so
/usr/local/Ascend/ascend-toolkit/latest/bin/ascendc_pack_kernel /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/CMakeFiles/ascendc_kernels_npu_host_stub_obj.dir/auto_gen/ascendc_kernels_npu/host_stub.cpp.o /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_merge_obj_dir/device.o 0 /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/CMakeFiles/ascendc_kernels_npu_host_stub_obj.dir/auto_gen/ascendc_kernels_npu/host_stub.cpp.o
recompile: /usr/bin/c++ -fPIC -g -Wl,-z,relro -Wl,-z,now -Wl,-z,noexecstack -shared -Wl,-soname,libascendc_kernels_npu.so -o lib/libascendc_kernels_npu.so CMakeFiles/ascendc_kernels_npu_host_stub_obj.dir/auto_gen/ascendc_kernels_npu/host_stub.cpp.o /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/build/ascendc_kernels_npu_host_dir/objects-Debug/host_bisheng_obj/home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/add_custom.cpp.o    -L/usr/local/Ascend/ascend-toolkit/latest/lib64  -L/usr/local/Ascend/ascend-toolkit/latest/tools/simulator/Ascend310B4/lib  /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendc_runtime.a -lascend_dump -lc_sec
[ 94%] Built target ascendc_kernels_npu
[ 97%] Building CXX object CMakeFiles/ascendc_kernels_bbit.dir/main.cpp.o
[100%] Linking CXX executable ascendc_kernels_bbit
[100%] Built target ascendc_kernels_bbit
-- Install configuration: "Debug"
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/lib/libascendc_kernels_npu.so
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/include
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/include/ascendc_kernels_npu
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/include/ascendc_kernels_npu/aclrtlaunch_add_custom.h
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/include/ascendc_kernels_npu/aclrtlaunch_triple_chevrons_func.h
-- Installing: /home/HwHiAiUser/MyAscend/samples/operator/ascendc/tutorials/AddCustomSample/KernelLaunch/AddKernelInvocationNeo/out/bin/ascendc_kernels_bbit
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:38.953.213 [task_fail_callback_manager.cc:52] 25315 TaskFailCallBackManager: Constructor.
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.907.190 [msprof_callback_impl.cpp:336] >>> (tid:25315) Started to register profiling ctrl callback.
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.907.731 [msprof_callback_impl.cpp:343] >>> (tid:25315) Started to register profiling hash id callback.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.907.907 [prof_atls_plugin.cpp:117] (tid:25315) RegisterProfileCallback, callback type is 7
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.908.040 [msprof_callback_impl.cpp:350] >>> (tid:25315) Started to register profiling enable host freq callback.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.908.157 [prof_atls_plugin.cpp:117] (tid:25315) RegisterProfileCallback, callback type is 8
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:39.949.354 [runtime.cc:5471] 25315 GetVisibleDevices: ASCEND_RT_VISIBLE_DEVICES param was not set
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:40.020.824 [prof_atls_plugin.cpp:210] (tid:25315) Module[7] register callback of ctrl handle.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:40.031.775 [prof_atls_plugin.cpp:210] (tid:25315) Module[69] register callback of ctrl handle.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:40.064.053 [prof_atls_plugin.cpp:210] (tid:25315) Module[48] register callback of ctrl handle.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:40.064.506 [prof_atls_plugin.cpp:210] (tid:25315) Module[45] register callback of ctrl handle.
[INFO] GE(25315,ascendc_kernels_bbit):2025-02-08-11:35:41.364.715 [op_tiling_manager.cc:109]25315 ~FuncPerfScope:[GEPERFTRACE] The time cost of OpTilingManager::LoadSo is [1300008] micro second.
[INFO] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:41.805.336 [prof_atls_plugin.cpp:210] (tid:25315) Module[6] register callback of ctrl handle.
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:43.991.935 [msprof_callback_impl.cpp:89] >>> (tid:25315) MsprofCtrlCallback called, type: 255
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:43.997.124 [ai_drv_dev_api.cpp:333] >>> (tid:25315) Succeeded to DrvGetApiVersion version: 0x72313
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.007.731 [client_manager.cpp:462][GetClientRunMode][tid:25315] runningMode:0
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.007.935 [client_manager.cpp:126][GetInstance][tid:25315] [ClientManager] Current mode:2
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.008.075 [thread_mode_manager.cpp:70][Open][tid:25315] [ThreadModeManager] enter into open process deviceId[0] rankSize[0]
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.009.349 [thread_mode_manager.cpp:280][HandleAICPUPackage][tid:25315] begin load aicpu package dstPath[/home/HwHiAiUser/], srcpath[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/] file[Ascend-aicpu_syskernels.tar.gz]
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.009.567 [package_worker.cpp:338][LoadAICPUPackageForThreadMode][tid:25315] Package checkcode is [57460226]
[WARNING] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.009.740 [package_worker.cpp:342][LoadAICPUPackageForThreadMode][tid:25315] Open aicpu_package_install.info verifyFile[/home/HwHiAiUser/aicpu_package_install.info], strerror[File exists]
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.010.956 [thread_mode_manager.cpp:280][HandleAICPUPackage][tid:25315] begin load aicpu package dstPath[/home/HwHiAiUser/], srcpath[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/] file[Ascend-aicpu_extend_syskernels.tar.gz]
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.011.174 [package_worker.cpp:338][LoadAICPUPackageForThreadMode][tid:25315] Package checkcode is [8052802]
[WARNING] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.011.343 [package_worker.cpp:342][LoadAICPUPackageForThreadMode][tid:25315] Open aicpu_package_install.info verifyFile[/home/HwHiAiUser/extend_aicpu_package_install.info], strerror[File exists]
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.846.755 [thread_mode_manager.cpp:159][SetAICPUProfilingCallback][tid:25315] [ThreadModeManager] profiling callback is nullptr, skip set aicpu profiling callback
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.008 [aicpusd_interface_process.cpp:525][TryGetLogLevelFromParentProcess][tid:25315] get ASCEND_GLOBAL_LOG_LEVEL [] and ASCEND_GLOBAL_EVENT_ENABLE []
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.256 [aicpusd_interface_process.cpp:467][GetCurrentRunMode][tid:25315] Current aicpu mode is offline (call by api).
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.362 [aicpusd_drv_manager.cpp:327][MarkMdc][tid:25315] Get hardware version[7] success.
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.413 [aicpusd_drv_manager.cpp:190][GetNormalAicpuInfo][tid:25315] aicpuBitMap[8], aicpuNum[1].
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.452 [aicpusd_drv_manager.cpp:224][GetCcpuInfo][tid:25315] ccpuBitMap[7].
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.509 [aicpusd_drv_manager.cpp:110][GetNormalAicpuDCpuInfo][tid:25315] GetNormalAicpuDCpuInfo, deviceId[0], aicpu_num[1], aicpu_os_sched[281470681743361], ccpu_num[255082402676739], ccpu_os_sched[255082402676737], dcpu_num[255082402676736], dcpu_os_sched[281470681743361], tscpu_num[187647121162240], tscpu_os_sched[0].
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.547 [aicpusd_drv_manager.cpp:306][InitDrvMgrCaluniqueVfId][tid:25315] InitDrvMgr uniqueVfId=0, deviceId=0
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.576 [aicpusd_drv_manager.cpp:377][InitDrvMgr][tid:25315] host pid[25315], host proc name[], vf id[0], first aicpu index[0], aicpu num[1], dcpu base index[3], dcpu num[0].
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.847.797 [aicpusd_resource_manager.cpp:257][InitBufManager][tid:25315] Aicpu schedule SetBuffCfg successed!
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.848.490 [aicpusd_worker.cpp:74][ThreadPool][tid:25315] ThreadPool
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.848.741 [aicpusd_worker.cpp:274][AddPidToTask][tid:25459] Bind pid by hal.
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.848.788 [aicpusd_worker.cpp:276][AddPidToTask][tid:25459] AddPidToTask by halBindCgroup
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.868.385 [aicpusd_worker.cpp:323][SetAffinityBySelf][tid:25459] [hw]SetAffinityBySelf, physIndex[3], devNum[0]
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.868.484 [aicpusd_worker.cpp:389][SetAffinity][tid:25459] aicpu bind tid by self, index[0], deviceId[0], res[0].
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.868.914 [aicpusd_cust_so_manager.cpp:77][InitAicpuCustSoManager][tid:25315] cust so dir name is /home/HwHiAiUser/cust_aicpu_0_0_25315/.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.874.969 [raw_device.cc:237] 25315 Init: isAddrFlat:0
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.463 [grp_mng.c:81][bufmng] [halGrpCreate 81] Create grp. (grp_name=private_buff_grp_25315; grp_id=1; max_size=0; cache_flag=0)
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.530 [grp_mng.c:144][bufmng] [halGrpAddProc 144] add grp succ, name:private_buff_grp_25315, pid:25315, grp_id:1, admin:1 alloc:1 read:1 write:1 
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.732 [buff_recycle.c:758][bufmng] [procMngInit 758] poolId 1 add task node uid 2 pid 25315
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.776 [drv_buff_mbuf.c:65][bufmng] [mbufSetPrivFlag 65] Set mbuf priv flag sucess. (flag=0, g_mbuf_priv_flag=0)
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.808 [grp_mng.c:212][bufmng] [halGrpAttach 212] grp attach, grp_name:private_buff_grp_25315, grp_id:1, timeout:1000
[EVENT] DRV(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.875.840 [drv_buff_memzone.c:450][bufmng] [memzone_cfg 450] BuffCfg success. (cfg num=4; huge_prior=393216; normal=393216; huge_only=393216; dvpp_huge_prior=393216; dvpp_normal=393216; dvpp_huge_only=393216)
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.878.908 [tsd_client.cpp:172][TsdCapabilityGet][tid:25315] TsdCapabilityGet Begin.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.879.194 [engine.cc:76] 25315 Engine: Constructor.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.879.335 [stars_engine.cc:41] 25315 StarsEngine: Constructor.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.888.658 [npu_driver.cc:5784] 25463 GetDeviceStatus: GetDeviceStatus status=1.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.894.041 [device_error_proc.cc:446] 25315 GetTschCapability: Tsch not support capability feature, use old solution.
[INFO] TDT(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.894.378 [client_manager.cpp:195][SetProfilingCallback][tid:25315] [TsdClient] set profiling callback success
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.929.211 [aicpusd_cust_so_manager.cpp:401][DeleteCustSoDir][tid:25315] Access cust so dir /home/HwHiAiUser/cust_aicpu_0_0_25315/ failed, error is No such file or directory.
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.929.335 [dump_task.cpp:1949][ClearResource][tid:25315] clear all resource of data dump
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.929.376 [aicpusd_interface.cpp:229][StopAICPUScheduler][tid:25315] Success to stop aicpu scheduler, deviceId[0], hostPid[25315].
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.930.066 [stars_engine.cc:48] 25315 ~StarsEngine: Destructor.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.930.287 [engine.cc:82] 25315 ~Engine: Destructor.
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.930.818 [msprof_callback_impl.cpp:89] >>> (tid:25315) MsprofCtrlCallback called, type: 3
[EVENT] PROFILING(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.930.990 [ai_drv_dev_api.cpp:333] >>> (tid:25315) Succeeded to DrvGetApiVersion version: 0x72313
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:44.932.634 [aicpusd_worker.cpp:168][WaitForStop][tid:25315] WaitForStop begin.
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:48.040.414 [aicpusd_worker.cpp:174][WaitForStop][tid:25315] WaitForStop end.
[EVENT] CCECPU(25315,ascendc_kernels_bbit):2025-02-08-11:35:48.054.475 [aicpusd_mpi_mgr.cpp:48][PrintStatisticInfo][tid:25315] Mpi Dvpp event statistic: [0]
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:48.526.473 [task_fail_callback_manager.cc:57] 25315 ~TaskFailCallBackManager: Destructor.
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:48.534.633 [runtime.cc:2033] 25315 ~Runtime: deconstruct runtime
[INFO] RUNTIME(25315,ascendc_kernels_bbit):2025-02-08-11:35:48.535.573 [runtime.cc:2040] 25315 ~Runtime: wait monitor success, use=0.
6c706024470860de357301b17aea751f  output/golden.bin
6c706024470860de357301b17aea751f  output/output_z.bin
error ratio: 0.0000, tolrence: 0.0010
test pass
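
The last lines show matching md5 sums and an element error ratio of 0.0000 against a 0.0010 tolerance. The ratio check can be sketched like this in plain Python (a simplification; the shipped verify script's exact rule may differ):

```python
def error_ratio(golden, actual, rel_tol=1e-3):
    # fraction of elements whose relative error exceeds the tolerance
    bad = sum(1 for g, a in zip(golden, actual)
              if abs(g - a) > rel_tol * max(abs(g), 1.0))
    return bad / len(golden)

golden = [0.75] * 16
assert error_ratio(golden, golden) == 0.0        # identical outputs pass
bad_run = [0.75] * 15 + [1.0]
assert error_ratio(golden, bad_run) == 1 / 16    # one mismatch out of 16
```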

3. Reading the Code

The layout inside the directory is as follows:

AddKernelInvocationNeo/
│-- CMakeLists.txt
│-- README.md
│-- add_custom.cpp						// device-side C++ code for the custom operator
│-- data_utils.h
│-- main.cpp							// host-side C++ code for the custom operator
│-- build/
│-- ascendc_kernels_bbit/        		// Ascend device-side kernel code, involving AI Core computation
│-- cmake/
│-- input/
│-- output/
│-- out/
│-- scripts/
│-- run.sh

The core code here lives in two files, add_custom.cpp and main.cpp. As in the HelloWorld sample, one of them implements the host-side logic and the other the device-side logic. The two files are analyzed line by line below.

3.1 add_custom.cpp

Ascend C comes with a standard programming paradigm, the TPIPE pipeline paradigm. Following it both standardizes the code and helps us write higher-quality kernels that make better use of the hardware.
TPIPE (Task Pipeline) is a task-level pipeline scheduling mechanism provided by Ascend C, the AI Core programming language for Huawei Ascend AI processors (such as the Ascend 310/910). Its purpose is to raise AI Core utilization and computational parallelism, conceptually similar to CUDA streams and pipelines.
The AddCustom vector operator computes z = x + y.
This brings in the concept of Global Memory. Global Memory (GM) is the device's global memory, and its role follows from the AI Core/NPU architecture: it holds all of an operator's inputs and outputs. In contrast, Local Memory is the on-chip memory inside the NPU, which includes the unified buffer and the L0 buffers.

This is where data movement comes in. All the required data sits at fixed addresses in GM; each of several cores computes over its own region of that data, and the regions never overlap. Because the AI Core does not ingest everything at once but processes data in batches, the region owned by each core must be split further into tiles. See also: Ascend C Operator Performance Optimization Tips 01: Pipeline Optimization.
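
As a plain-Python sketch of that partitioning (only to illustrate the address arithmetic the kernel later does with GetBlockIdx(); not Ascend C):

```python
TOTAL_LENGTH = 8 * 2048          # total number of elements in GM
USE_CORE_NUM = 8                 # number of cores launched
BLOCK_LENGTH = TOTAL_LENGTH // USE_CORE_NUM

def core_range(block_idx):
    """Half-open GM element range [start, end) owned by core block_idx."""
    start = BLOCK_LENGTH * block_idx   # mirrors BLOCK_LENGTH * GetBlockIdx()
    return start, start + BLOCK_LENGTH

ranges = [core_range(i) for i in range(USE_CORE_NUM)]
# The per-core ranges tile the whole buffer and never overlap:
assert ranges[0] == (0, 2048)
assert ranges[7] == (14336, 16384)
assert all(ranges[i][1] == ranges[i + 1][0] for i in range(USE_CORE_NUM - 1))
```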


· CopyIn handles the move-in: it copies input data from Global Memory to Local Memory (VECIN denotes the staging position for vector-compute input data) and then enqueues it;

· Compute handles the vector computation: after dequeuing, it fetches the data from Local Memory, computes, and enqueues the result;

· CopyOut handles the move-out: after dequeuing, it copies the result from Local Memory (VECOUT denotes the staging position for vector-compute output data) back to GM.
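
A toy Python model of the three stages and their queues (EnQue/DeQue become plain deque operations here; this is only an analogy, not Ascend C semantics):

```python
from collections import deque

vecin_x, vecin_y, vecout_z = deque(), deque(), deque()

def copy_in(x_tile, y_tile):            # GM -> Local Memory, then EnQue
    vecin_x.append(list(x_tile))
    vecin_y.append(list(y_tile))

def compute():                          # DeQue inputs, add, EnQue result
    x, y = vecin_x.popleft(), vecin_y.popleft()
    vecout_z.append([a + b for a, b in zip(x, y)])

def copy_out():                         # DeQue result, Local Memory -> GM
    return vecout_z.popleft()

copy_in([1.0, 2.0], [3.0, 4.0])
compute()
assert copy_out() == [4.0, 6.0]
```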


The data to be processed is split into n slices, and each parallel task (Stage1, Stage2, Stage3) processes the n slices in turn. Progress1, Progress2, Progress3 denote the slices being processed. For any one slice, Stage1, Stage2, and Stage3 depend on one another and must run serially; across different slices, several tasks can run in parallel at the same moment. That is how task-level parallelism improves performance.
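
The benefit can be made concrete with a back-of-the-envelope timing model (assuming, purely for illustration, that each stage costs one time unit and the three stages run on independent hardware units):

```python
def serial_time(n_slices, n_stages=3):
    # Without pipelining, Stage1/2/3 of every slice run back to back.
    return n_slices * n_stages

def pipelined_time(n_slices, n_stages=3):
    # Classic pipeline: fill the stages once, then one slice completes
    # per time unit.
    return n_stages + (n_slices - 1)

assert serial_time(8) == 24       # 8 slices, fully serial
assert pipelined_time(8) == 10    # same work, stages overlapped
```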

With this in mind, the design of the following constants becomes clear:

constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // num of core used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed of each core
constexpr int32_t TILE_NUM = 8;                                       // split data into 8 tiles for each core
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separate to 2 parts, due to double buffer
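
Plugging in the numbers (an illustrative check that mirrors the constants above):

```python
# Check the tiling arithmetic implied by the constants above.
TOTAL_LENGTH = 8 * 2048   # total number of half elements
USE_CORE_NUM = 8          # cores used (becomes blockDim on the host side)
BLOCK_LENGTH = TOTAL_LENGTH // USE_CORE_NUM
TILE_NUM = 8              # tiles per core
BUFFER_NUM = 2            # double buffering: 2 tensors per queue
TILE_LENGTH = BLOCK_LENGTH // TILE_NUM // BUFFER_NUM

assert BLOCK_LENGTH == 2048   # elements owned by one core
assert TILE_LENGTH == 128     # elements moved per DataCopy
# Process() runs TILE_NUM * BUFFER_NUM iterations of TILE_LENGTH elements,
# which covers the core's block exactly:
assert TILE_NUM * BUFFER_NUM * TILE_LENGTH == BLOCK_LENGTH
```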

For details, see Ascend C Operator Performance Optimization Tips 01: Pipeline Optimization.


/**
 * @file add_custom.cpp
 *
 * Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 */
 
#include "kernel_operator.h"

// Tiling parameters for the operator. In most cases the AI Core's unified buffer cannot hold an operator's entire input and output, so a portion is moved in, computed, and moved out, over and over.
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // num of core used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed of each core
constexpr int32_t TILE_NUM = 8;                                       // split data into 8 tiles for each core
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separate to 2 parts, due to double buffer

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // data initialization
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        //bind global memory: map x, y, z to xGm, yGm, zGm
        //half here is fp16, the 16-bit floating-point type this kernel computes in
        xGm.SetGlobalBuffer((__gm__ half *)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half *)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half *)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        //initialize the double-buffered queues
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }
    //Process(): the compute flow
    __aicore__ inline void Process()
    {
        //each core processes TILE_NUM tiles, each tile split into BUFFER_NUM chunks for double buffering, giving pipelined, parallel execution
        int32_t loopCount = TILE_NUM * BUFFER_NUM;
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
        AscendC::DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        AscendC::DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
        outQueueZ.FreeTensor(zLocal);
    }
    
//TPIPE pipeline members, used to optimize data movement and raise parallelism:
//inQueueX, inQueueY : AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM>  input queues holding x and y tiles
//outQueueZ          : AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> output queue holding z results
//xGm, yGm, zGm      : AscendC::GlobalTensor<half>                             Global Memory (GM) tensors for x, y, z
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};

//kernel entry function
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}
// host-side launch wrapper
#ifndef ASCENDC_CPU_DEBUG
void add_custom_do(uint32_t blockDim, void *stream, uint8_t *x, uint8_t *y, uint8_t *z)
{
    add_custom<<<blockDim, nullptr, stream>>>(x, y, z);
}
#endif
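
To see what one core computes end to end, here is a host-side Python re-enactment of the Process() loop (not Ascend C; fp16 storage is mimicked with the struct module's binary16 format 'e'):

```python
import struct

BLOCK_LENGTH = 2048                        # elements per core
TILE_LENGTH = 128                          # elements per DataCopy
LOOP_COUNT = BLOCK_LENGTH // TILE_LENGTH   # = TILE_NUM * BUFFER_NUM = 16

def to_half(v):
    # round-trip through binary16 to mimic storing the value as half
    return struct.unpack('e', struct.pack('e', v))[0]

def add_custom_one_core(x_block, y_block):
    z_block = []
    for i in range(LOOP_COUNT):                 # Process() loop
        lo = i * TILE_LENGTH
        x_local = x_block[lo:lo + TILE_LENGTH]  # CopyIn
        y_local = y_block[lo:lo + TILE_LENGTH]
        z_local = [to_half(a + b)               # Compute (AscendC::Add)
                   for a, b in zip(x_local, y_local)]
        z_block.extend(z_local)                 # CopyOut
    return z_block

x = [to_half(0.5)] * BLOCK_LENGTH
y = [to_half(0.25)] * BLOCK_LENGTH
assert add_custom_one_core(x, y) == [0.75] * BLOCK_LENGTH
```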

3.2 main.cpp

/**
 * @file main.cpp
 *
 * Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 */
// Two modes are defined here: CPU debug mode and Ascend device mode
#include "data_utils.h"
#ifndef ASCENDC_CPU_DEBUG
#include "acl/acl.h"
extern void add_custom_do(uint32_t blockDim, void *stream, uint8_t *x, uint8_t *y, uint8_t *z);
#else
#include "tikicpulib.h"
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
#endif

int32_t main(int32_t argc, char *argv[])
{
	//define variables
    uint32_t blockDim = 8;
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);

//CPU debug mode
#ifdef ASCENDC_CPU_DEBUG
    uint8_t *x = (uint8_t *)AscendC::GmAlloc(inputByteSize);
    uint8_t *y = (uint8_t *)AscendC::GmAlloc(inputByteSize);
    uint8_t *z = (uint8_t *)AscendC::GmAlloc(outputByteSize);

	//ReadFile comes from data_utils.h
    ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
    ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
	
    AscendC::SetKernelMode(KernelMode::AIV_MODE);
    //run the add_custom computation on the CPU
    ICPU_RUN_KF(add_custom, blockDim, x, y, z); // use this macro for cpu debug
	//write the result to a binary file
    WriteFile("./output/output_z.bin", z, outputByteSize);
	//free the memory
    AscendC::GmFree((void *)x);
    AscendC::GmFree((void *)y);
    AscendC::GmFree((void *)z);

//Ascend device mode
#else
	//initialize the Ascend ACL runtime
    CHECK_ACL(aclInit(nullptr));
    int32_t deviceId = 0;
    //bind to the device ID
    CHECK_ACL(aclrtSetDevice(deviceId));
    aclrtStream stream = nullptr;
    //create a stream for executing asynchronous tasks
    CHECK_ACL(aclrtCreateStream(&stream));
	
	//allocate host- and device-side memory
    uint8_t *xHost, *yHost, *zHost;
    uint8_t *xDevice, *yDevice, *zDevice;

    CHECK_ACL(aclrtMallocHost((void **)(&xHost), inputByteSize));
    CHECK_ACL(aclrtMallocHost((void **)(&yHost), inputByteSize));
    CHECK_ACL(aclrtMallocHost((void **)(&zHost), outputByteSize));
    CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
    CHECK_ACL(aclrtMalloc((void **)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
    CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));

    ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
    ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
	//transfer the input data from host to device
    CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
    CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
	//launch the AI Core computation
    add_custom_do(blockDim, stream, xDevice, yDevice, zDevice);
    CHECK_ACL(aclrtSynchronizeStream(stream));
	//copy the result back, from device to host
    CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
    WriteFile("./output/output_z.bin", zHost, outputByteSize);
	
	//release the memory
    CHECK_ACL(aclrtFree(xDevice));
    CHECK_ACL(aclrtFree(yDevice));
    CHECK_ACL(aclrtFree(zDevice));
    CHECK_ACL(aclrtFreeHost(xHost));
    CHECK_ACL(aclrtFreeHost(yHost));
    CHECK_ACL(aclrtFreeHost(zHost));

    CHECK_ACL(aclrtDestroyStream(stream));
    CHECK_ACL(aclrtResetDevice(deviceId));
    CHECK_ACL(aclFinalize());
#endif
    return 0;
}
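
The input_x.bin, input_y.bin, and golden.bin files are produced by the sample's own scripts/ directory. A minimal stand-in (a hypothetical helper, not the shipped script) that produces the flat fp16 layout ReadFile() expects could look like:

```python
import random
import struct

N = 8 * 2048  # matches TOTAL_LENGTH

def write_half_bin(path, values):
    # flat little-endian binary16 array, len(values) * 2 bytes on disk
    with open(path, 'wb') as f:
        f.write(struct.pack(f'<{len(values)}e', *values))

def read_half_bin(path):
    data = open(path, 'rb').read()
    return list(struct.unpack(f'<{len(data) // 2}e', data))

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(N)]
y = [random.uniform(-1, 1) for _ in range(N)]
write_half_bin('input_x.bin', x)
write_half_bin('input_y.bin', y)
# golden result: re-read the rounded fp16 inputs, add, store as fp16 again,
# so the reference matches what the half-precision kernel computes
gx, gy = read_half_bin('input_x.bin'), read_half_bin('input_y.bin')
write_half_bin('golden.bin', [a + b for a, b in zip(gx, gy)])
```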

