
TensorRT-LLM Notes

article 2025/2/21 3:11:51


With in-flight batching enabled, the client side needs to use inflight_batcher_llm_client.py:

python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir ${HF_LLAMA_MODEL}

bad_words: words that must not appear in the output;

stop_words: generation stops once any of these words is produced;
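The difference between the two lists can be illustrated with a toy word-level decoding loop. This is only a sketch of the semantics described above, not TensorRT-LLM's implementation: `generate`, `toy_model`, and the scripted candidates are hypothetical, and a real engine works on token IDs (possibly multi-token sequences) rather than whole words.

```python
from typing import Callable, List, Sequence


def generate(
    next_candidates: Callable[[List[str]], Sequence[str]],
    max_new_tokens: int,
    bad_words: Sequence[str] = (),
    stop_words: Sequence[str] = (),
) -> List[str]:
    """Toy decoding loop illustrating bad_words and stop_words."""
    output: List[str] = []
    for _ in range(max_new_tokens):
        # Candidates are ordered from most to least likely;
        # banned words are filtered out before a word is picked.
        candidates = [w for w in next_candidates(output) if w not in bad_words]
        if not candidates:
            break  # nothing left to emit
        word = candidates[0]
        output.append(word)
        if word in stop_words:
            break  # stop as soon as a stop word has been generated
    return output


def toy_model(prefix: List[str]) -> List[str]:
    """Hypothetical model: a fixed script of ranked candidates per step."""
    script = [
        ["Hello", "Hi"],
        ["damn", "world"],
        ["END", "again"],
        ["unreachable"],
    ]
    return script[len(prefix)] if len(prefix) < len(script) else []


result = generate(toy_model, max_new_tokens=10, bad_words=["damn"], stop_words=["END"])
print(result)
```

In this sketch the banned word is replaced by the next-best candidate, while the stop word ends generation early. Whether the stop word itself is kept in the returned text is a design choice of the sketch.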

Common parameters for building the engine:

--gpt_attention_plugin float16

--gemm_plugin float16

--context_fmha enable

--kv_cache_type paged: Paged KV Cache?
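As a rough sketch, the flags above could be combined into one `trtllm-build` invocation. The checkpoint and output paths are hypothetical placeholders, and the set of accepted flags depends on the installed TensorRT-LLM version, so it is worth checking `trtllm-build --help` first.

```python
import shlex

# Hypothetical paths: replace with your own checkpoint and engine directories.
checkpoint_dir = "./llama_ckpt"
engine_dir = "./llama_engine"

build_cmd = [
    "trtllm-build",
    "--checkpoint_dir", checkpoint_dir,
    "--output_dir", engine_dir,
    # Flags taken from the notes above.
    "--gpt_attention_plugin", "float16",
    "--gemm_plugin", "float16",
    "--context_fmha", "enable",
    "--kv_cache_type", "paged",
]

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(build_cmd))

# To actually run the build (requires TensorRT-LLM to be installed):
# import subprocess
# subprocess.run(build_cmd, check=True)
```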

Best Practices for Tuning the Performance of TensorRT-LLM — tensorrt_llm documentation

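--multiple_profiles: builds several TensorRT optimization profiles into the engine, so that TensorRT has more chances to pick the best-performing kernels for the actual input shapes;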

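1. Enabled by default: --gpt_attention_plugin: updates the KV cache in place, which reduces GPU memory usage and GPU memory copies;

2. Enabled by default: --context_fmha: whether the attention computation uses a fused kernel; short sequences use the vanilla implementation, long sequences use FlashAttention and FlashAttention2 (see the official documentation);

3. Enabled by default: --remove_input_padding: input sequences are no longer padded at the end (my guess is that this exists for in-flight batching?);

4. Enabled by default: --paged_kv_cache: Paged Attention;

5. Enabled by default: in-flight batching; it is turned on automatically when 1, 3, and 4 are all enabled. Sequences in the context phase and sequences in the generation phase are placed in the same batch and computed interleaved?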
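The scheduling idea behind item 5 can be sketched as a toy simulation. This is not TensorRT-LLM's scheduler: `Request`, `run_inflight`, and the request sizes are hypothetical, and the sketch assumes the context pass over the prompt also yields the first output token.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    name: str
    prompt_len: int   # tokens processed in the context phase
    output_len: int   # tokens to produce in total
    generated: int = 0
    context_done: bool = False


def run_inflight(requests: List[Request], max_batch: int) -> List[List[str]]:
    """Return, per iteration, what each active request did in that step."""
    waiting: Deque[Request] = deque(requests)
    active: List[Request] = []
    trace: List[List[str]] = []
    while waiting or active:
        # Admit new requests as soon as a slot is free,
        # instead of waiting for the whole batch to finish.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        step: List[str] = []
        for req in active:
            if not req.context_done:
                # Context phase: the whole prompt is processed in one step.
                req.context_done = True
                req.generated = 1
                step.append(f"{req.name}:ctx({req.prompt_len})")
            else:
                # Generation phase: one new token per step.
                req.generated += 1
                step.append(f"{req.name}:gen")
        trace.append(step)
        # Evict finished requests so their slots are reusable next iteration.
        active = [r for r in active if r.generated < r.output_len]
    return trace


trace = run_inflight(
    [Request("A", 5, 3), Request("B", 2, 1), Request("C", 4, 2)],
    max_batch=2,
)
for i, step in enumerate(trace, start=1):
    print(i, step)
```

In the second iteration of this run, request A is in its generation phase while request C is still in its context phase, which is the mixing of phases within one batch that the note above asks about.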
