Decoding (generation) phase: each step generates one new token.


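When batch_size * num_heads is small relative to the number of SMs, some SMs sit idle. With --multi_block_mode, it appears that the input context is split further, so the work previously done by one SM is spread across several SMs and all SMs stay busy in parallel.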


"we only use multi-block in generation phase (generating new token). In context phase, we have enough blocks to run in parallel and we don't need to use multi-block."
"take H100-SXM as an example, you have 132 SMs, and let us say the batch size is 1, num heads is 16, then normally we can split the sequence into (132/16 = 8) blocks to fully utilize all SMs, but if the sequence length is quite small like 1K, it might not worth 8 blocks per sequence (maybe fewer)."
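
The arithmetic in the quote can be sketched as follows. This only illustrates the occupancy reasoning and is not TensorRT-LLM's actual heuristic; the function name and the min_tokens_per_block threshold are hypothetical.

```python
def blocks_per_sequence(num_sms, batch_size, num_heads, seq_len,
                        min_tokens_per_block=256):
    """Rough estimate of how many blocks one sequence could be split into."""
    # Assumption from the notes: without multi-block, the generation phase
    # has roughly one unit of work per (batch, head) pair.
    base_blocks = batch_size * num_heads
    # Upper bound: enough splits for every SM to receive a block.
    max_split = max(1, num_sms // base_blocks)
    # Short sequences may not carry enough work to justify that many splits.
    useful_split = max(1, seq_len // min_tokens_per_block)
    return min(max_split, useful_split)


print(blocks_per_sequence(132, 1, 16, 32768))  # long context, small batch
print(blocks_per_sequence(132, 1, 16, 1024))   # short context: fewer splits
print(blocks_per_sequence(132, 8, 32, 32768))  # already more work units than SMs
```

With 132 SMs, batch size 1 and 16 heads, the upper bound is 132 // 16 = 8 splits; a 1K-token sequence gets fewer under the assumed threshold.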



# Build LLaMA v3 70B TP=8 using Meta checkpoints directly.
python convert_checkpoint.py --meta_ckpt_dir ./tmp/llama/70B/ \
                            --output_dir ./tllm_checkpoint_8gpu_tp8 \
                            --dtype float16 \
                            --tp_size 8


# Build LLaMA v3 70B using 4-way tensor parallelism and 2-way pipeline parallelism.
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                            --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
                            --dtype float16 \
                            --tp_size 4 \
                            --pp_size 2
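
To illustrate how 8 GPUs are divided in the TP=4, PP=2 case, the sketch below maps each process rank to a (pipeline stage, tensor-parallel shard) pair. The layout used here, with consecutive ranks sharing a pipeline stage, is an assumption made for illustration; the TensorRT-LLM Mapping class defines the actual rule.

```python
def rank_layout(tp_size, pp_size):
    """Return {rank: (pp_rank, tp_rank)}, assuming TP ranks are contiguous."""
    world_size = tp_size * pp_size  # must equal the number of launched processes
    return {rank: (rank // tp_size, rank % tp_size) for rank in range(world_size)}


layout = rank_layout(tp_size=4, pp_size=2)
for rank, (pp_rank, tp_rank) in layout.items():
    print(f"rank {rank}: pipeline stage {pp_rank}, tensor shard {tp_rank}")
```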


Total memory = (Model size + KV cache size + Activation memory) / Parallelism


  • The model size is the number of parameters * the size of data type.
  • The KV cache size is the total number of tokens * the size of KV cache data type * the number of layers * the KV hidden dimension.
  • The activation memory is determined by the TRT engine and can be a few GBs regardless of the degree of parallelism used.

For LLaMA v2 70B with FP16 weights and an FP8 KV cache, the model size is 70B parameters * 2 bytes = 140GB. The KV cache size is 32K tokens * 1 byte * 80 layers * 2048 KV hidden dimension = 5GB per 32K tokens. That gives 145GB spread across 8 GPUs. The end result is ~18GB per GPU plus a few GBs of flat scratch/activation memory allocated by the TRT engine and the TRT-LLM runtime.

Note that the KV hidden dimension is derived from the number of KV heads times the hidden dimension of each head. LLaMA v2 70B has a hidden dimension of 8192 and uses grouped-query attention, where 8 key heads and 8 value heads are associated with 64 query heads. Each head has a hidden dimension of 8192/64 = 128, so the total hidden dimension for KV is 128 * 8 * 2 = 2048. (The factor of 2 accounts for K and V.)

The total number of tokens is determined by beam width, batch size, and maximum sequence length.
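
The worked example above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the helper names are made up for illustration, the token count is assumed to be the plain product of beam width, batch size and sequence length, and GB and GiB are mixed in the same loose way as in the text.

```python
def kv_hidden_dim(hidden_size, num_q_heads, num_kv_heads):
    head_dim = hidden_size // num_q_heads   # 8192 / 64 = 128
    return head_dim * num_kv_heads * 2      # factor 2: K and V


def total_tokens(beam_width, batch_size, max_seq_len):
    # Assumption: worst case is simply the product of the three.
    return beam_width * batch_size * max_seq_len


def kv_cache_bytes(num_tokens, kv_dtype_bytes, num_layers, kv_dim):
    return num_tokens * kv_dtype_bytes * num_layers * kv_dim


GIB = 1024 ** 3
kv_dim = kv_hidden_dim(8192, 64, 8)
tokens = total_tokens(beam_width=1, batch_size=1, max_seq_len=32 * 1024)
kv_bytes = kv_cache_bytes(tokens, 1, 80, kv_dim)   # FP8 KV cache: 1 byte
weights_gb = 70e9 * 2 / 1e9                        # 70B params in FP16
per_gpu_gb = (weights_gb + kv_bytes / GIB) / 8     # 8 GPUs

print(kv_dim, kv_bytes / GIB, per_gpu_gb)
```

Activation memory is left out because, as noted above, it does not shrink with the degree of parallelism.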

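--use_paged_context_fmha: appears to be tied to KV cache paging (it lets the context-phase fused attention kernel work with a paged KV cache).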


LLaMA 70B does not fit on a single GPU, so use tensor parallelism across 8 GPUs:

git-lfs clone https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k/

python examples/llama/convert_checkpoint.py --model_dir ./Llama-3-70B-Instruct-Gradient-1048k/ \
                              --output_dir /tmp/llama-3-70B-1048k/trt_ckpts \
                              --dtype float16 \
                              --tp_size 8

python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3-70B-1048k/trt_ckpts \
            --output_dir /tmp/llama-3-70B-1048k/trt_engines \
            --gemm_plugin float16 \
            --max_num_tokens 4096 \
            --max_batch_size 1 \
            --max_seq_len 1048576 \
            --use_paged_context_fmha enable \
            --workers 8

mpirun -n 8 --allow-run-as-root python examples/eval_long_context.py  --task passkey \
                                      --engine_dir /tmp/llama-3-70B-1048k/trt_engines \
                                      --tokenizer_dir ./Llama-3-70B-Instruct-Gradient-1048k/ \
                                      --stop_idx 1 \
                                      --max_input_length 1048566 \
                                      --enable_chunked_context \
                                      --max_tokens_in_paged_kv_cache 1100000


At build time, --workers 8 lets the 8 GPUs each build one model partition concurrently, which speeds up the build.

At run time, mpirun -n 8 launches 8 processes, each running one model partition.
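
Applying the memory formula from earlier in these notes to this run gives a rough per-GPU budget. The sketch assumes an FP16 KV cache (2 bytes per value, since no KV cache quantization flag is passed above) and a model shape of 80 layers with 8 KV heads of dimension 128; activation memory is ignored and the parameter count is rounded to 70B.

```python
num_tokens = 1_100_000        # --max_tokens_in_paged_kv_cache
kv_dtype_bytes = 2            # assumption: FP16 KV cache
num_layers = 80
kv_dim = 128 * 8 * 2          # head_dim * KV heads * (K and V)
tp_size = 8

kv_gb = num_tokens * kv_dtype_bytes * num_layers * kv_dim / 1e9
weights_gb = 70e9 * 2 / 1e9   # FP16 weights
per_gpu_gb = (kv_gb + weights_gb) / tp_size

print(f"KV cache total: {kv_gb:.1f} GB")
print(f"Per GPU (weights + KV cache): {per_gpu_gb:.1f} GB")
```

Under these assumptions the KV cache alone is roughly 360 GB, far more than the weights, which is why the KV cache token budget matters so much at a 1M-token context.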

INT8 KV cache and INT8 weight-only quantization can be used together:

# Build model with both INT8 weight-only and INT8 KV cache enabled
python convert_checkpoint.py --model_dir ./llama-models/llama-7b-hf   \
                             --output_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                             --dtype float16  \
                             --int8_kv_cache \
                             --use_weight_only \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
            --output_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_weight_only/1-gpu \
            --gemm_plugin auto

(Open question: at which step is the calibration for the INT8 KV cache performed?)


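Key point: INT8 computation runs on Tensor Cores.

Key point: INT8 * INT8 products are accumulated into INT32 (an INT16 accumulator would overflow once products are summed). The default is per-tensor: the X matrix uses a single scaling factor s_x and the W matrix uses a single scaling factor s_w. Neither value changes during inference.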
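
A minimal pure-Python sketch of per-tensor INT8 quantization follows. It is an illustration, not how TensorRT-LLM implements it: symmetric quantization to [-127, 127] is assumed, the function names are made up, and the scales are computed on the fly here rather than calibrated ahead of time.

```python
def quantize_per_tensor(matrix):
    """Symmetric INT8 quantization with a single scale for the whole matrix."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def int8_matmul(xq, wq, s_x, s_w):
    """Sum INT8 x INT8 products in an integer accumulator, then rescale once."""
    out = []
    for i in range(len(xq)):
        row = []
        for j in range(len(wq[0])):
            # On hardware this accumulator is INT32.
            acc = sum(xq[i][k] * wq[k][j] for k in range(len(wq)))
            row.append(acc * s_x * s_w)
        out.append(row)
    return out


# Three worst-case products already exceed the INT16 maximum of 32767.
print(3 * 127 * 127, 2 ** 15 - 1)

X = [[1.0, -2.0], [0.5, 4.0]]
W = [[0.25, 1.0], [-1.0, 0.5]]
xq, s_x = quantize_per_tensor(X)
wq, s_w = quantize_per_tensor(W)
result = int8_matmul(xq, wq, s_x, s_w)
print(result)  # close to the exact product [[2.25, 0.0], [-3.875, 2.5]]
```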

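Per-channel: each column of the W matrix has its own scaling factor. Per-token: each row of the X matrix has its own scaling factor. The per-channel scaling factor vector is fixed during inference, whereas the per-token scaling factor vector changes during inference because it depends on the current activations.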
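
The difference between the two granularities can be sketched as below. Again this is an illustrative pure-Python example with made-up function names and symmetric quantization assumed, not the TensorRT-LLM kernels.

```python
def per_token_scales(x):
    """One scale per row of X; recomputed for every new batch of activations."""
    return [max(abs(v) for v in row) / 127 or 1.0 for row in x]


def per_channel_scales(w):
    """One scale per column of W; fixed once the weights are known."""
    return [max(abs(row[j]) for row in w) / 127 or 1.0 for j in range(len(w[0]))]


def quantized_matmul(x, w):
    sx, sw = per_token_scales(x), per_channel_scales(w)
    xq = [[round(v / sx[i]) for v in row] for i, row in enumerate(x)]
    wq = [[round(v / sw[j]) for j, v in enumerate(row)] for row in w]
    # Output element (i, j) is rescaled by row scale i times column scale j.
    return [[sum(xq[i][k] * wq[k][j] for k in range(len(w))) * sx[i] * sw[j]
             for j in range(len(w[0]))] for i in range(len(x))]


# Rows of X (tokens) and columns of W (channels) with very different ranges.
X = [[0.01, -0.03], [10.0, 30.0]]
W = [[1.0, 0.01], [-0.4, 0.03]]
out = quantized_matmul(X, W)
print(out)  # exact product: [[0.022, -0.0008], [-2.0, 1.0]]
```

With a single per-tensor scale for X (30/127 here), the small first row would round to all zeros; the per-token scale keeps it representable.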

python3 convert_checkpoint.py --model_dir /llama-models/llama-7b-hf  --output_dir /tmp/tllm_checkpoint_1gpu_sq --dtype float16 --smoothquant 0.5

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_1gpu_sq \
             --output_dir ./engine_outputs \
             --gemm_plugin auto


# Build model for SmoothQuant in the _per_token_ + _per_channel_ mode
python3 convert_checkpoint.py --model_dir /llama-models/llama-7b-hf \
                            --output_dir /tmp/tllm_checkpoint_1gpu_sq \
                            --dtype float16 \
                            --smoothquant 0.5 \
                            --per_token \
                            --per_channel

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_1gpu_sq \
             --output_dir ./engine_outputs \
             --gemm_plugin auto


--use_parallel_embedding: partitions the embedding table across GPUs, reducing the memory footprint on each GPU.
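
A minimal sketch of the idea, assuming the table is split along the vocabulary dimension (one possible sharding choice). The helper names are hypothetical and the cross-GPU all-reduce is simulated with a plain sum.

```python
def shard_embedding(table, tp_size):
    """Split the embedding table by rows (vocabulary) into tp_size shards."""
    rows_per_rank = len(table) // tp_size  # assumes vocab size divisible by tp_size
    return [table[r * rows_per_rank:(r + 1) * rows_per_rank] for r in range(tp_size)]


def lookup_on_rank(shard, rank, token_id, hidden):
    """A rank returns the row if it owns the token, otherwise zeros."""
    start = rank * len(shard)
    if start <= token_id < start + len(shard):
        return shard[token_id - start]
    return [0.0] * hidden


def parallel_lookup(shards, token_id, hidden):
    partial = [lookup_on_rank(s, r, token_id, hidden) for r, s in enumerate(shards)]
    # Stand-in for an all-reduce: element-wise sum of the partial results.
    return [sum(vals) for vals in zip(*partial)]


vocab, hidden, tp_size = 8, 4, 2
table = [[float(t * 10 + h) for h in range(hidden)] for t in range(vocab)]
shards = shard_embedding(table, tp_size)
print(len(shards[0]), "rows per GPU instead of", vocab)
print(parallel_lookup(shards, 5, hidden) == table[5])
```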


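--use_embedding_sharing: not fully understood yet; it appears to let the input embedding lookup and the output lm_head share a single weight table when the model ties the two.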


LoRA: the base model stays unchanged; additionally pass --lora_plugin and --lora_dir at build time.

python convert_checkpoint.py --model_dir Llama-2-13b-hf \
                         --output_dir ./tllm_checkpoint_2gpu \
                         --dtype float16 \
                         --tp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu \
            --output_dir /tmp/new_lora_13b/trt_engines/fp16/2-gpu/ \
            --gemm_plugin auto \
            --lora_plugin auto \
            --max_batch_size 1 \
            --max_input_len 512 \
            --max_seq_len 562 \
            --lora_dir chinese-llama-2-lora-13b


mpirun -n 2 python ../run.py --engine_dir "/tmp/new_lora_13b/trt_engines/fp16/2-gpu/" \
              --max_output_len 50 \
              --tokenizer_dir "chinese-llama-2-lora-13b/" \
              --input_text "今天天气很好,我到公园的时候," \
              --lora_task_uids 0 \
              --no_add_special_tokens
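
What "the base model stays unchanged" means can be sketched numerically: LoRA keeps the original weight W frozen and adds a low-rank path, so the output becomes x W + scale * (x A) B. The shapes, values and the scale parameter below are illustrative assumptions and are not taken from the TensorRT-LLM LoRA plugin.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_forward(x, w, lora_a, lora_b, scale=1.0):
    base = matmul(x, w)                         # frozen base-model projection
    delta = matmul(matmul(x, lora_a), lora_b)   # low-rank path: (x A) B
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2), never modified
lora_a = [[1.0], [1.0]]        # down-projection, 2x1 (rank 1)
lora_b = [[0.5, -0.5]]         # up-projection, 1x2
y = lora_forward(x, w, lora_a, lora_b)
print(y)  # [[2.5, 0.5]]
```

Because the adapter is a separate additive path, the same base engine can in principle serve different adapters, which matches the build flow above where the checkpoint conversion does not involve the LoRA weights.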






