
"No-Nonsense" Series: Evaluating LLM Performance

article 2025/3/31 10:04:33

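The previous "No-Nonsense" article, on LLM performance leaderboards, covered how to find the open-source LLM best suited to your business. This time the topic is what comes after fine-tuning: you need an effective evaluation framework to judge whether the fine-tuning actually worked. The usual industry practice is to compare the fine-tuned model against the base model it started from. Today we look at this question from a purely technical angle.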

The following draws extensively on online material. It applies to this topic and equally to comparing the performance of any two models.
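One way to make a comparison against the base model concrete is to score both models on the same test items and check whether the accuracy gain is larger than sampling noise. The sketch below is only an illustration under assumptions: it uses a paired percentile bootstrap (a technique chosen here, not one named in the source), and the function name and toy data are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap(base_correct: List[int], tuned_correct: List[int],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, 2.5th percentile, 97.5th percentile).

    Each list holds 1 (correct) or 0 (wrong) for the same test items, in the same order.
    """
    if len(base_correct) != len(tuned_correct) or not base_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(base_correct)
    # Per-item difference: +1 if only the fine-tuned model is right, -1 if only the base is.
    diffs = [t - b for b, t in zip(base_correct, tuned_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-item results for ten shared test questions.
base_results = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
tuned_results = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
gain, lower, upper = paired_bootstrap(base_results, tuned_results)
print(f"accuracy gain: {gain:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

If the interval includes zero, the test set is probably too small to say that fine-tuning helped on this benchmark.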

Step 1: Define your comparison goals

Before diving into the evaluation, clarify the following key questions:

Tip: Create a simple scoring rubric and weight each criterion by its importance.
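A weighted rubric like the one suggested in the tip can be as small as a dictionary of weights. The following is a minimal sketch; the criteria names, weights, and 1-to-5 scores are all hypothetical placeholders, not values from the source.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-criterion scores (same scale as the inputs)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical criteria and weights; replace them with your own priorities.
WEIGHTS = {"accuracy": 0.5, "latency": 0.2, "safety": 0.3}

# Hypothetical rubric scores on a 1-to-5 scale.
base_model = {"accuracy": 3.0, "latency": 4.0, "safety": 4.0}
fine_tuned = {"accuracy": 4.0, "latency": 4.0, "safety": 3.0}

print("base model:", round(weighted_score(base_model, WEIGHTS), 2))
print("fine-tuned:", round(weighted_score(fine_tuned, WEIGHTS), 2))
```

Dividing by the total weight means the weights do not have to sum to exactly 1, which makes the rubric easier to adjust later.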

Different benchmarks measure different LLM capabilities:

Benchmark	URL
MMLU	https://huggingface.co/datasets/cais/mmlu
HELM	https://github.com/stanford-crfm/helm
BIG-Bench	https://github.com/google/BIG-bench
Winogrande	https://huggingface.co/datasets/allenai/winogrande

Benchmark	URL
GSM8K	https://huggingface.co/datasets/openai/gsm8k
MATH	https://github.com/hendrycks/math
LogiQA	https://huggingface.co/datasets/lucasmccabe/logiqa
AI2 ARC	https://huggingface.co/datasets/allenai/ai2_arc
HellaSwag	https://huggingface.co/datasets/Rowan/hellaswag

Benchmark	URL
HumanEval	https://paperswithcode.com/sota/code-generation-on-humaneval
SWE-Bench	https://www.swebench.com/
APPS	https://arxiv.org/abs/2105.09938
MBPP	https://github.com/google-research/google-research/tree/master/mbpp
DS-1000	https://ds1000-code-gen.github.io/
BigCodeBench	https://github.com/bigcode-project/bigcodebench
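Code benchmarks such as HumanEval are commonly reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. As a hedged illustration, the sketch below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for one problem with n generated samples of which c pass. The function name is my own, and real harnesses average this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples per problem, 3 of them pass.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

When comparing a fine-tuned model with its base model, use the same n, k, and sampling temperature for both, since pass@k depends on all three.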

Benchmark or tool	URL
TruthfulQA	https://github.com/sylinrl/TruthfulQA
FActScore	https://github.com/shmsw25/FActScore
DeepEval	https://github.com/confident-ai/deepeval
Opik	https://github.com/comet-ml/opik
RAGAs	https://github.com/explodinggradients/ragas
Deepchecks	https://github.com/deepchecks/deepchecks
Phoenix	https://github.com/Arize-ai/phoenix
EvalPlus	https://github.com/evalplus/evalplus

Benchmark	URL
Alpaca Eval	https://github.com/tatsu-lab/alpaca_eval
MT-Bench-101	https://github.com/mtbench101/mt-bench-101

Benchmark	URL
Anthropic’s Red Teaming dataset	https://arxiv.org/abs/2209.07858
SafetyBench	https://github.com/thu-coai/SafetyBench

Tip: Focus on benchmarks that match your specific use case rather than trying to test everything.

Ensure a fair comparison by testing under consistent conditions:

Tip: Create a configuration file that records every test parameter so the results can be reproduced.
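A configuration file of this kind can be kept as plain JSON next to the results. The sketch below is one possible layout, not a prescribed one: the `EvalConfig` fields, default values, and model names are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import List


@dataclass(frozen=True)
class EvalConfig:
    """Every setting that could change a score; all field names here are illustrative."""
    model_name: str
    tasks: List[str]
    temperature: float = 0.0
    top_p: float = 1.0
    max_new_tokens: int = 512
    num_fewshot: int = 0
    seed: int = 42
    prompt_template: str = "{question}"


def config_to_json(config: EvalConfig) -> str:
    """Serialize with sorted keys so two configs can be diffed line by line."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


def config_from_json(text: str) -> EvalConfig:
    return EvalConfig(**json.loads(text))


def save_config(config: EvalConfig, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(config_to_json(config))


# Hypothetical model names. The fine-tuned run reuses every other setting unchanged.
base_cfg = EvalConfig(model_name="base-model", tasks=["gsm8k", "mmlu"])
tuned_cfg = replace(base_cfg, model_name="fine-tuned-model")
print(config_to_json(tuned_cfg))
```

Deriving the second configuration with `replace` makes it harder to accidentally change temperature, seed, or prompt template between the two runs.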

Several frameworks can help you automate and standardize the evaluation process:

Framework	Best for	Installation
LangChain Evaluation	Workflow testing	pip install langchain
EleutherAI LM Evaluation Harness	Academic benchmarks	pip install lm-eval
DeepEval	Unit testing	pip install deepeval
Promptfoo	Prompt comparison	npm install -g promptfoo
TruLens	Feedback analysis	pip install trulens-eval
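As an example of automating a base-versus-fine-tuned run with one of these tools, the sketch below builds identical EleutherAI LM Evaluation Harness command lines for two models. It only constructs and prints the commands. The flag names follow the `lm_eval` CLI as I understand it for the 0.4 releases, so check `lm_eval --help` for your installed version. The model paths and output directories are hypothetical.

```python
import shlex
from typing import List


def build_lm_eval_command(model_path: str, tasks: List[str], output_dir: str,
                          num_fewshot: int = 0, batch_size: int = 8) -> List[str]:
    """Build an lm_eval argument list for a Hugging Face model (flags assumed from lm-eval 0.4)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_dir,
    ]


# Hypothetical model locations; both runs share tasks and few-shot settings.
TASKS = ["gsm8k", "hellaswag"]
base_cmd = build_lm_eval_command("my-org/base-model", TASKS, "results/base", num_fewshot=5)
tuned_cmd = build_lm_eval_command("./checkpoints/fine-tuned", TASKS, "results/tuned", num_fewshot=5)

for cmd in (base_cmd, tuned_cmd):
    # To actually run an evaluation: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Generating both commands from one function keeps the two runs identical except for the model and the output location.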