Quantization and Deployment with llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
🖥️ CPU build
cmake -B build_cpu
cmake --build build_cpu --config Release
🖥️ CUDA build (newer llama.cpp releases use -DGGML_CUDA=ON instead of -DLLAMA_CUDA=ON)
cmake -B build_cuda -DLLAMA_CUDA=ON
cmake --build build_cuda --config Release -j 12
To build only the llama-server binary:
cmake -B build
cmake --build build --config Release -t llama-server
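After a build finishes, the executables end up in that build tree's bin directory; a quick check (a sketch, assuming the default binary names of recent llama.cpp releases):
ls build_cuda/bin/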
Quantization
1. Convert the safetensors checkpoint to GGUF (the conversion script lives in the repository root, not in the build directory)
cd ~/code/llama.cpp
python convert-hf-to-gguf.py /mnt/workspace/Qwen2.5-7B-Instruct --outfile /mnt/workspace/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-q8_0-v1.gguf --outtype q8_0
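If the converter complains about missing Python modules, the repository ships a requirements file covering them; a minimal setup sketch, assuming the repository root as the working directory:
pip install -r requirements.txt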
2. (Re)quantize the GGUF file
cd ~/code/llama.cpp/build_cuda/bin
./quantize --allow-requantize /root/autodl-tmp/models/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q8_0-v2_1.gguf /root/autodl-tmp/models/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q4_1-v1.gguf Q4_1
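A quick smoke test of the requantized file before serving it; this is a sketch assuming a recent build where the binaries carry the llama- prefix (llama-quantize, llama-cli), while older builds name them quantize and main. -ngl 99 offloads all layers to the GPU:
./llama-cli -m /root/autodl-tmp/models/Llama3-8B-Chinese-Chat-GGUF/Llama3-8B-Chinese-Chat-q4_1-v1.gguf -p "Hello, please introduce yourself." -n 64 -ngl 99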
Deploying the server:
cd llama.cpp/build/bin
./llama-server -m /mnt/workspace/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-q8_0-v1.gguf --port 8080
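Once the server is up it exposes an OpenAI-compatible HTTP API on the chosen port; a minimal request sketch (the prompt and sampling parameters are placeholders):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Introduce yourself briefly."}
    ],
    "temperature": 0.7
  }'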