Unsloth LLM Fine-Tuning Tool and the llama.cpp Quantized Inference Library: Introduction and Pretraining Workflow
Introduction to the Unsloth LLM fine-tuning tool
Unsloth is an efficient tool for fine-tuning large pretrained models such as Llama. It focuses on speeding up model quantization, format conversion, and fine-tuning, integrating several optimization techniques to simplify an otherwise complex deep-learning workflow while improving efficiency.
Project: https://github.com/unslothai/unsloth
The main features, typical use cases, and common problems Unsloth addresses are summarized below.
Key features
- Multiple quantization options. Unsloth can convert a model from full floating-point precision (FP32) down to lower-precision formats (such as BF16, or 4-bit as used by QLoRA) to reduce memory usage and speed up inference and training. Quantization is a key step in deploying large models; the formats Unsloth exposes include q4_k_m, a 4-bit scheme that shrinks model size and compute cost while keeping accuracy reasonably high, and bf16, a 16-bit floating-point format mainly used during training for better performance. A usage sketch follows this list.
- Faster conversion. Unsloth provides a fast conversion path for large models, for example from the Hugging Face format to GGUF, significantly reducing conversion time by combining GPU acceleration with optimized conversion tooling.
- Support for Llama-class models. Unsloth targets Llama and similar pretrained LLMs, especially those that demand heavy compute and long training runs, and provides an efficient framework for fine-tuning and quantizing large models on limited hardware.
- Integration with llama.cpp. Unsloth integrates closely with llama.cpp, a C/C++ inference engine for Llama models created by Georgi Gerganov that is highly efficient, especially for CPU execution. Unsloth uses it to accelerate inference and GGUF quantization.
Use cases
- Fast quantization and conversion. Fine-tuning workflows often switch to lower-precision representations to cut memory usage. Unsloth helps developers quickly convert large models to lower precision (such as BF16 or 4-bit) to speed up inference and lower hardware requirements.
- Accelerated fine-tuning. Unsloth streamlines the fine-tuning process so developers can quickly adapt a pretrained model to their own dataset, with GPU-accelerated kernels that make training large models more efficient.
- Large-scale parallel training. Unsloth supports multi-GPU training jobs for large datasets and models, maximizing hardware utilization through an optimized parallel compute framework.
The llama.cpp quantized inference library
llama.cpp is an inference library for Meta's LLaMA models (and many others) written in plain C/C++. It aims to enable local and cloud LLM inference on a wide range of hardware with minimal setup and state-of-the-art performance.
Project: https://github.com/ggerganov/llama.cpp
Key points of the project:
Main features
- Plain C/C++ implementation with no external dependencies
- Apple silicon support: optimized via the ARM NEON, Accelerate and Metal frameworks
- x86 support: AVX, AVX2, AVX512 and AMX
- Integer quantization: 1.5-bit to 8-bit integer quantization for faster inference and lower memory use
- CUDA kernels: custom CUDA kernels for NVIDIA GPUs (AMD GPUs via HIP, Moore Threads MTT GPUs via MUSA)
- Multiple backends: including Vulkan and SYCL
- CPU+GPU hybrid inference: partial acceleration for models larger than the available VRAM
Supported models
llama.cpp supports many models, including but not limited to:
- LLaMA 系列
- Mistral 7B
- Chinese LLaMA / Alpaca
- BERT
- GPT-2
- Flan T5
- RWKV-6
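To show how such a quantized GGUF model is consumed from Python, the sketch below uses the separate llama-cpp-python bindings (installed with `pip install llama-cpp-python`) rather than the C++ CLI; the model path and parameter values are illustrative assumptions.

```python
# Illustrative sketch: run a GGUF model via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path = "gguf_model/model-q4_k_m.gguf",  # hypothetical path to a quantized GGUF
    n_ctx = 4096,        # context window size
    n_gpu_layers = 35,   # offload some layers to the GPU (CPU+GPU hybrid inference)
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens = 64)
print(out["choices"][0]["text"])
```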
Appendix: full project details (from the Unsloth README):
Finetune Llama 3.3, Mistral, Phi-4, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory!
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click “Run All”, and you’ll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.
Unsloth supports | Free Notebooks | Performance | Memory use |
---|---|---|---|
Llama 3.2 (3B) | ▶️ Start for free | 2x faster | 60% less |
Phi-4 | ▶️ Start for free | 2x faster | 50% less |
Llama 3.2 Vision (11B) | ▶️ Start for free | 2x faster | 40% less |
Llama 3.1 (8B) | ▶️ Start for free | 2x faster | 60% less |
Gemma 2 (9B) | ▶️ Start for free | 2x faster | 63% less |
Qwen 2.5 (7B) | ▶️ Start for free | 2x faster | 63% less |
Mistral v0.3 (7B) | ▶️ Start for free | 2.2x faster | 73% less |
Ollama | ▶️ Start for free | 1.9x faster | 43% less |
ORPO | ▶️ Start for free | 1.9x faster | 43% less |
DPO Zephyr | ▶️ Start for free | 1.9x faster | 43% less |
- See all our notebooks and all our models
- Kaggle Notebooks for Llama 3.2 Kaggle notebook, Llama 3.1 (8B), Gemma 2 (9B), Mistral (7B)
- Run notebooks for Llama 3.2 conversational, Llama 3.1 conversational and Mistral v0.3 ChatML
- This text completion notebook is for continued pretraining / raw text (a rough code sketch follows this list)
- This continued pretraining notebook is for learning another language
- Click here for detailed documentation for Unsloth.
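The continued-pretraining notebooks above train on raw text rather than instruction pairs. As a rough, hedged sketch of that workflow using the same Unsloth + TRL APIs shown later on this page (the corpus path and hyperparameters are placeholders, not the notebooks' exact settings; the real notebooks additionally tune the embedding and LM-head layers):

```python
# Rough sketch of continued pretraining on a raw-text corpus with Unsloth.
# "corpus.txt" is a placeholder file with one passage of raw text per line.
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model, r = 16, lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)

# Any plain-text file works; load_dataset("text", ...) yields a "text" column.
dataset = load_dataset("text", data_files = {"train": "corpus.txt"}, split = "train")

trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = dataset, dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2, gradient_accumulation_steps = 4,
        max_steps = 100, logging_steps = 1, output_dir = "outputs",
        fp16 = not is_bfloat16_supported(), bf16 = is_bfloat16_supported(),
    ),
)
trainer.train()
```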
🦥 Unsloth.ai News
- 📣 NEW! Phi-4 by Microsoft is now supported. We also fixed bugs in Phi-4 and uploaded GGUFs, 4-bit. Try the Phi-4 Colab notebook
- 📣 NEW! Llama 3.3 (70B), Meta’s latest model is supported.
- 📣 NEW! We worked with Apple to add Cut Cross Entropy. Unsloth now supports 89K context for Meta’s Llama 3.3 (70B) on a 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support.
- 📣 NEW! Introducing Unsloth Dynamic 4-bit Quantization! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on Hugging Face here.
- 📣 NEW! Vision models now supported! Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B) 2409
- 📣 NEW! Qwen-2.5 including Coder models are now supported with bugfixes. 14b fits in a Colab GPU! Qwen 2.5 conversational notebook
- 📣 NEW! We found and helped fix a gradient accumulation bug! Please update Unsloth and transformers.
- 📣 Try out Chat interface!
- 📣 NEW! Mistral Small 22b notebook finetuning fits in under 16GB of VRAM!
- 📣 NEW! Llama 3.1 8b, 70b & Mistral Nemo-12b both Base and Instruct are now supported
- 📣 NEW! `pip install unsloth` now works! Head over to pypi to check it out! This allows non git pull installs. Use `pip install unsloth[colab-new]` for non-dependency installs.
- 📣 NEW! Continued Pretraining notebook for other languages like Korean!
- 📣 2x faster inference added for all our models
- 📣 We cut memory usage by a further 30% and now support 4x longer context windows!
🔗 Links and Resources
Type | Links |
---|---|
📚 Documentation & Wiki | Read Our Docs |
Twitter (aka X) | Follow us on X |
💾 Installation | unsloth/README.md |
🥇 Benchmarking | Performance Tables |
🌐 Released Models | Unsloth Releases |
✍️ Blog | Read our Blogs |
Reddit | Join our Reddit page |
⭐ Key Features
- All kernels written in OpenAI’s Triton language. Manual backprop engine.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU! GTX 1070 and 1080 work, but are slow.
- Works on Linux and Windows via WSL.
- Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
- Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
- If you trained a model with 🦥Unsloth, you can use this cool sticker!
🥇 Performance Benchmarking
- For our most detailed benchmarks, read our Llama 3.3 Blog.
- Benchmarking of Unsloth was also conducted by 🤗Hugging Face.
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
---|---|---|---|---|---|
Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
💾 Installation Instructions
For stable releases, use `pip install unsloth`. We recommend `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` for most installations though.
Conda Installation
⚠️ Only use Conda if you have it. If not, use Pip. Select either `pytorch-cuda=11.8,12.1` for CUDA 11.8 or CUDA 12.1. We support `python=3.10,3.11,3.12`.
conda create --name unsloth_env \
python=3.11 \
pytorch-cuda=12.1 \
pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
-y
conda activate unsloth_env
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
If you're looking to install Conda in a Linux environment,
read here, or run the below 🔽
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Pip Installation
⚠️ Do **NOT** use this if you have Conda.
Pip is a bit more complex since there are dependency issues. The pip command is different for `torch 2.2,2.3,2.4,2.5` and CUDA versions.
For other torch versions, we support `torch211`, `torch212`, `torch220`, `torch230` and `torch240`, and for CUDA versions we support `cu118`, `cu121` and `cu124`. For Ampere devices (A100, H100, RTX 3090) and above, use `cu118-ampere`, `cu121-ampere` or `cu124-ampere`.
For example, if you have `torch 2.4` and CUDA 12.1, use:
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have `torch 2.5` and CUDA 12.4, use:
pip install --upgrade pip
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
And other examples:
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
Or, run the below in a terminal to get the optimal pip installation command:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
Or, run the below manually in a Python REPL:
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
v = V(torch.__version__)
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.6.0'): x = 'cu{}{}-torch250'
else: raise RuntimeError(f"Torch = {v} too new!")
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')
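Whichever install path you used, a quick sanity check along these lines (a hedged sketch; it assumes an NVIDIA GPU is visible to torch) confirms that torch, CUDA, and Unsloth all load:

```python
# Post-install sanity check: verify torch sees CUDA and that unsloth imports cleanly.
import torch
print("torch:", torch.__version__,
      "| cuda:", torch.version.cuda,
      "| gpu available:", torch.cuda.is_available())

from unsloth import FastLanguageModel  # should import without errors if the install worked
```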
Windows Installation
To run Unsloth directly on Windows:
- Install Triton from this Windows fork and follow the instructions: https://github.com/woct0rdho/triton-windows
- In the SFTTrainer, set `dataset_num_proc=1` to avoid a crashing issue:
trainer = SFTTrainer(
dataset_num_proc=1,
...
)
For advanced installation instructions or if you see weird errors during installations:
- Install `torch` and `triton`. Go to https://pytorch.org to install them, for example `pip install torch torchvision torchaudio triton`.
- Confirm that CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or the CUDA drivers.
- Install `xformers` manually. You can try installing `vllm` and seeing if `vllm` succeeds. Check whether `xformers` succeeded with `python -m xformers.info`. Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs.
- Finally, install `bitsandbytes` and check it with `python -m bitsandbytes`.
📜 Documentation
- Go to our official Documentation for saving to GGUF, checkpointing, evaluation and more!
- We support Huggingface’s TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
- We’re in 🤗Hugging Face’s official docs! Check out the SFT docs and DPO docs!
- If you want to download models from the ModelScope community, please set the environment variable `UNSLOTH_USE_MODELSCOPE=1` and install the modelscope library with `pip install modelscope -U`.
unsloth_cli.py also supports `UNSLOTH_USE_MODELSCOPE=1` for downloading models and datasets. Please remember to use the model and dataset ids from the ModelScope community.
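A minimal sketch of the ModelScope path (the variable is set before importing unsloth to be safe; the model id shown is only illustrative):

```python
# Sketch: enable ModelScope downloads for Unsloth. Requires `pip install modelscope -U`.
import os
os.environ["UNSLOTH_USE_MODELSCOPE"] = "1"   # set before importing unsloth

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",  # replace with a ModelScope model id
    max_seq_length = 2048,
    load_in_4bit = True,
)
```

The full supervised fine-tuning example from the README follows.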
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3-8b-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
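Building on the wiki tips listed in the comments above, here is a hedged sketch of what inference and saving can look like once `trainer.train()` finishes; the prompt and output paths are examples, and `save_pretrained_merged` / `save_pretrained_gguf` are the helpers Unsloth documents for vLLM and GGUF export.

```python
# Hedged follow-up sketch: fast inference and exporting after training (paths are examples).
FastLanguageModel.for_inference(model)   # switch the model into Unsloth's faster inference mode

inputs = tokenizer("Continue this story: once upon a time", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

# Merge LoRA weights to 16-bit for vLLM, or export a q4_k_m GGUF for llama.cpp / Ollama.
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")
```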
DPO Support
DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.
We’re in 🤗Hugging Face’s official docs! We’re on the SFT docs and the DPO docs!
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer
max_seq_length = 2048 # used by the model and LoRA config below; choose to fit your data
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/zephyr-sft-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 64,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
)
dpo_trainer = DPOTrainer(
model = model,
ref_model = None,
args = TrainingArguments(
per_device_train_batch_size = 4,
gradient_accumulation_steps = 8,
warmup_ratio = 0.1,
num_train_epochs = 3,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
seed = 42,
output_dir = "outputs",
),
beta = 0.1,
train_dataset = YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length = 1024,
max_prompt_length = 512,
)
dpo_trainer.train()
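Note that `YOUR_DATASET_HERE` above is a placeholder: TRL's `DPOTrainer` expects a preference dataset with `prompt`, `chosen`, and `rejected` columns. A toy illustration of that shape:

```python
# Toy illustration of the preference-data format expected by DPOTrainer.
from datasets import Dataset

dpo_dataset = Dataset.from_dict({
    "prompt":   ["What is the capital of France?"],
    "chosen":   ["The capital of France is Paris."],
    "rejected": ["France does not have a capital city."],
})
# Pass dpo_dataset as train_dataset (and optionally eval_dataset) in DPOTrainer above.
```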
🥇 Detailed Benchmarking Tables
Context length benchmarks
Llama 3.1 (8B) max. context length
We tested Llama 3.1 (8B) Instruct and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.
GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
---|---|---|
8 GB | 2,972 | OOM |
12 GB | 21,848 | 932 |
16 GB | 40,724 | 2,551 |
24 GB | 78,475 | 5,789 |
40 GB | 153,977 | 12,264 |
48 GB | 191,728 | 15,502 |
80 GB | 342,733 | 28,454 |
Llama 3.3 (70B) max. context length
We tested Llama 3.3 (70B) Instruct on a 80GB A100 and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.
GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
---|---|---|
48 GB | 12,106 | OOM |
80 GB | 89,389 | 6,916 |
Citation
You can cite the Unsloth repo as follows:
@software{unsloth,
author = {Daniel Han and Michael Han and Unsloth team},
title = {Unsloth},
url = {http://github.com/unslothai/unsloth},
year = {2023}
}
Thank You to
- Erik for his help adding Apple’s ML Cross Entropy in Unsloth
- HuyNguyen-hust for making RoPE Embeddings 28% faster
- RandomInternetPreson for confirming WSL support
- 152334H for experimental DPO support
- atgctg for syntax highlighting
Reference: environment setup
https://pytorch.org/get-started/previous-versions/
https://mirrors.tuna.tsinghua.edu.cn/help/pypi/
https://blog.csdn.net/Irving_zhang/article/details/79087569
https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
https://blog.csdn.net/qq_40277409/article/details/138566289
Reference: bug-fix tutorials
https://github.com/ggerganov/llama.cpp/issues/1344
https://github.com/ggerganov/llama.cpp/issues/8107
https://github.com/unslothai/unsloth/issues/748
https://www.cnblogs.com/scarecrow-blog/p/17875042.html
https://blog.csdn.net/qq_44297664/article/details/138128873
https://blog.csdn.net/qq_41185868/article/details/138551800
Reference: model and tooling tutorials
https://www.bilibili.com/opus/930602495462342692
https://docs.loopin.network/tutorials/LLM/llama3-finetune
https://blog.csdn.net/qq_45689158/article/details/138800033
https://blog.csdn.net/shebao3333/article/details/142734959
https://huggingface.co/datasets/kigner/ruozhiba-llama3-tt
https://news.miracleplus.com/share_link/24020
https://huggingface.co/unsloth/llama-3-8b-bnb-4bit/tree/main
https://medium.com/@kushalvala/fine-tuning-large-language-models-with-unsloth-380216a76108
https://blog.csdn.net/u012856866/article/details/140955316
http://www.hubwiz.com/blog/unsloth-concise-tutorial-on-llm-fine-tuning/
https://www.skycaiji.com/aigc/ai22501.html
Unsloth + Llama 3 local fine-tuning guide | train a personalized model safely and quickly: https://studywithlarry.com/unsloth-llama-3/