
Hands-On RAG: Fine-Tuning an LLM-Based Embedding Model, intfloat/e5-mistral-7b-instruct

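  • Hands-On RAG: Embedding models
  • Hands-On RAG: Fine-tuning moka-ai/m3e with DeepSpeed and contrastive learning
  • Hands-On RAG: Fine-tuning a rerank model in practice, bge-reranker-v2-m3
  • Hands-On RAG: Fine-tuning the late-interaction model ColBERT in practice, bge-m3
  • Hands-On RAG: Fine-tuning an LLM-based embedding model, intfloat/e5-mistral-7b-instruct
  • Hands-On RAG: Fine-tuning the LLM-based rerank model bge-reranker-v2-gemma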


1. Environment setup

pip install transformers
pip install open-retrievals
  • Note that the package is installed with pip install open-retrievals but imported with import retrievals
  • Follow the latest updates at https://github.com/LongxingTan/open-retrievals

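2. Using Mistral as an embedding model

Here query_instruction and document_instruction are written directly into the texts (the document instruction is empty, so the passages are passed as-is).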

from retrievals import AutoModelForEmbedding

model_name = 'intfloat/e5-mistral-7b-instruct'
model = AutoModelForEmbedding.from_pretrained(
            model_name,
            pooling_method='last',
            use_fp16=True,
        )

texts = [
'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat', 
'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: summit define', 
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.", 
'Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.'
]

embeds = model.encode(texts, normalize_embeddings=True)
print(embeds)

scores = (embeds[:2] @ embeds[2:].T) * 100
print(scores.tolist())
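
The snippet above relies on three steps: pooling_method='last' takes the hidden state of the last real token as the sentence vector, normalize_embeddings=True scales it to unit length, and the dot product of two unit vectors is then their cosine similarity (multiplied by 100 here). The following is a minimal pure-Python sketch of those steps on toy numbers. It assumes right padding, and the helper names last_token_pool, l2_normalize and dot are illustrative, not part of open-retrievals.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden vector of the last non-padding token (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy per-token hidden states: 3 positions, 2 dimensions; the last position is padding.
hidden_states = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

query_vec = l2_normalize(last_token_pool(hidden_states, attention_mask))
doc_vec = l2_normalize([6.0, 8.0])  # toy document embedding pointing the same way

# Cosine similarity scaled by 100, as in the article's scoring line.
score = dot(query_vec, doc_vec) * 100
print(round(score, 2))
```

Because both toy vectors point in the same direction, the score lands at the top of the scale; real query/passage pairs fall below that.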


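  • The prompt can also be passed as an argument when loading the model, and the query/document distinction made at encode time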
from retrievals import AutoModelForEmbedding

model_name = 'intfloat/e5-mistral-7b-instruct'
model = AutoModelForEmbedding.from_pretrained(
            model_name,
            pooling_method='last',
            use_fp16=True,
            query_instruction='Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ',
            document_instruction='',
        )


query_texts = ['how much protein should a female eat', 'summit define']
document_texts = ["As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.", 'Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.']

query_embeds = model.encode(query_texts, normalize_embeddings=True, is_query=True)
print(query_embeds)

doc_embeds = model.encode(document_texts, normalize_embeddings=True, is_query=False)
print(doc_embeds)

scores = (query_embeds @ doc_embeds.T) * 100
print(scores.tolist())
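
The two variants should produce the same model inputs if the library simply prepends the instruction string to each text. That behaviour is an assumption here, not something confirmed from the open-retrievals source, and add_instruction below is a hypothetical helper written only to illustrate it.

```python
QUERY_INSTRUCTION = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query\nQuery: "
)
DOCUMENT_INSTRUCTION = ""

def add_instruction(texts, is_query):
    """Hypothetical helper: prepend the query or document instruction to each text."""
    prefix = QUERY_INSTRUCTION if is_query else DOCUMENT_INSTRUCTION
    return [prefix + text for text in texts]

queries = add_instruction(
    ["how much protein should a female eat", "summit define"], is_query=True
)
documents = add_instruction(
    ["Definition of summit for English Language Learners."], is_query=False
)

print(queries[1])
print(documents[0])
```

Under this assumption, the strings in queries match the hand-written query texts from the first example, while the documents pass through unchanged.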

3. LoRA fine-tuning of the E5-Mistral embedding model

As usual in this series, the training data is t2-ranking.
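
The training file is a JSONL file, one example per line. The sketch below writes and reads back one toy record. The field names "positive" and "negative" follow the --positive_key and --negative_key values used in the training command; the "query" field name and the list-of-strings layout are assumptions, so check them against your own data file.

```python
import json
import os
import tempfile

# "positive"/"negative" mirror --positive_key/--negative_key;
# "query" and the list layout are assumptions for illustration.
records = [
    {
        "query": "summit define",
        "positive": ["Definition of summit for English Language Learners."],
        "negative": ["As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day."],
    }
]

# Write one JSON object per line to a temporary toy file.
path = os.path.join(tempfile.mkdtemp(), "toy_ranking.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["query"], len(loaded[0]["positive"]), len(loaded[0]["negative"]))
```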

MODEL_NAME="intfloat/e5-mistral-7b-instruct"
TRAIN_DATA="/root/kag101/src/open-retrievals/t2/t2_ranking.jsonl"
OUTPUT_DIR="/root/kag101/src/open-retrievals/t2/ft_out"


torchrun --nproc_per_node 1 \
  -m retrievals.pipelines.embed \
  --output_dir $OUTPUT_DIR \
  --overwrite_output_dir \
  --model_name_or_path $MODEL_NAME \
  --pooling_method last \
  --do_train \
  --data_name_or_path $TRAIN_DATA \
  --positive_key positive \
  --negative_key negative \
  --use_lora True \
  --query_instruction 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ' \
  --document_instruction '' \
  --learning_rate 1e-5 \
  --bf16 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --dataloader_drop_last True \
  --query_max_length 64 \
  --document_max_length 256 \
  --train_group_size 2 \
  --logging_strategy steps \
  --logging_steps 100 \
  --temperature 0.02 \
  --use_inbatch_negative false \
  --save_total_limit 1
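
A few of these flags interact. With --train_group_size 2, each query is paired with one positive and one negative; with --use_inbatch_negative false, the query is contrasted only against its own group; and --temperature 0.02 divides the similarities before the softmax. The sketch below assumes an InfoNCE-style cross-entropy loss over the group, which is the common choice for this setup, though the exact loss used by retrievals.pipelines.embed may differ. The function name group_contrastive_loss is illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def group_contrastive_loss(query_vec, group_vecs, temperature=0.02):
    """Cross-entropy over one query's group; index 0 holds the positive."""
    logits = [dot(query_vec, d) / temperature for d in group_vecs]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy unit vectors: the positive is closer to the query than the negative.
query = [1.0, 0.0]
positive = [0.8, 0.6]
negative = [0.6, 0.8]

good_loss = group_contrastive_loss(query, [positive, negative])
bad_loss = group_contrastive_loss(query, [negative, positive])  # roles swapped
print(good_loss, bad_loss)

# Queries per optimizer step with the flags from the command above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
nproc_per_node = 1
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * nproc_per_node
)
print(effective_batch_size)
```

With a temperature this small, a similarity gap of 0.2 already becomes a logit gap of 10, so the loss is near zero when the positive ranks first and large when it does not.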


The trainer can use multiple GPUs in several ways, and retrievals supports all of them.

# torchrun --nnodes 1 --nproc-per-node 4
# deepspeed --include localhost:0,1,2,3
# CUDA_VISIBLE_DEVICES=1,2,3 python
# accelerate launch --config_file conf_ds.yaml \

accelerate launch \
    --config_file conf_llm.yaml \
    llm_finetune_for_embed.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --train_data  \
    --output_dir output \

4. Evaluation

Performance before fine-tuning (C-MTEB t2-ranking score)

Performance after fine-tuning

After fine-tuning, MAP rises from 0.651 to 0.699 and MRR rises from 0.758 to 0.808.
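
For reference, both metrics can be computed from per-query relevance lists in ranked order. This is a minimal sketch on toy data, not the C-MTEB evaluation code; it assumes binary relevance labels and that every relevant passage appears somewhere in the ranked list.

```python
def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Toy rankings for two queries: 1 marks a relevant passage, 0 an irrelevant one.
rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]

map_score = sum(average_precision(r) for r in rankings) / len(rankings)
mrr_score = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(round(map_score, 3), round(mrr_score, 3))
```

MRR only looks at the first relevant hit, while MAP rewards placing every relevant passage high.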

