
Learning wow-rag | Building RAG by Hand

article 2025/3/13 21:56:28

These are my study notes on wow-rag, an open-source project from Datawhale, shared for reference only.
Open-source project: link

Getting a Free LLM API Key

This tutorial uses the Zhipu API. You can register an account through the link below; new users are given 2 million free tokens.

https://open.bigmodel.cn/usercenter/proj-mgmt/apikeys


Installing Dependencies

pip install faiss-cpu scikit-learn scipy 
pip install openai 
pip install python-dotenv

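Building the Client

Before building the client, prepare four things:

First: an api_key, which you apply for on the provider's open platform.
Second: a base_url, which you copy from the provider's open platform.
Third: the name of the provider's chat model.
Fourth: the name of the provider's embedding model.
Among Chinese providers, Zhipu, Yi, Qwen, DeepSeek, and others can be used. Kimi cannot, because Kimi does not offer an embedding model.

This example uses Zhipu. In the project root, create a text file, rename it to .env, and put a single line in it: ZHIPU_API_KEY=your_api_key
This keeps the key out of the source code and makes it easy to read from different scripts. The "dot" in dotenv refers to the leading period, so dotenv simply means .env.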

import os 
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Read the API key, then set the endpoint and model names
api_key=os.getenv('ZHIPU_API_KEY')
base_url="https://open.bigmodel.cn/api/paas/v4"
chat_model="glm-4-flash"
emb_model="embedding-2"

from openai import OpenAI
client = OpenAI(
    api_key = api_key,
    base_url = base_url
)

With this client in place, we can start implementing the various capabilities.
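One easy mistake at this stage is a missing or misnamed .env entry, in which case os.getenv quietly returns None and the failure only shows up later as an authentication error. A minimal sketch of an early check follows; the helper name require_env is hypothetical and not part of wow-rag or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank.

    Hypothetical helper for illustration; call it after load_dotenv().
    """
    value = os.getenv(name)
    if value is None or not value.strip():
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=your_api_key to the .env file"
        )
    # Strip stray whitespace that can sneak in when the key is pasted into .env
    return value.strip()
```

It could replace the plain lookup, for example api_key = require_env('ZHIPU_API_KEY'), so that a missing key fails immediately with a readable message.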

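Preparing the Document

RAG works by first searching a set of documents, feeding the closest matches to the LLM, and having the LLM answer based on what it was given. The document therefore has to be stored as retrievable chunks, which means splitting the article and storing the pieces in a database. We take some paragraphs from "Agent AI: Surveying the Horizons of Multimodal Interaction" and embed them. Because the text is too long, we split it first. Here we use a plain sequential splitter with no optimization at all, cutting the text into chunks of 150 characters each.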

embedding_text = """
Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to re-play, paraphrase, action prediction or synthesize 3D or 2D scenario. Fundamental advances in agent AI help contribute towards these goals and many would benefit from a greater understanding of how to model embodied and empathetic in a simulate reality or a real world. Arguably many of these applications could have positive benefits.

However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines. For example, explicitly communicating to users that content is generated by an AI system and providing the user with controls in order to customize such a system. It is possible the Agent AI could be used to develop new methods to detect manipulative content - partly because it is rich with hallucination performance of large foundation model - and thus help address another real world problem.

For examples, 1) in health topic, ethical deployment of LLM and VLM agents, especially in sensitive domains like healthcare, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns. 2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.

Furthermore, the agent AI focuses on learning collaboration policy in simulation and there is some risk if directly applying the policy to the real world due to the distribution shift. Robust testing and continual safety monitoring mechanisms should be put in place to minimize risks of unpredictable behaviors in real-world scenarios. Our “VideoAnalytica" dataset is collected from the Internet and considering which is not a fully representative source, so we already go through-ed the ethical review and legal process from both Microsoft and University Washington. Be that as it may, we also need to understand biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent level distribution in our dataset is different from other existing datasets. However, there is much more than could be included in a single dataset or workshop. We would argue that there is a need for more approaches or discussion linked to real tasks or topics and that by making these data or system available.

We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multi-modal AI agent. We hope to help more researchers answer these questions together via this paper.

"""

# Set the size of each chunk to 150 characters
chunk_size=150

# Use a list comprehension to split the long text into chunks of chunk_size characters
chunks=[embedding_text[i:i + chunk_size] for i in range(0,len(embedding_text),chunk_size)]

This yields 23 chunks.
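A fixed-size splitter cuts wherever the character count lands, so a sentence can be split across two chunks. A common mitigation, which is not what the code above does, is to let neighboring chunks overlap. The sketch below wraps the same slicing in a function with an optional overlap; the name split_text, the overlap parameter, and the sample string are illustrative assumptions. With overlap=0 it behaves like the list comprehension above.

```python
def split_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters; overlap=0 reproduces
    the plain sequential split.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Each new chunk starts `step` characters after the previous one
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up sample text of 400 characters, used only to show the behavior
sample = "".join(chr(97 + i % 26) for i in range(400))
plain_chunks = split_text(sample, 150)
overlapping_chunks = split_text(sample, 150, overlap=30)
print(len(plain_chunks), len(overlapping_chunks))
```

Overlap increases the number of chunks, and therefore the number of embedding calls, so it is a trade-off rather than a free improvement.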

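Vectorization

Each chunk is converted into a vector; by comparing the cosine similarity between vectors, we can find the chunk closest to the question. We embed each small chunk with the emb_model defined earlier, which gives a 1024-dimensional vector per chunk, and then store the vectors in a vector database for later retrieval.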

from sklearn.preprocessing import normalize
import numpy as np
import faiss

# Initialize an empty list to hold the embedding of each chunk
embeddings=[]

# Iterate over the chunks
for chunk in chunks:
    # Create an embedding for the current chunk through the OpenAI-compatible API
    response=client.embeddings.create(
        model=emb_model,
        input=chunk,
    )
    # Append the embedding to the list
    embeddings.append(response.data[0].embedding)

Each embedding has length 1024.
Next, we normalize the embeddings and store them in a vector database.

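# Normalize the embeddings with sklearn's normalize function
normalize_embeddings = normalize(np.array(embeddings).astype('float32'))

# Get the dimension of the embeddings
d=len(embeddings[0])

# Create a Faiss index for storing and searching the embeddings
index = faiss.IndexFlatIP(d)

# Add the normalized embeddings to the index
index.add(normalize_embeddings)

# Get the total number of vectors in the index
n_vectors=index.ntotal

print("Total vectors in the index:",n_vectors)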

As the output shows, the 23 chunks have become 23 vectors.
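The reason for normalizing before adding to an inner-product index is that, for unit-length vectors, the inner product equals the cosine similarity. IndexFlatIP performs an exact, exhaustive inner-product search, and the sketch below mimics that idea at toy scale in plain Python, without Faiss or NumPy. The helper names (l2_normalize, inner_product, brute_force_search) and the 2-D vectors are illustrative assumptions, not part of wow-rag.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query, vectors, k):
    """Return (score, position) pairs for the k largest inner products."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: three "document" vectors and one query, all normalized first
toy_vectors = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
toy_query = l2_normalize([2.0, 0.1])

top = brute_force_search(toy_query, toy_vectors, k=2)
for score, position in top:
    print(f"position={position} cosine={score:.4f}")
```

Because every vector has unit length, the scores are cosine similarities and cannot exceed 1, up to floating-point error.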

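Retrieval

We can now search the vector database. Embedding the query again requires the emb_model from earlier. The code below implements a function named match_text, which finds the chunks in a collection that are most similar to a given input text; k is the number of similar chunks to return.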

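import numpy as np
from sklearn.preprocessing import normalize
def match_text(input_text,index,chunks,k=2):
    """
    Find the top-k chunks most similar to the input text in the given collection of chunks.

    Args:
        input_text(str): the input text to match.
        index(faiss.Index): the Faiss index to search.
        chunks(list of str): the list of text chunks.
        k(int,optional): number of most similar chunks to return. Defaults to 2.
    Returns:
        str: a formatted string containing the most similar chunks and their similarity scores.
    """
    # Make sure k does not exceed the total number of chunks
    k = min(k,len(chunks))

    # Create an embedding for the input text through the OpenAI-compatible API
    response=client.embeddings.create(
        model=emb_model,
        input=input_text,
    )
    # Get the embedding of the input text
    input_embeding=response.data[0].embedding
    # Normalize the input embedding
    input_embeding=normalize(np.array([input_embeding]).astype('float32'))

    # Search the index for the k vectors most similar to the input embedding
    distances,indices = index.search(input_embeding,k)
    # Initialize a string to accumulate the matching text
    matching_texts=""
    # Iterate over all results
    for i,idx in enumerate(indices[0]):
        # Print the similarity and content of each matching chunk
        print(f"similarity:{distances[0][i]:.4f}\nmatching text:\n{chunks[idx]}\n")
        # Append the similarity and content to the matching-text string
        matching_texts+=f"similarity:{distances[0][i]:.4f}\nmatching text:\n{chunks[idx]}\n"

    # Return the string containing the matching chunks and their similarity scores
    return matching_texts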

input_text = "What are the applications of Agent AI systems ?"

matched_texts = match_text(input_text=input_text, index=index, chunks=chunks, k=2)

The output shows the retrieved chunks and their similarity scores.

Generation

Building the Prompt

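prompt = f"""
Based on the retrieved documents:
{matched_texts}
write an answer to:
{input_text}
Use the original wording of the documents as much as possible. Do not restate the question; start answering directly.
"""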
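If several questions are asked against the same index, it can be convenient to wrap the template in a function. This is a minimal sketch under that assumption; build_prompt is a hypothetical name, and the wording is an English paraphrase of the template above rather than anything taken from wow-rag.

```python
def build_prompt(matched_texts: str, question: str) -> str:
    """Assemble a RAG prompt from retrieved text and the user's question."""
    if not matched_texts.strip():
        # Without retrieved context the model would answer from its own knowledge
        raise ValueError("matched_texts is empty; run retrieval first")
    return (
        "Based on the retrieved documents:\n"
        f"{matched_texts.strip()}\n"
        "write an answer to:\n"
        f"{question.strip()}\n"
        "Use the original wording of the documents as much as possible. "
        "Do not restate the question; start answering directly."
    )


# Made-up retrieval output, in the same format that match_text returns
example_prompt = build_prompt(
    "similarity:0.9000\nmatching text:\nAgent AI systems have many applications.\n",
    "What are the applications of Agent AI systems ?",
)
print(example_prompt)
```

Raising on empty context is a design choice for this sketch: it makes a failed retrieval visible instead of letting the model answer without any document.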

Streaming the AI Reply

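def get_completion_stream(prompt):
    """
    Generate a streamed text reply with the OpenAI-compatible Chat Completions API.

    Args:
        prompt (str): the prompt text to generate a reply for.

    Returns:
        None: the function prints the generated reply directly.
    """
    # Create a chat completion request
    response=client.chat.completions.create(
        model=chat_model,
        messages=[
            {"role":"user","content":prompt},
        ],
        stream=True,
        # stream=True makes the API return the reply as a series of chunks
        # Each chunk carries a small increment of new text
        # Compared with non-streaming mode, the first token arrives sooner
    )
    # Handle the streamed response
    # Proceed only if a response exists
    if response:
        # Iterate over the stream returned by the API
        for chunk in response:
            # Skip chunks that carry no choices
            if not chunk.choices:
                continue
            # Extract the incremental content
            content=chunk.choices[0].delta.content
            # Filter out empty content
            if content:
                # Print the content and flush the output buffer
                print(content,end='',flush=True)
get_completion_stream(prompt)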
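The function above only prints the reply. When the full text is also needed afterwards, for example to log it, the increments can be collected while iterating. The sketch below is an assumption-laden illustration: collect_stream and fake_chunk are hypothetical names, and the fake objects only imitate the chunk.choices[0].delta.content shape used above so that the example runs without calling the API.

```python
from types import SimpleNamespace


def collect_stream(response) -> str:
    """Join the incremental contents of a streamed chat response into one string."""
    parts = []
    for chunk in response:
        # Some chunks may carry no choices; skip them
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        # Skip None or empty increments
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (for this demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# A made-up stream: two text increments, one empty increment, one chunk without choices
fake_response = [
    fake_chunk("Agent "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("AI"),
]
answer = collect_stream(fake_response)
print(answer)
```

With a real request, the object returned by client.chat.completions.create(..., stream=True) would be passed in place of fake_response.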