


    • (I) Image handling strategies in RAG
    • (II) Using LangChain and loading data
      • (1) Dependencies
      • (2) Data loading
      • (3) Multi-vector retriever
      • (4) Image summaries
      • (5) Add to vectorstore
      • (6) Build the RAG retriever
      • (7) Check
      • (8) Sanity check
      • (9) Run RAG and test
      • (10) Considerations



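(I) Image handling strategies in RAG

  • Option 1:
    • Use multimodal embeddings (such as CLIP) to embed images and text
    • Retrieve both using similarity search
    • Pass raw images and text chunks to a multimodal LLM for answer synthesis
  • Option 2:
    • Use a multimodal LLM (such as GPT-4V or LLaVA) to produce text summaries from images
    • Embed and retrieve the text
    • Pass text chunks to an LLM for answer synthesis
  • Option 3:
    • Use a multimodal LLM (such as GPT-4V or LLaVA) to produce text summaries from images
    • Embed and retrieve the image summaries, keeping a reference to the raw image
    • Pass raw images and text chunks to a multimodal LLM for answer synthesis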


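(II) Using LangChain and loading data

  • Use Unstructured to parse images, text, and tables from documents (PDFs).
  • Use the multi-vector retriever with Chroma to store raw text and images along with their summaries for retrieval.
  • Use GPT-4V for both image summarization (for retrieval) and final answer synthesis from a joint review of images and text (or tables).

Option 2 is suitable when a multimodal LLM cannot be used for answer synthesis (for example, because of cost).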




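(1) Dependencies

In addition to the pip packages below, you also need poppler and tesseract installed on your system (see each project's installation instructions).

! pip install -U langchain langchain-community langchain-openai langchain-text-splitters langchain-experimental openai chromadb # (newest versions required for multi-modal)

! pip install "unstructured[all-docs]" pillow pydantic lxml matplotlib tiktoken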
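Before running the extraction, it can help to confirm that the two system tools are reachable. The following is a minimal sketch, assuming that poppler is detected through its `pdftoppm` utility and tesseract through the `tesseract` binary; the helper name `missing_system_tools` is hypothetical and not part of any library.

```python
import shutil


def missing_system_tools(tools=("pdftoppm", "tesseract")):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which returns None when no executable with that name is found
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("poppler (pdftoppm) and tesseract were found on PATH")
```

If a tool is reported missing, install it with your system package manager before calling Unstructured.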


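(2) Data loading

Partition PDF tables, text, and images

We use Unstructured to partition the PDF.

To skip the Unstructured extraction, use the provided zip file, which contains a subset of the extracted images and the PDF.

If you want to use the provided folder, simply use a PDF loader for the document: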

from langchain_community.document_loaders import PyPDFLoader

# path and fname are the PDF's directory and file name (see "File path" below)
loader = PyPDFLoader(path + fname)
docs = loader.load()
tables = []  # Ignore w/ basic pdf loader
texts = [d.page_content for d in docs]
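from langchain_text_splitters import CharacterTextSplitter
from unstructured.partition.pdf import partition_pdf

# Extract elements from PDF

def extract_pdf_elements(path, fname):
    """
    Extract images, tables, and chunk text from a PDF file.
    path: File path, which is used to dump images (.jpg)
    fname: File name
    """
    # Only `filename` survives in the source; the remaining arguments are
    # assumed values, and their names can differ between Unstructured versions.
    return partition_pdf(
        filename=path + fname,
        extract_images_in_pdf=True,  # assumption: dump embedded images to disk
        infer_table_structure=True,  # assumption: keep table structure
        chunking_strategy="by_title",  # assumption: yields CompositeElement chunks
        max_characters=4000,  # assumption
        image_output_dir_path=path,  # assumption: images are written next to the PDF
    )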
# Categorize elements by type

def categorize_elements(raw_pdf_elements):
    """
    Categorize extracted elements from a PDF into tables and texts.
    raw_pdf_elements: List of unstructured.documents.elements
    """
    tables = []
    texts = []
    for element in raw_pdf_elements:
        if "unstructured.documents.elements.Table" in str(type(element)):
            tables.append(str(element))
        elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
            texts.append(str(element))
    return texts, tables

# File path

fpath = "/Users/rlm/Desktop/cj/"
fname = "cj.pdf"

# Get elements

raw_pdf_elements = extract_pdf_elements(fpath, fname)

# Get text, tables

texts, tables = categorize_elements(raw_pdf_elements)

# Optional: Enforce a specific token size for texts

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000, chunk_overlap=0
)
joined_texts = " ".join(texts)
texts_4k_token = text_splitter.split_text(joined_texts)
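To make the idea of a fixed chunk budget concrete, here is a minimal sketch that greedily packs words into chunks of a maximum size. It assumes whitespace-separated words as a stand-in for tiktoken tokens and does not reproduce the algorithm used by `CharacterTextSplitter`; the function name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size):
    """Pack whitespace-separated words into chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    # Step through the word list in strides of chunk_size (no overlap)
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


sample = "one two three four five six seven"
chunks = chunk_words(sample, chunk_size=3)
print(chunks)
```

A real tokenizer counts sub-word tokens rather than words, so actual chunk boundaries will differ from this sketch.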

(3) Multi-vector retriever

Text and table summaries

We will use GPT-4 to produce table summaries and, optionally, text summaries.

Text summaries are advised when using large chunk sizes (for example, the 4k-token chunks used above).


from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Generate summaries of text elements

def generate_text_summaries(texts, tables, summarize_texts=False):
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. These summaries will be embedded and used to retrieve the raw text or table elements. Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Text summary chain
    model = ChatOpenAI(temperature=0, model="gpt-4")
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts and summarize_texts:
        text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
    elif texts:
        text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

    return text_summaries, table_summaries

# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts_4k_token, tables, summarize_texts=True)

(4) Image summaries

We will use GPT-4V to produce the image summaries.

Following the API documentation:

  • Pass base64-encoded images
import base64
import os

from langchain_core.messages import HumanMessage

def encode_image(image_path):
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def image_summarize(img_base64, prompt):
    """Make image summary"""
    chat = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=1024)
    # Send the prompt and the image together in a single human message
    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content

def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """
    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            # Keep the raw image so it can be returned at retrieval time
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries

# Image summaries
img_base64_list, image_summaries = generate_img_summaries(fpath)
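The base64 encoding and the data URL format used above can be checked in isolation with a round trip. This is a minimal sketch using only the standard library; the helper `to_data_url` and the fake byte string are hypothetical and stand in for a real .jpg file.

```python
import base64

JPEG_MAGIC = b"\xff\xd8\xff"  # the first bytes of every JPEG file


def to_data_url(raw_bytes, mime="image/jpeg"):
    """Encode raw bytes as a base64 data URL, the form passed as image_url."""
    b64 = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


# Not a real image: just the JPEG signature followed by padding bytes
fake_jpeg = JPEG_MAGIC + b"\x00" * 5
url = to_data_url(fake_jpeg)

# Decoding the payload after the comma recovers the original bytes
payload = url.split(",", 1)[1]
decoded = base64.b64decode(payload)
print(url[:30], decoded == fake_jpeg)
```

Because the JPEG signature always encodes to the same leading characters, a base64 string beginning with `/9j/` is a quick visual hint that the payload is a JPEG.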

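(5) Add to vectorstore

Add the raw documents and their summaries to the multi-vector retriever:

  • Store the raw texts, tables, and images in the docstore.
  • Store the text summaries, table summaries, and image summaries in the vectorstore for efficient semantic retrieval.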
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

def create_multi_vector_retriever(
    vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """
    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        # Summaries go to the vectorstore; raw contents go to the docstore
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)

    return retriever

# The vectorstore to use to index the summaries

vectorstore = Chroma(
    collection_name="mm_rag_cj_blog", embedding_function=OpenAIEmbeddings())

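# Create retriever
# texts_4k_token is passed as the raw text so that each entry in
# text_summaries lines up with the chunk it was generated from
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore,
    text_summaries,
    texts_4k_token,
    table_summaries,
    tables,
    image_summaries,
    img_base64_list,
)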
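The storage idea in this step (search over summaries, return the raw content they point to) can be sketched without any external library. The class below is a toy, assuming simple word overlap in place of embedding similarity; `ToyMultiVectorStore` and the sample strings are hypothetical and are not part of LangChain.

```python
import uuid


class ToyMultiVectorStore:
    """Index summaries for search, but return the raw content they point to."""

    def __init__(self):
        self.summaries = []  # (summary, doc_id) pairs; stands in for the vectorstore
        self.docstore = {}   # doc_id -> raw content

    def add(self, summaries, contents):
        if len(summaries) != len(contents):
            raise ValueError("each summary needs exactly one raw content item")
        for summary, content in zip(summaries, contents):
            doc_id = str(uuid.uuid4())
            self.summaries.append((summary, doc_id))
            self.docstore[doc_id] = content

    def search(self, query, k=1):
        query_words = set(query.lower().split())

        def overlap(item):
            # Score = number of query words that also occur in the summary
            return len(query_words & set(item[0].lower().split()))

        ranked = sorted(self.summaries, key=overlap, reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


store = ToyMultiVectorStore()
store.add(
    ["bar chart of revenue growth by company", "table of quarterly headcount"],
    ["<raw image as base64>", "<raw table text>"],
)
result = store.search("revenue growth chart")
print(result)
```

The query matches the first summary, yet the caller receives the raw image placeholder, which is the behavior the multi-vector retriever provides with real embeddings.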


(6) Build the RAG retriever

We need to bin the retrieved docs into the correct parts of the GPT-4V prompt template.

import io
import re

from IPython.display import HTML, display
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from PIL import Image

def plt_img_base64(img_base64):
    """Display base64 encoded string as image"""
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))

def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None

def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False

def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))
    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            doc = resize_base64_image(doc, size=(1300, 600))
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}

def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []
    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)

    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            "You are a financial analyst tasked with providing investment advice.\n"
            "You will be given a mix of text, tables, and image(s), usually of charts or graphs.\n"
            "Use this information to provide investment advice related to the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]

def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain
    """
    # Multi-modal LLM
    model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(split_image_text_types),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain

# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
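The binning step can be exercised on its own, without LangChain or Pillow. The sketch below re-implements the check in a simplified form (JPEG and PNG signatures only, no resizing); the names `is_b64_image` and `split_items` and the sample inputs are hypothetical.

```python
import base64
import re

# Leading bytes of JPEG and PNG files
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def is_b64_image(item):
    """True if item is a base64 string whose decoded bytes start like an image."""
    if re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", item) is None:
        return False
    try:
        header = base64.b64decode(item)[:8]
    except Exception:
        # Malformed base64 (for example, wrong length) is treated as text
        return False
    return header.startswith(SIGNATURES)


def split_items(items):
    """Route each retrieved string into the images or texts bin."""
    images = [item for item in items if is_b64_image(item)]
    texts = [item for item in items if not is_b64_image(item)]
    return {"images": images, "texts": texts}


# Not a real image: the JPEG signature followed by padding bytes
fake_jpeg_b64 = base64.b64encode(b"\xff\xd8\xff" + b"\x00" * 9).decode("utf-8")
binned = split_items([fake_jpeg_b64, "EV / NTM revenue multiple is 10x", "abcd"])
print(len(binned["images"]), binned["texts"])
```

The string "abcd" is valid base64 but does not decode to an image signature, so it lands in the texts bin; this is why the signature check is needed in addition to the pattern check.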

(7) Check


# Check retrieval
query = "Give me company names that are interesting investments based on EV / NTM and NTM rev growth. Consider EV / NTM multiples vs historical?"
docs = retriever_multi_vector_img.invoke(query, limit=6)

# We get 4 docs
len(docs)

# Check retrieval
query = "What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?"
docs = retriever_multi_vector_img.invoke(query, limit=6)

# We get 4 docs
len(docs)

# We get back relevant images
plt_img_base64(docs[0])  # assumes the first result is a base64-encoded image

(8) Sanity check

It is quite reasonable that this image is retrieved for our query, given how similar the query is to this summary.
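The reasoning can be illustrated with a toy similarity computation. This is a minimal sketch assuming cosine similarity as the measure (the metric actually used depends on how the vector store is configured); the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the query and the image summary point in similar directions
query_vec = [1.0, 1.0, 0.0]
summary_vecs = {
    "image summary": [1.0, 0.8, 0.1],
    "unrelated text": [0.0, 0.1, 1.0],
}

ranked = sorted(
    summary_vecs,
    key=lambda name: cosine_similarity(query_vec, summary_vecs[name]),
    reverse=True,
)
print(ranked)
```

The summary whose vector is closest to the query ranks first, and the retriever then returns the raw image stored under that summary's ID.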



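(9) Run RAG and test

Now let's run RAG and test its ability to synthesize an answer to our question.

# Run RAG chain
chain_multimodal_rag.invoke(query)

Here are the traces, where we can see what is passed to the LLM:

  • Question 1: the trace focuses on investment advice
  • Question 2: the trace focuses on table extraction

For question 1, we can see that we pass 3 images along with a text chunk.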

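(10) Considerations

  • Retrieval
    • Retrieval is performed based on similarity to image summaries as well as text chunks.
    • This requires careful consideration, because image retrieval can fail if there are competing text chunks.
    • To mitigate this, I produce larger (4k-token) text chunks and summarize them for retrieval.
  • Image size
    • As expected, the quality of answer synthesis appears to be sensitive to image size.
    • I will run evaluations soon to test this more carefully.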
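When experimenting with image size, one variable worth controlling is distortion: the code above resizes every image to a fixed (1300, 600), which changes the aspect ratio of most images. The sketch below computes a target size that fits inside a bounding box while keeping the aspect ratio. It is an optional variation rather than the method used in this article, and the helper name `fit_within` is hypothetical.

```python
def fit_within(width, height, max_width, max_height):
    """Return (w, h) scaled to fit inside the box while keeping the aspect ratio."""
    if min(width, height, max_width, max_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the tighter of the two constraints; cap at 1.0 so images are never upscaled
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2600, 900, 1300, 600))
print(fit_within(800, 400, 1300, 600))
```

The resulting tuple could be passed as the `size` argument of `resize_base64_image` in place of the fixed value.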


