微软开源GraphRAG的使用教程(最全,非常详细)
GraphRAG的介绍
目前微软已经开源了GraphRAG的完整项目代码。对于某一些LLM的下游任务则可以使用GraphRAG去增强自己业务的RAG的表现。项目给出了两种使用方式:
- 在打包好的项目状态下运行,可进行尝试使用。
- 在源码基础上运行,适合为了下游任务的微调时使用。
如果需要利用Ollama部署本地大模型的可以参考我的另一篇博客
以下在通过自身的实践之后的给出对这两种方式的使用教程,如果还有什么问题在评论区交流。
一、在源码基础上运行(便于后续修改)
1. 准备环境(在终端运行)
(1)创建虚拟环境(已安装好anaconda),此处建议使用python3.11:
conda create -n GraphRAG python=3.11
conda activate GraphRAG
2. 下载源码并进入目录
git clone https://github.com/microsoft/graphrag.git
cd graphrag
3. 下载依赖并初始化项目
(1)安装poetry资源包管理工具及相关依赖:
pip install poetry
poetry install
(2)初始化
poetry run poe index --init --root .
正确运行后,此处会在graphrag目录下生成output、prompts、.env、settings.yaml文件
4. 下载并将待检索的文档document放入./input/目录下
mkdir ./input
curl https://www.xxx.com/xxx.txt > ./input/book.txt #示例,可以替换为任何的txt文件
5.修改相关配置文件
(1)修改.env文件(默认是隐藏的)中的api_key
vi .env #进入.env文件,并修改为自己的api_key
修改后是全局配置,后续不需要再次修改了
(2)修改settings.yaml文件,修改其中的使用的llm模型和对应的api_base
提前说明,因为GraphRAG需要多次调用大模型和Embedding,默认使用的是openai的GPT-4,花费及其昂贵(
土豪当我没说,配置也不需要改),建议大家可以使用其他模型或国产大模型的api
我这里使用的是agicto提供的APIkey(主要是新用户注册可以免费获取到10块钱的调用额度,白嫖还是挺爽的)。我在这里主要就修改了API地址和调用模型的名称,修改完成后的settings文件完整内容如下:
(代码行后有标记的为需要修改的地方),如果用的是agicto则则不用修改settings.yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat # or azure_openai_chat
model: deepseek-chat #修改
model_supports_json: false # recommended if this is available for your model.
api_base: https://api.agicto.cn/v1 #修改
# max_tokens: 4000
# request_timeout: 180.0
# api_version: 2024-02-15-preview
# organization: <organization_id>
# deployment_name: <azure_model_deployment_name>
# tokens_per_minute: 150_000 # set a leaky bucket throttle
# requests_per_minute: 10_000 # set a leaky bucket throttle
# max_retries: 10
# max_retry_wait: 10.0
# sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
# concurrent_requests: 25 # the number of parallel inflight requests that may be made
parallelization:
stagger: 0.3
# num_threads: 50 # the number of threads to use for parallel processing
async_mode: threaded # or asyncio
embeddings:
## parallelization: override the global parallelization settings for embeddings
async_mode: threaded # or asyncio
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_embedding # or azure_openai_embedding
model: text-embedding-3-small #修改
api_base: https://api.agicto.cn/v1 #修改
# api_base: https://<instance>.openai.azure.com
# api_version: 2024-02-15-preview
# organization: <organization_id>
# deployment_name: <azure_model_deployment_name>
# tokens_per_minute: 150_000 # set a leaky bucket throttle
# requests_per_minute: 10_000 # set a leaky bucket throttle
# max_retries: 10
# max_retry_wait: 10.0
# sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
# concurrent_requests: 25 # the number of parallel inflight requests that may be made
# batch_size: 16 # the number of documents to send in a single request
# batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
# target: required # or optional
chunks:
size: 300
overlap: 100
group_by_columns: [id] # by default, we don't allow chunks to cross documents
input:
type: file # or blob
file_type: text # or csv
base_dir: "input"
file_encoding: utf-8
file_pattern: ".*\\.txt$"
cache:
type: file # or blob
base_dir: "cache"
# connection_string: <azure_blob_storage_connection_string>
# container_name: <azure_blob_storage_container_name>
storage:
type: file # or blob
base_dir: "output/${timestamp}/artifacts"
# connection_string: <azure_blob_storage_connection_string>
# container_name: <azure_blob_storage_container_name>
reporting:
type: file # or console, blob
base_dir: "output/${timestamp}/reports"
# connection_string: <azure_blob_storage_connection_string>
# container_name: <azure_blob_storage_container_name>
entity_extraction:
## llm: override the global llm settings for this task
## parallelization: override the global parallelization settings for this task
## async_mode: override the global async_mode settings for this task
prompt: "prompts/entity_extraction.txt"
entity_types: [organization,person,geo,event]
max_gleanings: 0
summarize_descriptions:
## llm: override the global llm settings for this task
## parallelization: override the global parallelization settings for this task
## async_mode: override the global async_mode settings for this task
prompt: "prompts/summarize_descriptions.txt"
max_length: 500
claim_extraction:
## llm: override the global llm settings for this task
## parallelization: override the global parallelization settings for this task
## async_mode: override the global async_mode settings for this task
# enabled: true
prompt: "prompts/claim_extraction.txt"
description: "Any claims or facts that could be relevant to information discovery."
max_gleanings: 0
community_report:
## llm: override the global llm settings for this task
## parallelization: override the global parallelization settings for this task
## async_mode: override the global async_mode settings for this task
prompt: "prompts/community_report.txt"
max_length: 2000
max_input_length: 8000
cluster_graph:
max_cluster_size: 10
embed_graph:
enabled: false # if true, will generate node2vec embeddings for nodes
# num_walks: 10
# walk_length: 40
# window_size: 2
# iterations: 3
# random_seed: 597832
umap:
enabled: false # if true, will generate UMAP embeddings for nodes
snapshots:
graphml: false
raw_entities: false
top_level_nodes: false
local_search:
# text_unit_prop: 0.5
# community_prop: 0.1
# conversation_history_max_turns: 5
# top_k_mapped_entities: 10
# top_k_relationships: 10
# max_tokens: 12000
global_search:
# max_tokens: 12000
# data_max_tokens: 12000
# map_max_tokens: 1000
# reduce_max_tokens: 2000
# concurrency: 32
6.构建GraphRAG的索引(耗时较长,取决于document的长度)
poetry run poe index --root .
成功后如下:
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.
7.进行查询
此处GraphRAG提供了两种查询方式
1)全局查询 :更侧重全文理解
poetry run poe query --root . --method global "本文主要讲了什么"
运行成功后可以看到输出结果
2)局部查询:更侧重细节
poetry run poe query --root . --method local "本文主要讲了什么"
运行成功后可以看到输出结果
8. 总结
上述过程为已经验证过的,如果报错可以检查是否正确配置api_key及api_base
二、在python包的基础上进行(快速尝试)
1. 环境安装
pip install graphrag
2. 初始化项目
创建一个临时的文件夹graphrag,用于存在运行时数据
mkdir ./graphrag/input
curl https://www.xxx.com/xxx.txt > ./myTest/input/book.txt // 这里是示例代码,根据实际情况放入自己要测试的txt文本即可。
cd ./graphrag
python -m graphrag.index --init
3. 配置相关文件(可参考上述的配置文件过程)
4. 执行并构建图索引
python -m graphrag.index
5.进行查询
1)全局查询
python -m graphrag.query --root ../myTest --method global "这篇文章主要讲述了什么内容?"
2)局部查询
python -m graphrag.query --root ../myTest --method local "这篇文章主要讲述了什么内容?"
总结
通过以上两种方式,我们已经尝试了利用源码和python资源包进行配置GraphRAG的方式。大家可以按照自己的需求尝试以上两种方法。如果还有问题,欢迎在评论区讨论!