
Crawl4AI: Automated Data Collection with AI

article 2025/3/1 18:33:45

Contents

1 Steps for Using Crawl
2 AI Agent Application Example
3 Conclusion

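Crawl is a free, open-source tool that uses AI to simplify web crawling and data extraction, making information gathering and analysis more efficient. It identifies the content of a web page and converts the data into formats that are easy to process. It is full-featured and simple to use.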

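Positioning: Crawl is an open-source AI tool that simplifies data crawling and analysis and helps extract pricing information from websites efficiently.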

1 Steps for Using Crawl

Step 1: Installation and setup

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk

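Step 2: Data extraction

Create a Python script that starts the web crawler and extracts data from a URL:

from crawl4ai import WebCrawler

# Create a WebCrawler instance
crawler = WebCrawler()

# Warm up the crawler (load the required models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content
print(result.markdown)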

Step 3: Structuring the data

Define an extraction strategy backed by an LLM (large language model) to convert the data into a structured format:

import os

from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field


# Schema describing one row of pricing data
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output tokens for the OpenAI model.")


url = "https://openai.com/api/pricing/"
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),  # read the API key from the environment
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
    ),
    bypass_cache=True,
)
print(result.extracted_content)
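The script above only prints `result.extracted_content`. Assuming that value is a JSON string holding a list of objects shaped like the schema (an assumption, since the exact output depends on the LLM), a minimal sketch for turning it into typed records could look like this. `ModelFeeRecord`, `parse_fees`, and the `sample` string are hypothetical and stand in for a real crawl result; none of them is part of crawl4ai.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ModelFeeRecord:
    """Plain-Python mirror of the OpenAIModelFee schema (hypothetical helper type)."""
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> List[ModelFeeRecord]:
    """Parse a JSON string of fee objects, skipping entries that lack a required key."""
    items = json.loads(extracted_content)
    if isinstance(items, dict):  # tolerate a single object instead of a list
        items = [items]
    records = []
    for item in items:
        try:
            records.append(ModelFeeRecord(
                model_name=item["model_name"],
                input_fee=item["input_fee"],
                output_fee=item["output_fee"],
            ))
        except (KeyError, TypeError):
            continue  # incomplete or malformed entry: skip rather than guess
    return records


# Stand-in for result.extracted_content (assumed shape, not real crawl output)
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}, {"model_name": "incomplete"}]'
)
records = parse_fees(sample)
print(records)
```

Skipping incomplete entries instead of raising keeps one bad row from discarding the whole page. A stricter pipeline might prefer to log or reject them.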
Step 4: Integrating AI agents

Integrate Crawl with Praison CrewAI agents for efficient data processing:

pip install praisonai
Create a tools file (tools.py) that wraps the Crawl tool:

# tools.py

import os

from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from praisonai_tools import BaseTool
from pydantic import BaseModel, Field


# Schema describing one row of pricing data
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input tokens for the model.")
    output_fee: str = Field(..., description="Fee for output tokens for the model.")


class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fee information from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()

        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
                schema=ModelFee.schema(),
                extraction_type="schema",
                # The example key matches the llm_model_name field of the ModelFee schema
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
            ),
            bypass_cache=True,
        )
        return result.extracted_content


if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)

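AI agent configuration
Configure the AI agents to use the Crawl tool for web scraping and data extraction. Under the crewai framework, three core roles are defined that together extract model pricing information from websites:

Web scraper: collects pricing information from sites such as OpenAI, Anthropic, and Cohere, and outputs raw HTML or JSON data.

Data cleaner: ensures the collected data is accurate and organizes it into structured JSON or CSV files.

Data analyst: analyzes the cleaned data, identifies pricing trends and patterns, and compiles a detailed report.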
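To make the data cleaner role concrete, here is a minimal sketch in plain Python of what cleaning and CSV export could involve. It is not the agent's actual implementation: the function names `clean_records` and `to_csv` are hypothetical, the field names are assumed to match the extraction schema, and the sample rows are illustrative rather than crawled data.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["model_name", "input_fee", "output_fee"]  # assumed column names


def clean_records(raw: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Strip whitespace, drop rows with missing fields, de-duplicate by model name."""
    seen = set()
    cleaned = []
    for row in raw:
        values = {f: str(row.get(f) or "").strip() for f in FIELDS}
        if not all(values.values()):
            continue  # a required field is empty or missing
        key = values["model_name"].lower()
        if key in seen:
            continue  # keep only the first occurrence of each model
        seen.add(key)
        cleaned.append(values)
    return cleaned


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative input: one padded name, one case-insensitive duplicate, one incomplete row
raw_rows = [
    {"model_name": " GPT-4 ", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"model_name": "gpt-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"model_name": "no-fees", "input_fee": "", "output_fee": ""},
]
cleaned_rows = clean_records(raw_rows)
print(to_csv(cleaned_rows))
```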

The workflow declares no additional dependencies; each role completes its own task independently.
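The data analyst role needs numbers, but the fees arrive as strings such as "US$10.00 / 1M tokens". The sketch below shows one way to turn them into values that can be compared. It assumes every fee contains a dollar amount in that form; `fee_to_float` and `summarize` are hypothetical helpers, and the `example-model` row uses made-up figures purely for illustration.

```python
import re
from typing import Dict, List, Optional

# Matches the number that follows a dollar sign, e.g. "10.00" in "US$10.00 / 1M tokens"
FEE_PATTERN = re.compile(r"\$\s*([0-9]+(?:\.[0-9]+)?)")


def fee_to_float(fee: str) -> Optional[float]:
    """Extract the dollar amount from a fee string, or None if no amount is found."""
    match = FEE_PATTERN.search(fee)
    return float(match.group(1)) if match else None


def summarize(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return numeric fees and the output/input ratio per model, cheapest input first."""
    summary = []
    for row in rows:
        input_usd = fee_to_float(row["input_fee"])
        output_usd = fee_to_float(row["output_fee"])
        if input_usd is None or output_usd is None or input_usd == 0:
            continue  # unparseable or zero input fee: a ratio cannot be computed
        summary.append({
            "model_name": row["model_name"],
            "input_usd": input_usd,
            "output_usd": output_usd,
            "output_to_input": output_usd / input_usd,
        })
    return sorted(summary, key=lambda r: r["input_usd"])


rows = [
    {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    # Made-up figures for illustration only
    {"model_name": "example-model", "input_fee": "US$0.50 / 1M tokens", "output_fee": "US$1.50 / 1M tokens"},
]
summary = summarize(rows)
for entry in summary:
    print(entry)
```

The pattern does not handle thousands separators or other currencies, so real pricing pages may need a more tolerant parser.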