
ScrapeGraphAI: an AI web scraper

article 2024/11/17 1:26:59

Official site: https://scrapegraph-ai.readthedocs.io/en/latest/

from flask import Flask, request, jsonify
from scrapegraphai.graphs import SmartScraperGraph

app = Flask(__name__)

# Placeholder key; replace with your own (ideally loaded from an environment variable).
openai_key = "sk-xxxxxxxxxxxxxxxxxxxx"

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
}


@app.route('/scrape', methods=['POST'])
def scrape():
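    # silent=True yields None instead of raising when the body is missing or not valid JSON.
    data = request.get_json(silent=True)
    source_url = data.get('source') if isinstance(data, dict) else None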

    if not source_url:
        return jsonify({"error": "No source URL provided"}), 400

    smart_scraper_graph = SmartScraperGraph(
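        prompt="""Please follow these steps:
    1. Carefully analyze the page structure, then identify and extract the main body content.
    2. Exclude all non-body elements, including but not limited to: navigation menus, sidebars, footers, ads, comment sections, and related-article recommendations.
    3. If the extracted body content exceeds 14000 tokens, summarize it appropriately, keeping the core information and main points.
    4. Return the processed body content or the summary directly, without adding any extra explanation, headings, or formatting.

    Make sure the returned content contains only the substantive body of the page.""",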
        source=source_url,
        config=graph_config
    )

    try:
        result = smart_scraper_graph.run()
        return jsonify({"result": result})
    except Exception as e:
        return jsonify({"error": str(e)}), 500


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
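
To show how the `/scrape` endpoint above might be called, here is a minimal client sketch using only the standard library. The helper names `build_scrape_request` and `call_scrape`, the default base URL `http://127.0.0.1:5000`, and the 120-second timeout are assumptions for illustration and are not part of the original post.

```python
import json
import urllib.request

# Assumed default: the Flask app from the post running locally on port 5000.
DEFAULT_BASE_URL = "http://127.0.0.1:5000"


def build_scrape_request(source_url, base_url=DEFAULT_BASE_URL):
    """Build a POST request carrying {"source": <url>} as a JSON body."""
    payload = json.dumps({"source": source_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_scrape(source_url, base_url=DEFAULT_BASE_URL, timeout=120):
    """Send the request and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError for the endpoint's 400 and 500
    responses, so callers should be ready to catch it.
    """
    request = build_scrape_request(source_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_scrape("https://example.com/some-article")` should return a dictionary with a `result` key on success. The generous timeout is a guess, since each request waits on an LLM call.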

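The code is based on the official example and works quite well. However, GPT-3.5 often throws an error when the article is too long, and it still fails frequently even though my prompt explicitly says to summarize anything over 14000 tokens. GPT-4o is therefore a good choice: you give up a little speed in exchange for higher accuracy.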
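
One way to act on this trade-off is to try a faster model first and fall back to a stronger one only when the call fails. The sketch below is not from the original post: `run_with_model_fallback` and the `run_once` callback are hypothetical names, and how `run_once` builds and runs a `SmartScraperGraph` for a given model is left to the caller.

```python
from typing import Any, Callable, Sequence


def run_with_model_fallback(run_once: Callable[[str], Any],
                            models: Sequence[str]) -> Any:
    """Call run_once(model) for each model in order; return the first success.

    run_once is a caller-supplied function (hypothetical here) that is expected
    to build a graph config for the given model name and run the scraper.
    """
    if not models:
        raise ValueError("models must not be empty")

    last_error = None
    for model in models:
        try:
            return run_once(model)
        except Exception as exc:  # broad on purpose, like the endpoint's catch-all
            last_error = exc
    # Every model failed: surface the most recent error to the caller.
    raise last_error
```

In the Flask handler, `run_once` could copy `graph_config`, set `config["llm"]["model"]` to the given name, and return `SmartScraperGraph(...).run()`. The exact model identifiers to pass in `models` depend on the installed ScrapeGraphAI version.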
