
smolagents Study Notes Series (10): Examples - Web Browser Automation with Agents

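This post covers the "Web Browser Automation with Agents" article in the Examples chapter of the official tutorial. It shows how to build an agent-driven web browsing feature that also uses the vision modality, with the following capabilities: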

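  1. Navigate to web pages: open a specified page;
  2. Click on elements: click objects on the page;
  3. Search within pages: search for text in the current page;
  4. Handle popups and modals: dismiss pop-up content on the page;
  5. Extract information: pull information out of the page;
  • Official link: https://huggingface.co/docs/smolagents/v1.9.2/en/examples/web_browser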

Install the following dependencies:

$ pip install smolagents selenium helium pillow python-dotenv -q
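
Before running the script, it can help to confirm that the packages are actually importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is made up for illustration, and the list holds the import names used by the script later in this post (note that pillow is imported as `PIL`, and `dotenv` is provided by the python-dotenv package).

```python
from importlib.util import find_spec

# Import names used by the script in this post (pillow -> PIL, python-dotenv -> dotenv).
REQUIRED_MODULES = ["smolagents", "selenium", "helium", "PIL", "dotenv"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported as missing, installing the corresponding package with pip should resolve it.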

To implement the capabilities above, the following steps are needed:

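  1. Define the tools that operate on the page, including a Ctrl+F style search, going back, and closing pop-ups;
  2. Configure the browser; the official example uses Chrome;
  3. Define the agent and the model;
  4. Write the prompt that describes the task and how to use helium;
  5. Have the agent run the prompt;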

The complete code is as follows:

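[Note]: The official example uses the meta-llama/Llama-3.3-70B-Instruct model, but tokens for that model have to be purchased. If you instead switch to the default free Qwen-Coder model, as in the earlier posts in this series, the run will stop at some intermediate step, because the default free model does not accept inputs longer than 10,000 tokens. Readers who are able to can buy some tokens to try out the full functionality.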

from io import BytesIO
from time import sleep

import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

from smolagents import CodeAgent, tool
from smolagents.agents import ActionStep
from smolagents import HfApiModel

load_dotenv()

#----------------------------------------------------------------# 
# Step 1. Define the web page tools
@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
    This does not work on cookie consent banners.
    """
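    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    # Return a message so the result matches the declared `-> str` return type
    return "Sent ESCAPE to dismiss any visible pop-up."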

#----------------------------------------------------------------# 
# Step 2. Configure Chrome

# Configure Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--force-device-scale-factor=1")
chrome_options.add_argument("--window-size=1000,1350")
chrome_options.add_argument("--disable-pdf-viewer")
chrome_options.add_argument("--window-position=0,0")

# Initialize the browser
driver = helium.start_chrome(headless=False, options=chrome_options)

# Set up screenshot callback
def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = memory_step.step_number
    if driver is not None:
        for previous_memory_step in agent.memory.steps:  # Remove previous screenshots for lean processing
            if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:
                previous_memory_step.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        memory_step.observations_images = [image.copy()]  # Create a copy to ensure it persists

        # Update observations with current URL (only when a driver is available)
        url_info = f"Current url: {driver.current_url}"
        memory_step.observations = (
            url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
        )
    
#----------------------------------------------------------------# 
# Step 3. Define the agent

# Initialize the model
# If you have a token for the model below, use these two lines
# model_id = "meta-llama/Llama-3.3-70B-Instruct"
# model = HfApiModel(model_id)
# If you only have the free token, use the following line instead
model = HfApiModel()

# Create the agent
agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

# Import helium for the agent
agent.python_executor("from helium import *", agent.state)

#----------------------------------------------------------------# 
# Step 4. Write the prompt

helium_instructions = """
You can use helium to access websites. Don't bother about the helium driver, it's already managed.
We've already ran "from helium import *"
Then you can go to pages!
Code:
```py
go_to('github.com/trending')
```<end_code>

You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>

If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>

If you try to interact with an element and it's not found, you'll get a LookupError.
In general stop your action after each button click to see what happens on your screenshot.
Never try to login in a page.

To scroll up or down, use scroll_down or scroll_up with as an argument the number of pixels to scroll from.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>

When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>

You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>
"""


search_request = """
Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word "1992" that mentions a construction accident.
"""

#----------------------------------------------------------------# 
# Step 5. Have the agent run the prompt
agent_output = agent.run(search_request + helium_instructions)
print("Final output:")
print(agent_output)
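
One caveat about `search_item_ctrl_f`: it interpolates the search text directly into a single-quoted XPath string, so a search term containing an apostrophe (for example `Chicago's`) would produce an invalid expression. XPath 1.0 has no escape character inside string literals, so a common workaround is to switch quote styles, or to fall back to `concat()` when both quote types appear. The helpers below are a sketch of that idea. `xpath_literal` and `contains_text_xpath` are hypothetical names introduced here, not part of selenium or smolagents.

```python
def xpath_literal(text: str) -> str:
    """Return `text` as a valid XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types are present: split on single quotes and rejoin the
    # pieces inside concat(), using a double-quoted apostrophe between them.
    pieces = []
    for i, part in enumerate(text.split("'")):
        if i > 0:
            pieces.append('"\'"')
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


def contains_text_xpath(text: str) -> str:
    """Build the same query the tool uses, with the search text quoted safely."""
    return f"//*[contains(text(), {xpath_literal(text)})]"


print(contains_text_xpath("Chicago's"))
```

Under this assumption, the lookup line inside the tool could become `driver.find_elements(By.XPATH, contains_text_xpath(text))`, leaving the rest of the function unchanged.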

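When run with the free token, the agent gets stuck at some intermediate step. Where it stops is essentially random: sometimes the token-limit error appears right after the page opens, before any scrolling, and sometimes the agent scrolls many times before the error shows up: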

$ python demo.py
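
Since the stall depends on how much input the model receives, it is worth looking at what the script retains between steps. The `save_screenshot` callback clears `observations_images` on every step whose number is at most `current_step - 2`, so at most the two most recent screenshots stay attached. Below is a stand-alone sketch of that rule. `FakeStep` and `prune_old_screenshots` are hypothetical stand-ins for illustration (the real callback works on smolagents `ActionStep` objects), and plain strings take the place of PIL images.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for an ActionStep (only the fields used here)."""
    step_number: int
    observations_images: Optional[List[str]] = None


def prune_old_screenshots(steps: List[FakeStep], current_step: int) -> None:
    """Clear images on steps numbered <= current_step - 2, as the callback does."""
    for step in steps:
        if step.step_number <= current_step - 2:
            step.observations_images = None


# Five steps, each carrying one placeholder "screenshot".
steps = [FakeStep(n, [f"screenshot-{n}"]) for n in range(1, 6)]
prune_old_screenshots(steps, current_step=5)

kept = [s.step_number for s in steps if s.observations_images is not None]
print(kept)  # [4, 5]
```

Only the image lists are cleared. The text observations, including the "Current url" line the callback appends, remain on every step.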


