smolagents Study Notes (Part 10): Examples - Web Browser Automation with Agents
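This article covers Web Browser Automation with Agents, from the Examples chapter of the official tutorial. It shows how to build an agent-driven web browsing feature that combines the visual modality, covering the following capabilities: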
- Navigate to web pages: open a given URL;
- Click on elements: click objects on the page;
- Search within pages: search for text in the current page;
- Handle popups and modals: dismiss pop-up content;
- Extract information: pull information out of the page;
- Official link: https://huggingface.co/docs/smolagents/v1.9.2/en/examples/web_browser;
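Install the following dependencies (python-dotenv is included because the code below imports dotenv):
$ pip install smolagents selenium helium pillow python-dotenv -q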
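Before running the full script, it can help to confirm that every module it imports is actually available. The following is a minimal sketch using only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here, and the list reflects the imports used later in this article (note that pillow is imported as `PIL` and python-dotenv as `dotenv`).

```python
from importlib.util import find_spec

# Import names used by the script in this article.
# pip name -> import name differs for: pillow -> PIL, python-dotenv -> dotenv.
REQUIRED_MODULES = ["smolagents", "selenium", "helium", "PIL", "dotenv"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that the packages can be located; it does not verify that Chrome or a matching driver is installed.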
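To implement these features, the following steps are needed:
- Define tools that operate on the web page, including Ctrl+F search, going back, and closing pop-ups;
- Configure the browser; the official example uses Chrome;
- Define the agent and the model;
- Write the instruction prompt;
- Have the agent run the prompt;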
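The Ctrl+F tool in the full code below is implemented as an XPath text search that places the search string between single quotes, so a query containing an apostrophe (for example `Chicago's`) would produce an invalid expression. A minimal sketch of one possible workaround, based on XPath's `concat()` function; `xpath_literal` and `contains_text_xpath` are hypothetical helpers, not part of selenium or smolagents:

```python
def xpath_literal(text: str) -> str:
    """Return `text` as an XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types are present: split on single quotes and join the
    # pieces with concat(), inserting each single quote as the literal "'".
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def contains_text_xpath(text: str) -> str:
    """Build an XPath matching elements whose text contains `text`."""
    return f"//*[contains(text(), {xpath_literal(text)})]"
```

Inside the tool, the lookup could then read `driver.find_elements(By.XPATH, contains_text_xpath(text))`.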
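The complete code is as follows:
[Note]: The official example uses the meta-llama/Llama-3.3-70B-Instruct model, but tokens for that model must be purchased. If you switch to the default Qwen-Coder model as in the earlier articles, the run stops at some intermediate step, because the default free model does not accept inputs longer than 10000 tokens. Readers who are able to can buy some tokens to try the full functionality.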
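Before the full script, here is a rough sketch for getting a feel for how quickly a 10000-token input limit is reached. The ratio of 4 characters per token is a common rule of thumb for English text, not the tokenizer of any particular model, and `estimate_tokens` and `fits_budget` are hypothetical helper names. Screenshots add further input on top of the text, so treat the result as a lower bound.

```python
import math

CHARS_PER_TOKEN = 4   # assumption: rough rule of thumb, not a real tokenizer
TOKEN_LIMIT = 10_000  # the input limit mentioned in the note above


def estimate_tokens(text: str) -> int:
    """Roughly estimate the number of tokens in `text`."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_budget(texts, limit=TOKEN_LIMIT) -> bool:
    """Check whether the combined estimate for all texts stays within `limit`."""
    return sum(estimate_tokens(t) for t in texts) <= limit
```

For example, `fits_budget([search_request, helium_instructions])` would give a first impression of how much of the budget the prompt alone consumes before any page content is added.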
from io import BytesIO
from time import sleep
import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from smolagents import CodeAgent, tool
from smolagents.agents import ActionStep
from smolagents import HfApiModel
load_dotenv()
#----------------------------------------------------------------#
# Step 1. Define the web page tools
@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()
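
@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
    This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    # Return a message so the result matches the declared str return type
    return "Sent ESC to dismiss any visible pop-up."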
#----------------------------------------------------------------#
# Step 2. Configure Chrome
# Configure Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--force-device-scale-factor=1")
chrome_options.add_argument("--window-size=1000,1350")
chrome_options.add_argument("--disable-pdf-viewer")
chrome_options.add_argument("--window-position=0,0")
# Initialize the browser
driver = helium.start_chrome(headless=False, options=chrome_options)
# Set up screenshot callback
def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = memory_step.step_number
    if driver is not None:
        for previous_memory_step in agent.memory.steps:  # Remove previous screenshots for lean processing
            if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:
                previous_memory_step.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        memory_step.observations_images = [image.copy()]  # Create a copy to ensure it persists
    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    memory_step.observations = (
        url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
    )
#----------------------------------------------------------------#
# Step 3. Define the agent
# Initialize the model
# If you have a token for the model below, use these two lines
# model_id = "meta-llama/Llama-3.3-70B-Instruct"
# model = HfApiModel(model_id)
# If you only have a free token, use this line
model = HfApiModel()
# Create the agent
agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)
# Import helium for the agent
agent.python_executor("from helium import *", agent.state)
#----------------------------------------------------------------#
# Step 4. Write the instruction prompt
helium_instructions = """
You can use helium to access websites. Don't bother about the helium driver, it's already managed.
We've already ran "from helium import *"
Then you can go to pages!
Code:
```py
go_to('github.com/trending')
```<end_code>
You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>
If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>
If you try to interact with an element and it's not found, you'll get a LookupError.
In general stop your action after each button click to see what happens on your screenshot.
Never try to login in a page.
To scroll up or down, use scroll_down or scroll_up with as an argument the number of pixels to scroll from.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>
When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>
You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>
"""
search_request = """
Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word "1992" that mentions a construction accident.
"""
#----------------------------------------------------------------#
# Step 5. Run the agent with the prompt
agent_output = agent.run(search_request + helium_instructions)
print("Final output:")
print(agent_output)
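The `save_screenshot` callback above drops screenshots from steps that are two or more steps old, which keeps the model input smaller. The pruning rule in isolation can be sketched as follows; `FakeStep` is a hypothetical stand-in for smolagents' `ActionStep` that carries only the two fields used here, and `prune_screenshots` and `keep_last` are names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for ActionStep (only the fields used here)."""
    step_number: int
    observations_images: Optional[list] = None


def prune_screenshots(steps: List[FakeStep], current_step: int, keep_last: int = 2) -> None:
    """Drop screenshots from every step older than the last `keep_last` steps."""
    for step in steps:
        if step.step_number <= current_step - keep_last:
            step.observations_images = None


# Five steps, each holding a placeholder screenshot; the agent is at step 5.
steps = [FakeStep(n, observations_images=[f"screenshot-{n}"]) for n in range(1, 6)]
prune_screenshots(steps, current_step=5)
kept = [s.step_number for s in steps if s.observations_images is not None]
print(kept)  # prints [4, 5]
```

Raising `keep_last` would keep more visual context at the cost of a larger model input; the callback in this article hard-codes the value 2.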
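Running with a free token gives the result below. The agent gets stuck at some intermediate step, and where exactly is a matter of luck: sometimes the token-limit error appears right after the page opens, before any scrolling, and sometimes only after many scrolls: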
$ python demo.py