Exception Handling in Python Crawlers
In Python, exception handling is typically implemented with try-except blocks. You can catch specific exception types, or catch a generic exception as a fallback.
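As a minimal sketch of the two styles before the scraper-specific code below: more specific handlers must come before the generic one, or the bare Exception clause would swallow everything (int("abc") here is just a stand-in for work that may fail):

try:
    value = int("abc")  # stand-in for an operation that may fail
except ValueError as e:
    print(f"Specific handler: {e}")  # handles exactly this error type
except Exception as e:
    print(f"Generic fallback: {e}")  # catches anything else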
1. Catching Specific Exceptions
For common network-request and parsing errors, catch concrete exception types such as requests.exceptions.RequestException and AttributeError.
Example code:
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    try:
        response = requests.get(url, timeout=10)  # set a request timeout
        response.raise_for_status()  # raise on 4xx/5xx HTTP status codes
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Network request failed: {e}")
    except Exception as e:
        print(f"Unexpected exception: {e}")
    return None

def parse_page(html):
    if not html:
        return []
    try:
        soup = BeautifulSoup(html, 'html.parser')
        items = soup.find_all('div', class_='item')
        data = []
        for item in items:
            title = item.find('h2').text.strip()
            price = item.find('span', class_='price').text.strip()
            data.append({'title': title, 'price': price})
        return data
    except AttributeError as e:
        print(f"HTML parsing failed: {e}")
        return []

# Example usage
url = "https://example.com"
html = fetch_page(url)
if html:
    data = parse_page(html)
    print(data)
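Note that parse_page above aborts the whole page on the first malformed item. If only some items are broken, a per-item try-except keeps the good records; here is a sketch under the same assumed div.item / h2 / span.price structure:

def parse_page_skip_bad(html):
    """Variant of parse_page that skips malformed items instead of failing."""
    if not html:
        return []
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    for item in soup.find_all('div', class_='item'):
        try:
            # Parse each item independently so one bad item cannot
            # discard the rest of the page.
            title = item.find('h2').text.strip()
            price = item.find('span', class_='price').text.strip()
            data.append({'title': title, 'price': price})
        except AttributeError:
            continue  # item is missing a title or price; skip it
    return data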
2. Logging Exceptions
In production, it is recommended to record exception information with the logging module so problems can be analyzed and diagnosed later.
Example code:
import logging
import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_page(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logging.error(f"Network request failed: {e}")
    except Exception as e:
        logging.error(f"Unexpected exception: {e}")
    return None
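logging.error records only the message. Inside an except block, logging.exception logs at the same ERROR level but also appends the full traceback, which is usually what you want when analyzing failures later. A self-contained sketch:

import logging

logging.basicConfig(level=logging.INFO)

try:
    raise ValueError("demo failure")  # stand-in for a failing request
except ValueError:
    # Unlike logging.error, this appends the current traceback to the record
    logging.exception("Network request failed")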
3. Retry Mechanism
When a network request fails, a retry mechanism makes the crawler more robust.
Example code:
import time
import logging
import requests
from requests.exceptions import RequestException

def fetch_page_with_retry(url, max_retries=3):
    retries = 0
    while retries < max_retries:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except RequestException as e:
            retries += 1
            logging.warning(f"Request failed, retrying ({retries}/{max_retries}): {e}")
            time.sleep(2)  # wait 2 seconds before retrying
    logging.error("Max retries reached; giving up")
    return None
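A fixed 2-second delay can keep hammering a server that is already struggling. A common refinement is exponential backoff, doubling the wait after each failure. Here is a sketch reusing the fetch logic above; the base_delay parameter is an assumption added for illustration:

import time
import logging
import requests
from requests.exceptions import RequestException

def fetch_page_with_backoff(url, max_retries=3, base_delay=1.0):
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except RequestException as e:
            logging.warning(f"Attempt {attempt}/{max_retries} failed: {e}")
            if attempt == max_retries:
                break  # no point sleeping after the final attempt
            # Wait 1s, 2s, 4s, ... so the server gets room to recover
            time.sleep(base_delay * (2 ** (attempt - 1)))
    logging.error("Max retries reached; giving up")
    return None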
Exception Handling in Java Crawlers
In Java, exception handling is typically implemented with try-catch blocks. You can catch specific exception types such as IOException and ParseException.
1. Catching Specific Exceptions
For common network-request and parsing errors, catch concrete exception types.
Example code:
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class WebScraper {
    public static String fetchPage(String url) {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(10000); // connect timeout
            connection.setReadTimeout(10000);    // read timeout
            int responseCode = connection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                Scanner scanner = new Scanner(connection.getInputStream());
                StringBuilder response = new StringBuilder();
                while (scanner.hasNextLine()) {
                    response.append(scanner.nextLine()).append('\n');
                }
                scanner.close();
                return response.toString();
            } else {
                System.out.println("Request failed, status code: " + responseCode);
            }
        } catch (IOException e) {
            System.err.println("Network request failed: " + e.getMessage());
        } catch (Exception e) {
            System.err.println("Unexpected exception: " + e.getMessage());
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "https://example.com";
        String html = fetchPage(url);
        if (html != null) {
            System.out.println(html);
        }
    }
}
2. Logging Exceptions
In production, it is recommended to record exception information with a logging framework such as Log4j or SLF4J.
Example code:
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class WebScraper {
    private static final Logger logger = LogManager.getLogger(WebScraper.class);

    public static String fetchPage(String url) {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(10000);
            connection.setReadTimeout(10000);
            int responseCode = connection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                Scanner scanner = new Scanner(connection.getInputStream());
                StringBuilder response = new StringBuilder();
                while (scanner.hasNextLine()) {
                    response.append(scanner.nextLine()).append('\n');
                }
                scanner.close();
                return response.toString();
            } else {
                logger.error("Request failed, status code: " + responseCode);
            }
        } catch (IOException e) {
            logger.error("Network request failed: " + e.getMessage());
        } catch (Exception e) {
            logger.error("Unexpected exception: " + e.getMessage());
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "https://example.com";
        String html = fetchPage(url);
        if (html != null) {
            logger.info(html);
        }
    }
}
3. Retry Mechanism
When a network request fails, a retry mechanism makes the crawler more robust.
Example code:
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class WebScraper {
    private static final Logger logger = LogManager.getLogger(WebScraper.class);

    public static String fetchPageWithRetry(String url, int maxRetries) {
        int retries = 0;
        while (retries < maxRetries) {
            try {
                HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
                connection.setRequestMethod("GET");
                connection.setConnectTimeout(10000);
                connection.setReadTimeout(10000);
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    Scanner scanner = new Scanner(connection.getInputStream());
                    StringBuilder response = new StringBuilder();
                    while (scanner.hasNextLine()) {
                        response.append(scanner.nextLine()).append('\n');
                    }
                    scanner.close();
                    return response.toString();
                } else {
                    // Count a bad status code as a failed attempt, otherwise the loop never ends
                    retries++;
                    logger.warn("Request failed, status code: " + responseCode
                            + " (" + retries + "/" + maxRetries + ")");
                }
            } catch (IOException e) {
                retries++;
                logger.warn("Request failed, retrying (" + retries + "/" + maxRetries + "): " + e.getMessage());
                try {
                    Thread.sleep(2000); // wait 2 seconds before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            } catch (Exception e) {
                logger.error("Unexpected exception: " + e.getMessage());
                break;
            }
        }
        logger.error("Max retries reached; giving up");
        return null;
    }

    public static void main(String[] args) {
        String url = "https://example.com";
        String html = fetchPageWithRetry(url, 3);
        if (html != null) {
            logger.info(html);
        }
    }
}
Summary
Well-designed exception handling makes a crawler noticeably more stable and reliable. The main strategies are:
- Use try-catch (try-except in Python) to catch exceptions.
- Log exception details for later analysis.
- Add a retry mechanism for network failures.
- Handle different exception types separately.
- Clean up resources in a finally block (see the sketch after this list).
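As a closing sketch of the last two points, here is one way in Python to classify exceptions and guarantee cleanup; the use of requests.Session is an illustrative choice, not something the examples above require:

import logging
import requests

def fetch_with_cleanup(url):
    session = requests.Session()
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.Timeout:
        logging.error("Request timed out: %s", url)  # often worth retrying
    except requests.exceptions.HTTPError as e:
        logging.error("Bad HTTP status: %s", e)      # usually not retryable
    except requests.exceptions.RequestException as e:
        logging.error("Other network error: %s", e)
    finally:
        session.close()  # runs whether or not an exception was raised
    return None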
In practice, tune these strategies to the crawler's specific requirements and the target site's behavior so that it keeps running reliably in complex environments.