
Python Libraries for Data Processing, Machine Learning, Automation, and Web Scraping

article 2025/2/19 6:23:36

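Python is well suited to data processing, machine learning, and automation tasks, and its library ecosystem covers most of what these fields need. This article introduces the most commonly used libraries in three groups:

Data processing libraries: Pandas, NumPy, and others
Machine learning libraries: Scikit-learn, TensorFlow, Keras, and others
Automation and web scraping: Selenium, Requests, BeautifulSoup, Scrapy, and others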

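I. Data Processing Libraries in Python

1.1 Pandas

Pandas is one of the most popular data processing libraries for Python, designed for structured data such as tables and CSV files. It provides two main data structures, Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled table), which make data manipulation efficient.

Pandas basic usage

Install Pandas: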
```
pip install pandas
```

Creating a DataFrame:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}

df = pd.DataFrame(data)
print(df)

Reading and writing CSV files:

# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Write the DataFrame to a CSV file (index=False omits the row index)
df.to_csv('output.csv', index=False)
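
As a minimal sketch of the round trip, the example below writes a small DataFrame to CSV text held in memory and reads it back. The in-memory buffer (`io.StringIO`) and the sample rows are assumptions made for illustration, so no file on disk is needed.

```python
import io

import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the CSV

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(buffer.getvalue())
print(restored.equals(df))
```

Without `index=False`, the row index would be written as an extra unnamed column and would come back as data when the file is read again.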

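Common data operations:

# Show the first few rows (5 by default)
print(df.head())

# Filter rows: keep only people older than 30
df_filtered = df[df['Age'] > 30]

# Add a new column: a bonus equal to 10% of salary
df['Bonus'] = df['Salary'] * 0.1

# Group by age and average the numeric columns
# (numeric_only=True skips text columns such as Name, which raise an error in pandas 2.x)
grouped = df.groupby('Age').mean(numeric_only=True)

# Handle missing values
df.fillna(0, inplace=True)  # replace missing values with 0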
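
Filling with 0 is not always the right choice, because it can distort later averages. The sketch below, which uses hypothetical data and a hypothetical `Dept` column, fills a missing salary with the column mean and then aggregates one numeric column per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a text column and one missing salary
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Dept": ["Eng", "Eng", "Ops", "Ops"],
    "Salary": [50000.0, np.nan, 70000.0, 90000.0],
})

# Fill the missing salary with the column mean rather than 0,
# so the replacement value does not drag the averages down
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Select the numeric column explicitly before aggregating,
# so text columns such as Name never reach mean()
mean_salary = df.groupby("Dept")["Salary"].mean()
print(mean_salary)
```

Which fill value is appropriate depends on the data; the mean is only one common option.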

1.2 NumPy

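NumPy is Python's numerical computing library, built for operations on large arrays and matrices. Pandas data structures are largely built on top of NumPy.

NumPy basic usage

Install NumPy: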
```
pip install numpy
```

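Creating arrays:

import numpy as np

# Create a one-dimensional array
arr = np.array([1, 2, 3])

# Create a two-dimensional array
matrix = np.array([[1, 2], [3, 4]])

Array operations:

# Add 2 to every element (broadcasting)
arr_sum = arr + 2

# Matrix multiplication (equivalent to matrix @ matrix)
mat_mul = np.dot(matrix, matrix)

Array statistics:

# Sum
total = np.sum(arr)

# Mean
mean = np.mean(arr)

# Standard deviation (population formula, ddof=0, by default)
std_dev = np.std(arr)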
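
A few details of these operations are easy to get wrong, so here is a short sketch using the same `arr` and `matrix` values. It contrasts matrix multiplication with element-wise multiplication, shows the `axis` argument, and shows the `ddof` parameter of `np.std`.

```python
import numpy as np

arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

# The @ operator performs matrix multiplication, like np.dot for 2-D arrays
product = matrix @ matrix          # [[7, 10], [15, 22]]

# * multiplies element by element, which is a different operation
elementwise = matrix * matrix      # [[1, 4], [9, 16]]

# axis=0 aggregates down the rows, giving one result per column
column_sums = matrix.sum(axis=0)   # [4, 6]

# np.std defaults to the population formula (ddof=0);
# ddof=1 gives the sample standard deviation
population_std = np.std(arr)
sample_std = np.std(arr, ddof=1)   # 1.0

print(product, elementwise, column_sums, population_std, sample_std)
```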

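1.3 Data visualization libraries: Matplotlib and Seaborn

Matplotlib is the foundational plotting library. Seaborn is a higher-level library built on top of Matplotlib that offers a more concise interface and more polished default styles.

Install Matplotlib and Seaborn: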
```
pip install matplotlib seaborn
```

Matplotlib example

import matplotlib.pyplot as plt

# Draw a simple line chart
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Simple Line Plot')
plt.show()

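Seaborn example

import matplotlib.pyplot as plt
import seaborn as sns

# Load an example dataset (downloaded on first use, so this needs network access)
tips = sns.load_dataset("tips")

# Draw a scatter plot
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()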
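
Because Seaborn is built on Matplotlib, its plotting functions hand back Matplotlib objects that can be customized further. The sketch below assumes a small hand-made DataFrame in place of the `tips` dataset and uses the non-interactive `Agg` backend so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data standing in for a real dataset
df = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0], "tip": [1.5, 3.0, 5.0]})

# Seaborn's axes-level functions return a Matplotlib Axes object
ax = sns.scatterplot(x="total_bill", y="tip", data=df)

# The usual Matplotlib methods then customize the figure
ax.set_title("Tip vs. total bill")
ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")

# Render the figure as PNG bytes in memory instead of showing it
png = io.BytesIO()
plt.savefig(png, format="png")
```

Passing a file name to `plt.savefig` instead of the buffer writes the image to disk.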

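II. Machine Learning Libraries in Python

2.1 Scikit-learn

Scikit-learn is a full-featured machine learning library that includes classic machine learning algorithms, data preprocessing tools, and model evaluation utilities. It is especially well suited to building and training traditional models for regression, classification, clustering, and similar tasks.

Install Scikit-learn: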
```
pip install scikit-learn
```

Scikit-learn basic usage

Loading a dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Training a model:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the model and train it
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Predict labels for the test set
y_pred = clf.predict(X_test)

Evaluating the model:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
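
A single random split gives an accuracy that changes from run to run. As a hedged sketch of a more stable setup, the example below fixes `random_state`, stratifies the split, and adds 5-fold cross-validation; the value 42 and the choice of five folds are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(random_state=42)

# 5-fold cross-validation on the training portion gives five accuracy
# scores instead of a single number from one split
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The held-out test set stays untouched here, so it can still be used once at the end for a final accuracy estimate.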

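2.2 TensorFlow and Keras

TensorFlow is a popular open-source deep learning framework. Keras is a high-level neural network API that ships with TensorFlow as tf.keras and offers a more concise interface. Together they are widely used to build and train deep neural networks.

Install TensorFlow (Keras is included as tf.keras):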
```
pip install tensorflow
```

TensorFlow/Keras basic usage

Building a simple neural network (this reuses X_train, X_test, y_train, and y_test from the Scikit-learn example above):

import tensorflow as tf
from tensorflow.keras import layers

# Build the model: 4 input features, two hidden layers, 3 output classes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(3, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy}")
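
The `softmax` activation and the `sparse_categorical_crossentropy` loss used above can be hard to picture. The NumPy sketch below shows what they compute on two hypothetical samples; it is an illustration of the math, not Keras's own implementation.

```python
import numpy as np


def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.log(true_class_probs).mean()


# Hypothetical scores for two samples and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])  # integer class indices, not one-hot vectors

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.sum(axis=1))
print(loss)
```

"Sparse" refers to the label format: the targets are integer class indices such as the values in `iris.target`, so no one-hot encoding is needed.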

2.3 PyTorch

PyTorch is another popular deep learning framework, favored by researchers for its dynamic computation graph and flexibility.

Install PyTorch:
```
pip install torch
```

PyTorch example

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple linear model: one input feature, one output
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Initialize the model, loss function, and optimizer
model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train on a single fixed sample for illustration (real code would loop over data X and y)
for epoch in range(100):
    optimizer.zero_grad()  # clear gradients from the previous step
    outputs = model(torch.tensor([[1.0]]))  # the input is 1
    loss = criterion(outputs, torch.tensor([[2.0]]))  # the target output is 2
    loss.backward()  # backpropagate
    optimizer.step()  # update the parameters

print("Training complete")
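
The loop above trains on one fixed sample. As a minimal sketch of training on an actual dataset, the example below fits the same kind of linear model to hypothetical points on the line y = 2x + 1; the data, the learning rate of 0.1, and the 500 epochs are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # repeatable weight initialization

# Hypothetical training data following y = 2x + 1
X = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)  # shape (20, 1)
y = 2.0 * X + 1.0

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(500):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(X), y)  # forward pass over the whole batch
    loss.backward()                # backpropagate
    optimizer.step()               # update weight and bias

print(f"weight={model.weight.item():.3f}, bias={model.bias.item():.3f}")
```

After training, the learned weight and bias should be close to 2 and 1, the slope and intercept used to generate the data.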

III. Automation and Web Scraping

3.1 Automation tools

Selenium

Selenium is a tool for automating web browsers, widely used for automated testing and web scraping.

Install Selenium:
```
pip install selenium
```

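Automating browser actions with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start the browser (requires Chrome to be installed)
driver = webdriver.Chrome()

# Open a page
driver.get("https://www.example.com")

# Find an element and interact with it
# (assumes the page has an input named "q"; find_element_by_name was removed in Selenium 4)
element = driver.find_element(By.NAME, "q")
element.send_keys("Selenium")
element.submit()

# Close the browser
driver.quit()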

3.2 HTTP request library: Requests

Requests is a simple yet capable HTTP library, well suited to calling APIs and to basic web scraping.

Install Requests:
```
pip install requests
```

Sending an HTTP request:

import requests

# Send a GET request (the timeout keeps the call from hanging indefinitely)
response = requests.get('https://api.example.com/data', timeout=10)

# Parse the JSON body
data = response.json()
print(data)
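
Real API calls usually need query parameters and error handling. The sketch below defines a hypothetical `fetch_json` helper and then prepares a request without sending it, to show how Requests encodes parameters into the URL. The `page` and `q` parameters are assumptions made for illustration.

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """GET url and return the decoded JSON body, raising on HTTP errors."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.json()


# Preparing a request shows the final URL without sending anything
prepared = requests.Request(
    "GET", "https://api.example.com/data", params={"page": 2, "q": "python"}
).prepare()
print(prepared.url)
```

Passing parameters as a dictionary lets Requests handle URL encoding, which is safer than building the query string by hand.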

3.3 BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML. It is usually paired with Requests and is well suited to extracting data from web pages.

Install BeautifulSoup:
```
pip install beautifulsoup4
```

Parsing a page and extracting data:

from bs4 import BeautifulSoup
import requests

# Send the request
response = requests.get('https://example.com')

# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the page title
title = soup.title.string
print(f"Page title: {title}")
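
Beyond the title, most scraping tasks collect repeated elements such as links. The sketch below parses a hypothetical HTML string, so it runs without network access, and extracts link targets and link text with `find_all` and `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
html = """
<html><head><title>Demo page</title></head>
<body>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
  <p>No link here</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; .get reads an attribute safely
links = [a.get("href") for a in soup.find_all("a")]

# select takes a CSS selector; get_text returns the text inside the tag
labels = [a.get_text() for a in soup.select("a.nav")]

print(soup.title.string, links, labels)
```

In a real scraper, `html` would be replaced by `response.content` from a Requests call.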

3.4 Scrapy

Scrapy is a framework for building robust web crawlers, suited to large-scale scraping jobs.

Install Scrapy:
```
pip install scrapy
```

A basic Scrapy example:

scrapy startproject myspider

After changing into the project directory, add a spider script under the spiders directory.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run the spider:
```
scrapy crawl quotes
```
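
In the spider above, `response.follow` accepts a relative URL such as the extracted `href` and resolves it against the URL of the current response. The standard-library sketch below shows the same kind of resolution with `urljoin`; the `/page/2/` value is an assumed example of what the selector might return, and this is not Scrapy's own code.

```python
from urllib.parse import urljoin

# URL of the page being parsed and the href taken from its "next" link
page_url = "http://quotes.toscrape.com/"
next_href = "/page/2/"

# Resolve the relative href against the current page URL
next_url = urljoin(page_url, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```

This is why the spider can pass `next_page` to `response.follow` directly without building the absolute URL itself.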

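Summary

Python has a rich library ecosystem covering data processing, machine learning, automation, and web scraping. You can process data efficiently with Pandas and NumPy, build machine learning models with Scikit-learn and TensorFlow, and automate browsers and scrape the web with Selenium, Requests, and related libraries. Combined, these tools cover the full workflow from data collection through analysis and modeling to automation.

To explore these libraries further, work through hands-on projects and choose tools according to your specific needs.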
