
Machine Learning: Logistic Regression

article 2025/2/21 22:43:06

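Introduction

Despite the word "regression" in its name, logistic regression is used for classification. It builds on the linear regression model: a logistic function (the sigmoid) maps the linear combination of the features to a value between 0 and 1, which is interpreted as the probability that a sample belongs to a given class.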

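I. Logistic Regression vs. Linear Regression

Property	Logistic regression	Linear regression
Task type	Classification (mainly binary)	Regression (predicting continuous values)
Output range	(0, 1) (a probability)	(-∞, +∞)
Core function	Sigmoid applied to a linear function	Linear function
Loss function	Log loss (cross-entropy)	Mean squared error (MSE)
Optimization objective	Maximize the likelihood	Minimize the sum of squared errors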

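II. Core Mathematics

1. The Sigmoid Function

Formula:
\( \sigma(z) = \frac{1}{1 + e^{-z}} \)
Purpose: maps the linear combination \( z = w^T x \) (with the intercept absorbed into \( w \) by giving \( x \) a constant feature 1) to the interval (0, 1) so that it can be read as a probability.
Properties:
- The output is interpreted as the probability that the sample belongs to the positive class: \( P(y=1|x) = \sigma(w^T x) \)
- At \( z = 0 \), \( \sigma(z) = 0.5 \), so \( w^T x = 0 \) is the decision boundary under the usual 0.5 threshold.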
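As a quick numerical check of the properties above, the sketch below evaluates the sigmoid at a few points. The helper name `stable_sigmoid` is illustrative and not part of the original article; it branches on the sign of z so that `np.exp` never receives a large positive argument.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for a 1-D array that avoids overflow in np.exp (illustrative helper)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p = stable_sigmoid(z)
print(p)                       # probabilities for each z
print(p + stable_sigmoid(-z))  # symmetry: sigma(z) + sigma(-z) = 1
```

Mathematically the output lies strictly inside (0, 1), but in floating point it saturates to exactly 0 or 1 for extreme inputs, which is why loss computations often clip probabilities before taking a logarithm.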

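2. Loss Function (Log Loss)

Formula:
\( J(w) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)} \log(h_w(x^{(i)})) + (1-y^{(i)}) \log(1 - h_w(x^{(i)}))] \)
where \( h_w(x) = \sigma(w^T x) \) is the predicted probability and \( m \) is the number of training samples.
Characteristics:
- It is convex in \( w \), so gradient descent with a suitable learning rate converges toward the global minimum.
- It penalizes predicted probabilities that deviate from the true label, and the penalty grows without bound as a confident prediction turns out to be wrong.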
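To make the penalty concrete, here is a minimal sketch that evaluates the log loss for three sets of predictions on the same labels. The function name `log_loss_manual` and the `eps` clipping are additions for illustration, not part of the original formula.

```python
import numpy as np

def log_loss_manual(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; eps clipping is an added safeguard against log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
cases = {
    "confident and right": np.array([0.9, 0.9, 0.1, 0.1]),
    "undecided":           np.array([0.5, 0.5, 0.5, 0.5]),
    "confident and wrong": np.array([0.1, 0.1, 0.9, 0.9]),
}
losses = {name: log_loss_manual(y, p) for name, p in cases.items()}
for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

The three losses are -log(0.9), log(2), and -log(0.1), so the confidently wrong predictions cost roughly twenty times more than the confidently right ones.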

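3. Gradient Descent

Weight update rule:
\( w \leftarrow w - \alpha \cdot \frac{1}{m} X^T (h_w(X) - y) \)
Key parameters:
- Learning rate \( \alpha \): controls the step size and needs tuning; too large a value causes oscillation or divergence, too small a value makes convergence slow.
- Number of iterations: usually capped by a maximum and combined with a convergence tolerance (such as a tol parameter) that stops training once further updates become negligible.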
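The update rule and the stopping condition can be sketched together as follows. This is one possible reading of a tolerance-based stop (gradient norm below `tol`); the function name `fit_with_tolerance`, the toy data, and the chosen defaults are assumptions for illustration.

```python
import numpy as np

def fit_with_tolerance(X, y, lr=0.5, max_iter=10000, tol=1e-6):
    """Batch gradient descent that stops once the gradient norm falls below tol.

    Assumes X already has a leading column of ones for the intercept.
    Returns the weights and the number of iterations actually run.
    """
    m, n = X.shape
    w = np.zeros(n)
    for it in range(1, max_iter + 1):
        h = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        grad = X.T @ (h - y) / m            # gradient of the log loss
        if np.linalg.norm(grad) < tol:      # converged: stop early
            return w, it
        w -= lr * grad
    return w, max_iter

# Toy data (hypothetical): one noisy feature, so the classes overlap.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones(200), x])

w, n_iter = fit_with_tolerance(X, y)
print("weights:", w, "iterations:", n_iter)
```

Stopping on the gradient norm is only one choice; checking the change in the loss between iterations is another common criterion.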

III. Implementing Logistic Regression in Python

1. Core Code by Hand

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

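def logistic_regression(X, y, lr=0.01, num_iter=10000):
    """Fit logistic regression by batch gradient descent; returns [w0, w1, ..., wn]."""
    m, n = X.shape
    X = np.concatenate((np.ones((m, 1)), X), axis=1)  # prepend a column of ones for the intercept
    w = np.zeros(n + 1)

    for _ in range(num_iter):
        z = np.dot(X, w)                     # linear scores
        h = sigmoid(z)                       # predicted probabilities
        gradient = np.dot(X.T, (h - y)) / m  # gradient of the log loss
        w -= lr * gradient
    return w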
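One way to gain confidence in the gradient expression used in the loop is to compare it against a finite-difference approximation of the loss. The sketch below does that on random data; the helper names and the data are illustrative assumptions, not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, X, y):
    """Log loss for weights w (X already includes the column of ones)."""
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w, X, y):
    """Gradient used in the training loop: X^T (h - y) / m."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numeric_gradient(w, X, y, step=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = step
        grad[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * step)
    return grad

rng = np.random.default_rng(0)
X_chk = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_chk = rng.integers(0, 2, size=50).astype(float)
w_chk = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(w_chk, X_chk, y_chk)
                         - numeric_gradient(w_chk, X_chk, y_chk)))
print("largest difference between the two gradients:", max_diff)
```

A difference far below the size of the gradient itself indicates that the analytic formula matches the loss being minimized.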

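2. Visualizing the Decision Boundary

import matplotlib.pyplot as plt

# Assumes w holds the trained weights [w0, w1, w2] and X has two feature columns
x1 = np.linspace(X[:,0].min(), X[:,0].max(), 100)
x2 = -(w[0] + w[1] * x1) / w[2]  # solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)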

plt.scatter(X[:,0], X[:,1], c=y)
plt.plot(x1, x2, 'r-')
plt.show()

IV. Scikit-learn Implementation

from sklearn.linear_model import LogisticRegression

# Create the model (L2 regularization; a smaller C means stronger regularization)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
print("Accuracy:", model.score(X_test, y_test))
print("Weights:", model.coef_)
print("Intercept:", model.intercept_)
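To illustrate what the `C` parameter does, the following sketch fits the same model with three values of `C` and compares the size of the learned weight vector. The dataset and variable names are made up for the demonstration, and the `penalty` argument is left at its default (L2).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for the demonstration (not the article's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))  # L2 norm of the weight vector
    print(f"C={C:>6}: ||w|| = {norms[C]:.3f}")
```

`C` is the inverse of the regularization strength, so small values shrink the weights toward zero and large values approach the unregularized fit.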

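V. Common Questions

1. Why is logistic regression called "regression"?

Historical reason: it is used for classification, but its mathematical form comes from linear regression, since the log-odds of the positive class are modeled by the linear combination \( w^T x \).
Key difference: the sigmoid function turns that linear output into a probability, which is what makes the model suitable for classification.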

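2. Why use log loss instead of mean squared error?

Mathematical reason: combined with the sigmoid, mean squared error gives an objective that is non-convex in \( w \), so gradient descent is not guaranteed to reach the global optimum.
Probabilistic interpretation: log loss is the negative log-likelihood of the labels under a Bernoulli model, so it directly measures the mismatch between the predicted and the true distribution.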
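The non-convexity claim can be checked on the smallest possible case. The sketch below uses a single training sample with x = 1 and y = 1 (an assumption chosen for simplicity) and tests the midpoint inequality that every convex function must satisfy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample with x = 1 and y = 1, so the model output is sigmoid(w).
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def log_loss(w):
    return -np.log(sigmoid(w))

a, b = -10.0, -2.0
mid = (a + b) / 2

# A convex f satisfies f(mid) <= (f(a) + f(b)) / 2 for every pair a, b.
mse_gap = mse_loss(mid) - (mse_loss(a) + mse_loss(b)) / 2
log_gap = log_loss(mid) - (log_loss(a) + log_loss(b)) / 2

print("MSE gap:", mse_gap)       # positive: the convexity inequality is violated
print("log-loss gap:", log_gap)  # non-positive: consistent with convexity
```

The squared error flattens out where the sigmoid saturates, which is what produces the violation; the log loss keeps a usable slope in the same region.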

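3. Limitations of logistic regression

Linear decision boundary: it cannot directly handle data that is not linearly separable (adding polynomial features is one remedy).
Multiclass extension: requires One-vs-Rest or softmax regression (multinomial logistic regression).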
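The polynomial-feature remedy can be sketched on two concentric circles, a standard example of data that no straight line separates. The dataset parameters and variable names below are assumptions for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: class 0 on the outer ring, class 1 on the inner ring.
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# Plain logistic regression: the boundary is a straight line.
linear_clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)

# Degree-2 features add x1^2, x1*x2 and x2^2, so the boundary can be a circle.
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
).fit(X_c, y_c)

acc_linear = linear_clf.score(X_c, y_c)
acc_poly = poly_clf.score(X_c, y_c)
print(f"raw features:      {acc_linear:.3f}")
print(f"degree-2 features: {acc_poly:.3f}")
```

The model is still linear in its parameters; only the feature space has changed, which is why the same training procedure applies.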

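VI. Applications

Binary classification: for example, spam detection and disease diagnosis.
Probability estimation: for example, the probability that a user clicks an ad or that a credit card holder defaults.
Feature importance analysis: the sign of each weight gives the direction of a feature's effect and its magnitude gives the strength, provided the features are on comparable scales (for example, standardized).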
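As a rough sketch of reading weights as feature effects, the example below builds synthetic data in which only the first feature influences the label and then inspects the fitted coefficients. The data-generating process and all names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X_syn = rng.normal(size=(n, 2))  # both features are already on a unit scale

# Synthetic ground truth: log-odds = 2 * x0, and x1 is pure noise.
p_true = 1.0 / (1.0 + np.exp(-2.0 * X_syn[:, 0]))
y_syn = (rng.random(n) < p_true).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
coefs = clf.coef_[0]
odds_ratios = np.exp(coefs)  # multiplicative change in the odds per unit increase

for j, (c, o) in enumerate(zip(coefs, odds_ratios)):
    print(f"feature {j}: weight = {c:+.3f}, odds ratio = {o:.3f}")
```

An odds ratio near 1 suggests little effect, while a value well above or below 1 indicates a strong positive or negative association. These readings describe association in the fitted model, not causation.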

VII. Complete Code and Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

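# ======================
# 1. Generate a synthetic dataset
# ======================
X, y = make_classification(
    n_samples=500,  # number of samples
    n_features=2,   # number of features
    n_redundant=0,  # no redundant features
    n_clusters_per_class=1,  # one cluster per class
    random_state=42
)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize with training-set statistics only (speeds up gradient descent, avoids test-set leakage)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_test = (X_train - mean) / std, (X_test - mean) / std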

# ======================
# 2. Logistic regression by hand
# ======================
class LogisticRegressionManual:
    def __init__(self, learning_rate=0.01, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
        self.loss_history = []

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self._sigmoid(linear_model)

            # Compute gradients of the log loss
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Record the loss (clipping avoids log(0) if predictions saturate)
            p = np.clip(y_pred, 1e-15, 1 - 1e-15)
            loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
            self.loss_history.append(loss)

    def predict_proba(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        return self._sigmoid(linear_model)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

# Train the hand-written model
manual_model = LogisticRegressionManual(learning_rate=0.1, n_iters=2000)
manual_model.fit(X_train, y_train)

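# ======================
# 3. Scikit-learn implementation
# ======================
# penalty=None disables regularization (scikit-learn >= 1.2; the string 'none' was removed in 1.4)
sklearn_model = LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000)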
sklearn_model.fit(X_train, y_train)

# ======================
# 4. Visualize the results
# ======================
def plot_decision_boundary(model, X, y, title):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    
    # Predict the class of every grid point
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap='coolwarm')
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plot_decision_boundary(manual_model, X_train, y_train, 'Manual Logistic Regression')

plt.subplot(1, 2, 2)
plot_decision_boundary(sklearn_model, X_train, y_train, 'Scikit-learn Logistic Regression')

plt.tight_layout()
plt.show()

# ======================
# 5. Model evaluation
# ======================
# Evaluate the hand-written model
y_pred_manual = manual_model.predict(X_test)
print("Manual implementation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_manual):.4f}")
print(classification_report(y_test, y_pred_manual))

# Evaluate the scikit-learn model
y_pred_sklearn = sklearn_model.predict(X_test)
print("\nScikit-learn model:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_sklearn):.4f}")
print(classification_report(y_test, y_pred_sklearn))

# ======================
# 6. Training loss curve
# ======================
plt.figure(figsize=(8, 5))
plt.plot(manual_model.loss_history, label='Training Loss')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Loss During Training (Manual Implementation)')
plt.legend()
plt.show()