当前位置: 首页 > article >正文

INT305 Machine Learning

W1 Introduction

Nearest Neighbor

Preliminaries and Nearest Neighbor Methods
•    Suppose we’re given a novel input vector 𝑥 we’d like to classify.
•    The idea: find the nearest input vector to 𝑥 in the training set and copy its label.
•    Can formalize “nearest” in terms of Euclidean distance

Input Vectors

Common strategy: represent the input as an input vector in ℝ𝑑
➢    Representation = mapping to another space that’s easy to manipulate
➢    Vectors are a great representation since we can do linear algebra!

Mathematically, our training set consists of a collection of pairs of an input vector 𝑥 ∈ ℝ𝑑 and its corresponding target, or label, 𝑡
➢    Regression: 𝑡 is a real number (e.g., stock price)
➢    Classification: 𝑡 is an element of a discrete set {1, ⋯ , 𝐶}
➢    These days, 𝑡 is often a highly structured object (e.g. image)

Decision Boundaries

the boundary between regions of input space assigned to different categories.

We can visualize the behavio in the classification setting using a Voronoi diagram

k-Nearest Neighbors

Nearest neighbors sensitive to noise or mis-labeled data (“class noise”),to solve this, we use Smooth by having k nearest neighbors vote

𝕀 𝑠𝑡𝑎𝑡𝑒𝑚𝑒𝑛𝑡 is the identity function and is equal to one whenever the statement is true. We could also write this as 𝛿 𝑡(𝑧), 𝑡(𝑖) , with 𝛿 𝑎, 𝑏 = 1 if 𝑎 = 𝑏, 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒.

steps

1. KNN通常使用某种距离度量来找到与目标点最接近的k个邻居。最常见的距离度量包括1. 欧几里得距离(Euclidean Distance),2.曼哈顿距离(Manhattan Distance)3.曼哈顿距离(Manhattan Distance)

2. 选择k: 在计算出与目标点的距离后,选择距离最近的k个邻居。

3. 投票机制
对于分类任务,KNN通过多数投票来决定目标点的类别。每个邻居对目标点的类别投一票,最终选择票数最多的类别。
对于回归任务,KNN通常计算k个邻居的平均值作为预测值。

4. 预测公式:分类或者回归

k choosing

𝑘是一个超参数 hyperparameter 的例子,我们不能把它作为学习算法本身的一部分。

•    Small 𝑘:细粒度,过拟合
        ➢    Good at capturing fine-grained patterns.
        ➢    May overfit, i.e. be sensitive to random idiosyncrasies in the training data.
•    Large 𝑘:稳定结果,欠拟合
        ➢    Makes stable predictions by averaging over lots of examples.
        ➢    May underfit, i.e. fail to capture important regularities.

•    Balancing 𝑘

        ➢ Optimal choice of 𝑘 depends on number of data points 𝑛.
        ➢ Nice theoretical properties if 𝑘 → ∞ and 𝑘/𝑛 → 0.
        ➢ Rule of thumb(经验法则): choose 𝑘 < 𝑛.
        ➢ We can choose 𝑘 using validation set (验证集). 注意,测试集仅在最后使用,以测量最终配置的泛化性能。

Generalization
• We would like our algorithm to generalize to data it hasn’t seen before.
• We can measure the generalization error (error rate on new examples) using a test set.

W2: Linear Methods for Regression, Optimization

•    Second learning algorithm of the course: linear regression.
➢    Task: predict scalar-valued targets (e.g. stock prices)
➢    Architecture: linear function of the inputs
 

process

1. choose a model describing the relationships between variables of interest

2. define a loss function quantifying how bad the fit to the data is

3. choose a regularizer saving how much we prefer different candidate models (orexplanations of data)
4. fit a model that minimizes the loss function and satisfies the constraint/penaltyimposed by the regularizer, possibly using an optimization algorithm

Supervised Learning Set up

In supervised learning:

•    There is input 𝐱 ∈ 𝒳, typically a vector of features (or covariates)
•    There is target 𝑡 ∈ 𝒯, (also called response, outcome, output, class)
•    Objective is to learn a function 𝑓: 𝒳 → 𝒯 such that 𝑡 ≈ 𝑦 = 𝑓 𝐱 based on some data 𝒟 = { 𝐱 𝑖 , 𝑡 𝑖 𝑓𝑜𝑟 𝑖 = 1, 2, ⋯ , 𝑁}

Linear Regression

• 𝑡(𝑖) ≈ 𝑦 𝑖 = 𝐰⊺𝐱 𝑖 + 𝑏
• Different 𝐰, 𝑏 define different lines.

Model

Model: In linear regression, we use a linear function of the features 𝐱 = 𝑥1, ⋯ , 𝑥𝐷 ∈ ℝ𝐷 to make predictions 𝑦 of the target value 𝑡 ∈ ℝ:

• 𝐰和𝑏都是参数

• 我们希望我们的预测y接近目标t

Loss Function

A loss function ℒ(𝑦, 𝑡) defines how bad it is if, for some example 𝐱, the algorithm predicts 𝑦, but the target is actually 𝑡. 其实就是判断预测的好不好的标准

Cost Function 成本函数

loss function averaged over all training examples.

成本函数是一个通用的概念,用于衡量模型预测的性能。它是一个函数,通过对模型的参数进行优化来最小化预测误差。不同的机器学习任务可以使用不同的成本函数。常见的成本函数包括:

  • 对于回归问题:均方误差(MSE)或平均绝对误差(MAE)。
  • 对于分类问题:交叉熵损失(Cross-Entropy Loss)或对数损失(Log Loss)。
Squared error loss function 均方误差

均方误差是用于评估模型预测与实际观察值之间差异的指标

Vectorization 向量化

优点:方程和代码,将是简单和更易读的。摆脱了假变量/索引! •向量化代码要快得多

cost function:

步骤

1. We can organize all the training examples into a design matrix 𝐗 with one row per training example, and all the targets into the target vector 𝐭.

2. 计算整个数据集的预测结果:

3. Computing the squared error cost across the whole dataset:

p.s The minimizer does not depend on 𝑁 (but optimization might!).

p.s We can also add a column of 1’s to design matrix, combine the bias and the weights,
and conveniently write

Then, our predictions reduce to 𝐲 = 𝐗𝐰.

Solving the Minimization Problem

Direct Solution I: Linear Algebra

➢    To show 𝑧∗ minimizes 𝑓(𝑧), show that ∀𝑧, 𝑓 𝑧 ≥ 𝑓 𝑧∗
➢    To show that 𝑎 = 𝑏, show that 𝑎 ≥ 𝑏 and 𝑏 ≥ 𝑎

Direct Solution II: Calculus

光滑函数的最小值(如果存在)出现在临界点critical point,即导数为零的点。

多元泛化Multivariate generalization:将偏导数partial derivatives设为零(或等价于梯度)。

(具体证明记得后面补!)

Polynomial Feature Mapping

The relation between the input and output may not be linear.

We can still use linear regression by mapping the input features to another space using feature mapping (or basis expansion). 𝜑 𝐱 : ℝ𝐷 → ℝ𝑑 and treat the mapped feature (in ℝ𝑑 ) as the input of a linear regression procedure.

如果关系不是线性的,我们可以拟合一个多项式。

Regularization 正则化

Restricting the number of parameters / basis functions (M) is a crude approach to control the model complexity.
Another approach: keep the model large, but regularize it.
Regularizer正则化器:一个量化我们对一个假设的偏好程度的函数。

L2 Ridge Regression岭回归

这迫使学习算法不仅要拟合数据,还要使模型权重尽可能小,输出平滑smooth模型

We can encourage the weights to be small by choosing as our regularizer the 𝐿2 penalty.


注意:准确地说,𝐿2范数是欧氏距离,我们正在正则化平方𝐿2范数。

The regularized cost function makes a tradeoff between fit to the data and the norm of the weights.

Gradient Descent

gradient descent : to minimize the cost function which is more broadly applicable

We initialize the weights to something reasonable (e.g., all zeros) and repeatedly adjust
them in the direction of steepest descent.

证明过程后面要好好补!

for L2

Stochastic Gradient Descent 随机梯度下降


Learning Rate

In stochastic training, the learning rate also influences the fluctuations, due to thestochasticity of the gradients. 

Typical strategy:

1) Use a large learning rate early in training so you can get close to the optimum

2)Gradually decay the learning rate to reduce the fluctuations

W3 Linear Classifiers, Logistic Regression, Multiclass Classification

Classification: predicting a discrete-valued target
Binary classification: predicting a binary-valued target
Multiclass classification: predicting a discrete(>2)-valued target

Binary classification

predicting a binary-valued target

•    Classification: given a 𝐷-dimensional input 𝐱 ∈ ℝ𝐷 predict a discrete-valued target
•    Binary: predict a binary target 𝑡 ∈ {0,1}  

         Training examples with 𝑡 = 1 are called positive examples, and training examples with 𝑡 = 0 are called negative examples
➢    𝑡 ∈ {0,1} or 𝑡 ∈ {−1, +1} is for computational convenience.

Linear: model prediction 𝑦 is a linear function of 𝐱, followed by a threshold 𝑟:


use case:

Predict whether a patient has a disease, given the presence or absence of various symptoms
Classify e-mails as spam or non-spam
Predict whether a financial transaction is fraudulent

how to find w

If training set is linearly separable, we could solve for w using linear programming
        >We could also apply an iterative procedure known as the perceptron algorithm (butthis is primarily of historical interest).
lf it's not linearly separable, the problem is harder
        > Data is almost never linearly separable in real life                                                                  

Logistic Regression

用于二分类问题,例如判断电子邮件是否为垃圾邮件、客户是否会流失等。

定义损失函数,然后尝试最小化结果成本函数

Instead: define loss function then try to minimize the resulting cost function

Recall : cost is loss averaged (or summed) over the training set

0-1 loss

Optimization: NP-hard

This is due to the step function (0-1 loss) not being nice (continuous/smooth/convex)

Linear Regression --

用于预测连续的数值输出。例如,预测房价、温度等。

有时候我们可以用一个更容易优化的函数来代替我们关心的损失函数。这被称为平滑替代损失函数(surrogate loss function) 的松弛。

Activation Function

Sigmoid

Gradient Descent for Logistic Regression

Multiclass Classification -- 

Multiclass classification: predicting a discrete(>2)-valued target

Targets form a discrete set 1, ⋯ , 𝐾 .It’s often more convenient to represent them as one-hot vectors, or a one-of-K encoding

Softmax

cross-entropy loss

Softmax Regression

W4 Support Vector Machine, SVM Loss and Softmax Loss

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification andregression tasks. lt is particularly effective in high-dimensional spaces and is well-suited for scenarios where thereis a clear margin of separation between classes.

The basic idea behind SVM is to find the hyperplane that best divides a dataset into two classes

The term "support vector" refers to the data points that lie closest to the decision boundary, and these points arecrucial in determining the optimal hyperplane.

The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points ofeach class.

Binary Classification with a Linear Model

•    Classification: Predict a discrete-valued target
•    Binary classification: Targets 𝑡 ∈ {−1, +1}
•    Linear model:

𝑧 = 𝐰⊺ + 𝑏 (hyperplane)

𝑦 = sign(𝑧)

Optimal Separating Hyperplane: A hyperplane that separates two classes and maximizes the distance to the closest point from either class, i.e., maximize the marginof the classifier.

W* 是法向量的单位化形式

Maximizing Margin as an Optimization Problem

正确分类的条件

这个超平面应确保:最大化两类之间的“间隔”(也就是两类样本到分界线的最小距离)。

为了找到这个最佳分界线,我们关注的主要是那些“最接近”分界线的样本。
在数学上,我们称这些“离分界线最近的样本”为“支持向量support vectors”,并且这些样本的代数间隔(algebraic margin)为1。所谓的“代数间隔”为1是指它们距离分界线的距离达到SVM的要求,是边界上最接近分界线的点。

SVM-like algorithms are often called max-margin or large-margin.

Soft Margin constraint

在实际数据中,可能无法找到一个能够完美分隔所有数据的超平面。例如,数据可能有噪声或部分重叠,导致硬间隔(Hard Margin)SVM无法找到一个“完全正确分开”数据的决策边界。这时,我们需要允许某些数据点“放松”一下约束,即可以容忍某些点:

  • 在间隔以内(接近或位于分界线附近)。
  • 被误分类(在错误的一侧)。

因此,软间隔支持向量机允许某些点不完全遵守硬间隔的规则,通过引入“slack variables 松弛变量”来表示这种偏离。

SVM引入了一个惩罚机制,来控制这些松弛变量的总量,从而在最大化间隔和最小化误分类率之间找到一个平衡。这个惩罚机制通常通过一个参数 C 来控制:

  • 较大的 C 倾向于减少松弛变量的数量,使模型更加注重正确分类。
  • 较小的 C 则允许更多点进入间隔内甚至被误分类,使模型更加宽容,通常可以增加泛化能力。

最终,软间隔支持向量机的目标是在最大化间隔的同时,尽量减少总的松弛量(即违背间隔约束的程度),从而得到一个更具鲁棒性的分类

优化目标

 

软间隔 SVM 可以看作是带有 hinge loss 和 L2正则化项的线性分类器。

  • Hinge Loss: 提高分类精度,限制点接近或跨越分类边界。
  • L2 正则化: 增大间隔,减少模型复杂度。

Multiclass SVM Loss

Softmax

Softmax vs SVM

W5 Neural Network and Back Propagation

W6 Convolutional Neural Network

Convolutionall Layer

 A feature map is obtained by repeated application of a function across sub-regions of the entire image, in other words, by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function.

Convolution layer

A feature map is obtained by repeated application of a function across sub-regions of the entire image, in other words, by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function.

这里我们就只考虑计算的事情了 

If we denote the k-thfeature map at a given layer as hk, whose filters are determined by the weights wk and bias bk, then the feature mapis obtained using convolution as follows (fortanhnon-linearities):

Suppose that we have some NxN square neuron layer which is followed by our convolutional layer. If we use an mxm filter ω, our convolutional layer output will be of size (N−m+1)x(N−m+1).
 

Convolution Kernel

卷积核的计算就是讲各个位置上卷积核的元素和input的对应元素相乘再求和

Padding

在进行卷积层的处理之前,又是要向输入数据的周围填入固定的数据,比如0

Stride

应用滤波器的位置间隔

calculation*

非常重要!

input size(H,W)

Fliter size(FH, FW)

Outputsize(OH,OW)

padding P

stride S

channel C

The number of filters (also the number of output channels) K

output size

OH = (H+2P-FH)/S +1

OW = (W+2P-FW)/S +1

但是一般情况下其实input,output 和 filter 都是正方形的

so if H =W, FH = FW = F

O = (H + 2P -F)/S +1

parameters

(FH​×FW​×Cin​+1)×K (1,if there exists bias)

The number of parameters is determined by the filter size and the number of filters, regardless of the stride

e.g

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Output volume size:
(32+2*2-5)/1+1=32 spatially, so 32x32x10

Number of parameters in this layer
Each filter has 5*5*3 +1 = 76 params => 76*10=760

Fully connected layer

全连接层之前就有提到过概念,这里就说一下计算

parameter

(Nin​×Nout​)+Nout​(bias)

Non-linearity Layer

不止relu, 常用的还有 sigmoid, Tanh

RELU (Rectified Linear Units)

ReLU(x)=max(0,x)

即当输入为正数时输出保持不变,而当输入为负数时输出为零。ReLU具有以下特点:

•    ReLUs are much simpler computationally
        –    The forward and backward passes through an ReLU are both just a simple if statement
        –    The sigmoid activation requires computing an exponent
        –    This advantage is huge when dealing with big networks with many neurons, and can significantly reduce both training and evaluation times

•    Sigmoid activations are easier to saturate
        –    There is a comparatively narrow interval of inputs for which the sigmoid's derivative is sufficiently nonzero
        –    In other words, once a sigmoid reaches either the left or right plateau, it is almost meaningless to make a backward pass through it, since the derivative is very close to 0
•    ReLUs only saturate when the input is less than 0
        –    Even this saturation can be eliminated using leaky ReLUs
•    For very deep networks, saturation hampers learning, and so ReLUs provide a nice workaround

不过,ReLU也有一些缺点,比如“死亡ReLU”现象,指的是当神经元输入总是小于0时,它将无法更新。

Sigmoid 

1/(1+e^-x)

Sigmoid maps any real-valued input to a value between 0 and 1. The function squashes large positive values towards 1 and large negative values towards 0.

Computation: Sigmoid requires calculating an exponential function for each input, which is more computationally expensive compared to ReLU.

Frequently used in the output layer for binary classification tasks, where the output needs to be a probability (between 0 and 1). For example, in logistic regression and binary classification neural networks.

calculation 

总之输入大小是多大就是多少,每个都得算

OH​×OW​×K

Pooling Layer

Pooling layer(池化层)通常用于卷积神经网络(CNNs)中,主要功能是降低特征图的尺寸,从而减少计算量、内存使用并控制模型的过拟合。Pooling层通过对输入特征图的局部区域进行降采样,提取重要的特征,忽略不重要的细节(preserves important information while discarding irrelevant detail)。实现对位置或光照条件变化的不变性、对杂乱的鲁棒性和表示的紧凑性,都是池化的共同目标.

Subsampling (pooling) Mechanism

    The exact positions of the extracted features are not important
    Only relative position of a feature to another feature is relevant
    Reduce spatial resolution – Reduce sensitivity to shift and distortion

常见的Pooling操作有以下几种:

  1. Max Pooling(最大池化)

    • 选择局部区域内的最大值作为输出。
    • 主要用于保留最显著的特征,同时减少特征图的尺寸。
  2. Average Pooling(平均池化)

    • 计算局部区域内的平均值作为输出。
    • 更倾向于平滑特征图,降低局部噪声。
  3. Global Pooling(全局池化)

    • 对整个特征图进行池化,通常用于输出一维向量(如分类任务的最后一层)。

Pooling layers do not have learnable parameters (no weights or biases). They perform a fixed operation (max or average) on the input feature map and reduce its spatial size.

Fully Connected Layer (FC layer)

主要用于特征提取后的信息融合和任务决策(如分类、回归等)

Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

工作原理

  • 线性变换:通过权重矩阵 W 和偏置 b,将输入特征映射到一个新的特征空间。
  • 非线性激活:常见激活函数(如 ReLU、Sigmoid、Tanh)用于引入非线性特性,使网络能够拟合复杂的函数关系。

比如,对于输入向量 x,全连接层的输出为:

a = Activation(Wx+b)

由于全连接层参数量大,容易导致过拟合和计算效率低,因此通常会有以下改进:

  1. Dropout:在训练过程中随机丢弃部分连接,防止过拟合。
  2. Batch Normalization:对输入进行归一化,稳定训练过程。

CaseStudy

LeNet-5,AlexNet,ZFNet, VGGNet, GoogleNet, ResNet

Lecture7 Dicision Trees & Bias-Variance Decomposition

Decision Trees

Make predictions by splitting on features according to a tree structure.

Internal nodes test a feature
Branching is determined by the feature value
Leaf nodes are outputs (predictions)

* Descrete Features

Miscellany:
•    Problems:
➢    You have exponentially less data at lower levels
➢    Too big of a tree can overfit the data
➢    Greedy algorithms don’t necessarily yield the global optimum
•    Handling continuous attributes
➢    Split based on a threshold, chosen to maximize information gain

Learning

Resort to a greedy heuristic:
➢ Start with the whole training set and an empty decision tree.
➢ Pick a feature and candidate split that would most reduce the loss.
➢ Split on that feature and recurse on subpartitions.

Constructing & Evaluation

Idea: use counts at leaves to define probability distributions; use a probabilistic notion of uncertainty to decide splits.

Choose them based on how much information we would gain from the decision!

Quantifying Uncertainty -- entropy

The entropy of a discrete random variable is a number that quantifies the uncertainty inherent in its possible outcomes. Can also think of entropy as the expected information content of a random draw from a probability distribution.

the entropy of a discrete random variable 𝑌 is given by

•    “High Entropy”
➢    Variable has a uniform like distribution over many outcomes
➢    Flat histogram
➢    Values sampled from it are less predictable
•    “Low Entropy”
➢    Distribution is concentrated on only a few outcomes
➢    Histogram is concentrated in a few areas
➢    Values sampled from it are more predictable

proproties

𝐻 is always non-negative
➢ Chain rule: 𝐻 (𝑋, 𝑌) = 𝐻 (𝑋| 𝑌) + 𝐻 (𝑌) = 𝐻 (𝑌| 𝑋) + 𝐻(𝑋)
➢ If 𝑋 and 𝑌 independent, then 𝑋 does not affect our uncertainty about 𝑌: 𝐻 (𝑌| 𝑋) = 𝐻(𝑌)
➢ But knowing 𝑌 makes our knowledge of 𝑌 certain: 𝐻(𝑌|𝑌) = 0
➢ By knowing 𝑋, we can only decrease uncertainty about 𝑌: 𝐻(𝑌|𝑋) ≤ 𝐻(𝑌)

e.g 

Joint Distribution

Conditional Entropy

The expected conditional entropy

1/4 是p(is raining), 3/4 是p(not raining)

Information Gain  信息增益

是一种用于衡量某个特征对目标变量的贡献程度的指标,广泛应用于决策树构建等机器学习领域。它的核心思想是通过选择某个特征来划分数据,可以使目标变量的不确定性(通常以熵表示)减少多少。

Information gain 𝐼𝐺 (𝑌| 𝑋) in 𝑌 due to 𝑋, or the mutual information of 𝑌 and 𝑋

If 𝑋 is completely uninformative about 𝑌: 𝐼𝐺 (𝑌|X) =0
If 𝑋 is completely informative about 𝑌: 𝐼𝐺 (𝑌|X) = H(Y)

ps(Y|left) = -(3/4log2(3/4) + 1/4log2(1/4) ) ≈ 0.81

Bias Variance Decomposition(偏差-方差分解)--

Suppose the training set 𝓓 consists of pairs (𝒙𝑖, 𝑡𝑖) sampled independent and identically distributed (i.i.d) from a single data generating distribution 𝑝sample.
帮助我们理解如何在模型训练过程中权衡偏差和方差。通过合理控制模型的复杂度,我们可以在避免欠拟合和过拟合之间找到最佳的平衡,提升模型的泛化能力。

是机器学习中用于分析模型误差来源的一种方法,特别是在监督学习中,它帮助我们理解如何平衡 偏差(Bias)方差(Variance) 以提高模型的泛化能力。其核心思想是将模型的总误差分解为三个部分:

Basic Setup

Bayes Optimality

这段内容推导了在回归任务中偏差-方差分解的数学原理,并证明了为什么将预测值设为条件期望 E[t∣x]\mathbb{E}[t|x]E[t∣x] 是最佳选择,同时将误差分为三部分:偏差(bias)方差(variance)Bayes error。以下是逐步解释:

误差分解公式

将误差分解为 Bias 和 Variance

Lecture8 Bagging & Boosting

Bias and Variance

 affect the three terms of the expected loss

Bagging

是一种统计学方法,用于评估统计量(如均值、方差等)的分布和置信区间,尤其是在样本量较小或分布未知时。其核心思想是通过重复抽样来模拟样本的分布特性。

aim: train a single model on the union of all sampled datasets

Solution: given training set D, use the empirical distribution pD as a proxy for psample.This is called bootstrap aggregation, or bagging
        > Take a single dataset D with n examples.
        > Generate m new datasets (“resamples" or “bootstrap samples"), each by sampling n training examples from D, with replacement.
        >Average the predictions of models trained on each of these datasets.
The bootstrap is one of the most important ideas in all of statistics!
        >Intuition:As |D|→∞, we have pD →psample.

Effect on Hypothesis Space(假设空间)

bagging does not affect bias. But it can change the hypothesis space / inductive bias.
 

For Binary Classification

Effect of Correlation

Random Forests

随机森林是一种集成学习方法,它通过构建多个决策树并将它们的预测结果进行集成(投票或平均)来提高模型的准确性和鲁棒性。它属于**装袋(Bagging)**方法,即通过自助法(bootstrap)从训练数据中生成多个子集,并在这些子集上训练多个决策树

Boosting

增强通过生成弱分类器集合来减少偏差。训练每个分类器以减少先前集成的错误。它对过度拟合具有相当的弹性,尽管它可能会过度拟合。

Boosting是一种集成学习(Ensemble Learning)方法,通过将多个弱分类器(或回归器)组合成一个强分类器,来提高模型的准确性和性能。Boosting的核心思想是逐步构建模型,每次构建时都着重改进前一个模型的错误,从而提升最终预测的准确性

在Boosting算法中,每一轮训练都会使用加权训练集,这个权重的更新方式是关键所在。Boosting通过调整每个样本的权重,逐步聚焦在模型预测错误的样本上,从而改进模型的表现。

AdaBoost(Adaptive Boosting) 要会算!

这是最经典的Boosting算法。AdaBoost通过调整样本的权重,使得每次训练的弱学习器都更关注那些前一轮分类错误的样本。最后,通过加权投票或加权平均将各个弱分类器的预测结果结合起来。

Keystep
        1.    At each iteration, re-weight the training samples by assigning larger weights to samples (i.e., data points) that were classified incorrectly.
        2.    Train a new base classifier based on the re-weighted samples.
        3.    Add it to the ensemble of classifiers with an appropriate weight.
        4.    Repeat the process many times.
•    Requirements for base classifier:
        ➢    Needs to minimize weighted error.
        ➢    Ensemble may get very large, so base classifier must be fast. It turns out that any so-called weak learner /classifier suffices.

Individually, weak learners may have high bias (underfit). By making each classifier focus on previous mistakes, AdaBoost reduces bias.

计算

StagewiseTraining of Additive Models

Additive Models 加性模型

是一类统计模型,其基本特点是将不同特征的影响进行加性组合,假设每个输入特征对输出的贡献是独立的,并且对整体预测结果的贡献是通过加法的方式累积的。加性模型在统计学和机器学习中广泛应用,尤其在建模非线性关系时表现良好。

  • 加性关系:输出是所有特征函数的和,假设特征之间的影响是独立的。
  • 非线性建模:通过将每个特征的影响建模为非线性函数(例如,使用平滑函数、样条等),加性模型能够捕捉复杂的非线性关系。
  • 灵活性:由于每个特征的函数是单独建模的,模型具有很高的灵活性,能处理不同特征对目标变量的非线性影响。

StagewiseTraining

Additive Models with Exponential Loss

Bagging vs Boosting

both are ensembles combine classifiers to improve performance

Boosting
        Reduces bias
        Increases variance (large ensemble can cause overfitting
        Sequential
        High dependency between ensemble elements
Bagging
        Reduces variance (large ensemble can't cause overfitting)
        Bias is not changed (much)
        Parallel
        Want to minimize correlation between ensemble elements.

选择 Bagging 还是 Boosting?

  • Bagging
    • 数据噪声较大,易过拟合的场景。
    • 偏向降低方差。
    • 适合用 Random Forest 等算法。
  • Boosting
    • 数据比较干净,模型欠拟合。
    • 偏向降低偏差。
    • 适合用 Gradient Boosting 或 XGBoost 等算法。

Lecture9 Probabilistic Models

Bernoulli random variable 伯努利随机变量

It’s sensible to assume that{𝑥1, . . . , 𝑥𝑁}are independent and identically distributed (i.i.d.) Bernoullis.

 likelihood function

maximum likelihood criterion 最大似然准则*

在给定数据的情况下,找到模型参数的最优估计值,使得这些参数下观测到数据的概率最大

minimizing cross-entropies

> define a model that assigns a probability (or has a probability density at) to adataset
>maximize the likelihood (or minimize the neg. log-ikelihood).

转化成对数是为了方便计算

Discriminative vs Generative  approach

Discriminative

直接从标记的示例中估计决策边界/类分隔符 decision boundary/ class separator 的参数。

判别方法专注于估计决策边界,即直接学习给定输入特征下如何将不同类别进行区分。它并不关心每个类别的具体生成过程,而是专注于找出不同类别之间的分界线(决策边界),通过优化这些分界线来进行分类。

基本思想:模型直接从标记的示例中学习,学习如何将不同类别区分开来。换句话说,判别模型学习的是类别条件概率 p(y∣x),即给定输入 x,属于某个类别 y的概率。

Generative

建立类的输入特征分布模型(贝叶斯分类器 Bayes classifier)。

生成方法试图通过建立每个类别的特征分布模型来进行分类。也就是说,它学习的是每个类别 y 对应的输入特征 x 的条件概率分布 p(x∣y)。通过这些分布,生成模型可以描述如何从某个类别生成数据。

基本思想:模型通过学习每个类别的生成过程,即如何生成数据,并使用贝叶斯定理来推断给定输入 x属于某一类别 y的概率。换句话说,生成模型学习的是类别的联合分布 p(x,y),然后通过贝叶斯公式得到后验概率 p(y∣x)。

Bayes Classifier

贝叶斯定理,它用于计算给定特征 x 时,样本属于某一类别 y的后验概率。

  • 先验(Prior):是我们在观察到任何数据之前,对一个参数或事件的信念或知识。先验反映了我们在没有数据支持下对模型参数的假设或估计。先验可以基于理论推导、历史数据或专家经验来设定。

    在贝叶斯统计中,先验通常表示为 P(θ),其中 θ 是我们要估计的未知参数。

  • 后验(Posterior):是根据观察到的数据来更新我们对参数或事件的信念。它是基于 先验似然函数 (即数据的可能性)通过贝叶斯定理计算得出的。

    后验通常表示为 P(θ∣X),其中 X 是观察到的数据。

贝叶斯定理的公式为:

  • P(c∣X):后验分布,即在观察到数据 X 后,参数 c 的条件概率。
  • P(X∣c):似然函数,即在给定参数 c 的情况下,观察到数据 X的概率。
  • P(c):先验分布,即在没有数据的情况下,参数c的概率。
  • P(X):边际似然(或证据),是观察到数据 X 的总概率,通常通过积分或求和得到,表示所有可能参数值的加权平均

贝叶斯定理的本质在于 “更新信念”。通过引入新的数据,我们能够在先验的基础上修正对参数的估计。换句话说,后验分布结合了先验信息和数据的信息,提供了一个更为精确的对参数 c 的估计。

Naive Bayes

Naive Bayes assumes that the word features 𝑥𝑖 are conditionally independent given the class 𝑐.

朴素贝叶斯的“朴素”假设就是特征之间是条件独立的,

参数量小:
Training time: estimate parameters using maximum likelihood

Test time: apply Bayes’ Rule

Bayesian network

We can represent this model using an directed graphical model, or Bayesian network:

Learning

Each of these log-likelihood terms depends on different sets of parameters, so they can be optimized independently.

Inference

Bayesian Parameter Estimation**

Bayesian Parameter Estimation 是一种统计方法,用于估计模型的参数,同时考虑参数的不确定性。它以 贝叶斯定理 为核心,结合数据的观测结果和先验知识,为参数提供一个概率分布

In maximum likelihood, the observations are treated as random variables, but the parameters are not.

The Bayesian approach treats the parameters as random variables as well. β is the set of parameters in the prior distribution of θ.

To define a Bayesian model, we need to specify two distributions:

> The prior distribution p(θ), which encodes our beliefs about the parameters before we observe the data

>The likelihood p(Dlθ),same as in maximum likelihood

MLE issue: Data Sparsity

是指在数据集中,大部分特征或值是零或缺失的现象。在许多实际应用中,数据集包含大量的特征或变量,但大多数样本只有少数特征是有意义的,其它特征的值为零或缺失。

Maximum A-Posteriori Estimation 最大后验估计

是一种统计估计方法,用于在给定观测数据的条件下,估计未知参数的最可能值。MAP 估计是在贝叶斯推断的框架下的扩展,它结合了先验知识和观测数据,通过最大化后验概率来得到参数估计。

AP 估计的目标是找到一个参数值 θ^MAP,使得后验概率最大,即:

Lecture10  k-Means and EM Algorithm

k-means

我寻思不会考,就简单写写

steps

Initialization: randomly initialize cluster centers
•    The algorithm iteratively alternates between two steps:
➢    Assignment step: Assign each data point to the closest cluster
➢    Refitting step: Move each cluster center to the mean of the data assigned to it

converge

K-means algorithm reduces the cost at each iteration.

➢    Whenever an assignment is changed, the sum squared distances 𝐽 of data points from their assigned cluster centers is reduced.
➢    Whenever a cluster center is moved, 𝐽 is reduced.

Test for convergence: If the assignments do not change in the assignment step, we have converged (to at least a local minimum). This will always happen after a finite number of iterations, since the number of possible cluster assignments is finite

这个算法可能会面临卡在局部最优

Soft k-means

Instead of making hard assignments of data points to clusters, we can make soft assignments. e.g One cluster may have a degree of assignment of .7 for a datapoint and another may have a responsibility of .3.

  • 分配方式:每个数据点属于所有簇的概率不同,数据点不是硬性地分配给某个簇,而是根据某种权重分配给每个簇,通常使用隶属度(degree of assignment)表示。
  • 更新方式:Soft K-means使用一种加权的方式更新簇中心,数据点到各簇中心的距离决定了其隶属度,这样可以更柔性地调整簇的定义。

steps:

  • 随机初始化簇中心。
  • 对于每个数据点,计算它对每个簇的隶属度(通常使用距离的倒数或某种概率分布)。
  • 更新簇中心时,考虑每个数据点在各个簇中的隶属度,使用加权均值计算。

Gernerative View of clustering 

生成模型的目标是描述数据的生成过程。在这里,假设数据点 x 是根据某个隐变量 z (表示簇的类别)生成的。

  • z 是簇的标签,取值范围 {1,…,K}其中 K 是簇的个数。
  • 每个簇对应一个高斯分布,用于生成数据点。

Univariate Gaussian distribution

Describes a probability distribution for a single random variable that follows a normal distribution.

•    The central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian.
•    In machine learning, we use Gaussians a lot because they make the calculations easy.

Multivariate Gaussian Distribution

Describes a probability distribution for multiple random variables that may be correlated and collectively follow a normal distribution.

Generative Model 

生成模型(Generative Model),常用于聚类问题中,尤其是在高斯混合模型(GMM, Gaussian Mixture Model)中

Clusters from Generative Model

表示数据点 x属于簇 k的概率

(a) 显示数据点在固定簇标签 z 的条件下的分布。

(b) 显示所有数据点的整体边际分布,不考虑簇标签。

(c) 显示数据点在不同簇中的责任概率(隶属度),体现了软聚类的概念。

Maximum Likelihood with Latent Variables

Gaussian Mixture Model (GMM)***

Gaussian Mixture Model (GMM) 是一种基于概率的聚类算法,它假设数据是由多个高斯分布(Gaussian distributions,也叫正态分布)混合生成的。

下面这个式子是高斯混合模型的核心公式,表示数据点的总体概率密度。它与贝叶斯公式密切相关,用于计算后验概率 p(z=k∣x)

逐步解析:

Fitting GMMs: Maximum Likelihood

这个公式表示 高斯混合模型 (GMM)对数似然函数 (log-likelihood),用于评估模型对给定数据的拟合程度。越大越好!!!

Maximum Likelihood

这一组公式是关于 高斯混合模型 (GMM)完全数据对数似然 (complete data log-likelihood),目的是简化最大似然估计 (Maximum Likelihood Estimation, MLE) 的计算。

1:完全数据对数似然的定义

2:分解联合概率

将联合概率分解为条件概率与先验概率:

 3:引入指示函数

最终形式:对于每个数据点 nnn,计算它属于所有簇的条件概率和先验概率的加权对数值。

如果我们知道 z^(n),可以直接通过最大化对数似然来估计参数:

attention,EM算法就是基于这个rk^(n)的

EM fit mixture of Gaussians***

EM 算法是一种迭代优化方法,通过反复迭代 E 和 M 步,最大化对数似然函数,直到收敛:

E 步骤 (Expectation): 估计数据点属于每个簇的概率(即 z(n)的后验分布)。

M 步骤 (Maximization): 最大化完全对数似然,用rk^(n)更新参数 μk 和 πk。

relation to k-means

•    The K-Means Algorithm:
1.    Assignment step: Assign each data point to the closest cluster
2.    Refitting step: Move each cluster center to the average of the data assigned to it

•     The EM Algorithm:
1.    E-step: Compute the posterior probability over 𝑧 given our current model
2.    M-step: Maximize the probability that it would generate the data it is currently responsible for.

AspectGMMK-means
Clustering TypeSoft clustering (probabilities).Hard clustering (clear assignments).
Cluster ShapeElliptical or arbitrary shapes.Spherical (round) clusters.
ModelProbabilistic (Gaussian distribution).Distance-based (Euclidean distance).
OutputProbability of each cluster.Single cluster assignment.
Optimization AimMaximizes the likelihood of the dataMinimizes the sum of squared Euclidean distances
ComplexityComputationally expensive.Computationally efficient.

Lecture11

不考,那我就不学(对手指)(目移)(吹口哨)

Lecture12

不考,那我就不学(对手指)(目移)(吹口哨)


http://www.kler.cn/a/471053.html

相关文章:

  • 【网络协议】IPv4 地址分配 - 第二部分
  • 关于FPGA中添加FIR IP核(采用了GOWIN EDA)
  • 【Python】基于blind-watermark库添加图片盲水印
  • 人工智能训练师一级(高级技师)、二级(技师)考试指南
  • 在环境冲突情况下调整优先级以解决ROS Catkin构建中缺少模块的问题【ubuntu20.04】
  • 谷粒商城-高级篇-Sentinel-分布式系统的流量防卫兵
  • Docker Compose 启动 Harbor 并指定网络
  • Power BI如何连接Azure Databricks数据源?
  • 什么是Lua协同程序?和线程有什么区别?
  • vue.js sync修饰符
  • STM32拓展 低功耗案例1:睡眠模式 (hal)
  • 【学习笔记】数据结构(十)
  • NLP三大特征抽取器(CNN/RNN/TF)
  • 【Uniapp-Vue3】navigator路由与页面跳转
  • Elasticsearch与数据库数据一致性:最佳实践与解决方案
  • 基于大数据爬虫+Python+数据可视化大屏的慧游数据爬虫与推荐分析系统(源码+论文+PPT+部署文档教程等)
  • Linux 安装 meilisearch
  • NUTTX移植到STM32
  • c#使用SevenZipSharp实现压缩文件和目录
  • Appium(一)--- 环境搭建
  • 【简博士统计学习方法】1. 统计学习的定义与分类
  • Functions
  • CANN 学习——基于香橙派 KunpengPro(1)
  • 03-其他
  • Java面试要点114 - Java ThreadLocal原理与内存泄漏
  • 《机器学习》——随机森林