INT301 Bio Computation
Rule Table
Week | Rule Name | Training/Learning Algorithm | Description | Use Case |
---|---|---|---|---|
W1 | Hebb's Rule | Hebbian Learning | Adjusts weights based on correlated activity between neurons. | Unsupervised learning, weight adjustment. |
W3 | Perceptron Rule | Error-Correction Learning | Updates weights only when the output is wrong; converges for linearly separable data. | Binary classification, pattern recognition. |
W3 | Delta Rule | Gradient Descent | Minimizes the squared error of an unthresholded linear unit via gradient descent. | Data that is not linearly separable. |
W4 | Gradient Descent | Batch/Incremental Gradient Descent | Optimizes weights by minimizing error function. | General optimization, classification tasks. |
W4 | Incremental Gradient Descent | Stochastic Gradient Descent | Updates weights after processing each data point. | Faster optimization, noisy updates. |
W12 | Competitive Learning | Winner-Takes-All | Only the winning neuron updates its weights. | Self-organizing, feature extraction. |
W12 | K-means Clustering | Iterative Centroid Finding | Finds cluster centers by minimizing intra-cluster distance. | Clustering, initialization for RBF networks. |
W12 | Oja's Rule | Hebbian-like Learning | Modifies weights to prevent them from growing unbounded. | Principal Component Analysis (PCA), feature extraction. |
Detailed Model Table
Week | Model Name | Network Topology | Input/Output | Use Case |
---|---|---|---|---|
W1 | McCulloch-Pitts Neuron (MP Neuron)-U | Single binary neuron with fixed weights and a threshold-based step activation function. | Binary inputs → Binary outputs | Logic operations, basic neural modeling. |
W2 | Perceptron-S | Single-layer network with fully connected neurons; linear activation function and binary threshold output. | Numerical inputs → Binary classification | Pattern recognition, linear classification. |
W5 | Multi-Layer Perceptron (MLP)-S | Fully connected multi-layer network: input layer → one or more hidden layers with nonlinear activation (e.g., sigmoid) → output layer. | Inputs → Multiclass outputs | Complex pattern recognition, regression. |
W6 | Convolutional Neural Network (CNN)-S | Hierarchical structure: convolutional layers with shared weights and filters → pooling layers → fully connected layers → output layer. | Images → Features/labels | Image recognition, object detection. |
W8 | Radial-Basis Function Network (RBF)-S | Three-layer network: input layer → hidden layer with Gaussian radial-basis functions → output layer with linear weights. | Data → Radial features | Function approximation, classification, non-linear regression.
W9 | Elman Network-S | Recurrent network with: input layer → hidden layer (with recurrent connections for context nodes) → output layer. | Sequential data → Outputs | Time-series prediction, dynamic systems. |
W10 | Recurrent Neural Network (RNN)-S | Recurrent network with feedback connections: input layer → hidden layer (connected to itself over time) → output layer. | Sequence data → Sequential outputs | Sequential data modeling, NLP. |
W10 | LSTM-S | Recurrent network with memory cells: input layer → memory cells (with input, forget, and output gates) → output layer. | Sequence data → Sequential outputs | Long-term dependencies in sequence data.
W11 | Principal Component Analysis (PCA)-U | Linear model that maps high-dimensional data onto lower dimensions using eigenvectors as projection axes. | High-dimensional data → Reduced dimensions | Dimensionality reduction, feature extraction*. (Linear) |
W12 | Auto-encoder-U | Encoder-decoder structure: input layer → hidden layer (compressed representation) → output layer (reconstructed input). | Data → Compressed features | Feature extraction*, denoising. (Non-linear) |
W12 | Deep Auto-encoder-U | Multi-layer encoder-decoder: multiple hidden layers in both encoder and decoder to capture more complex representations. | Data → Compressed features | Complex feature extraction, higher abstraction.
W12 | Denoising Auto-encoder-U | Multi-layer encoder-decoder trained with noisy input → output layer reconstructs clean input. | Noisy data → Clean features | Noise reduction, robust feature extraction.
W12 | K-means Clustering-U* | Not a neural network: iterative centroid-finding procedure that assigns points to the nearest centre and recomputes the centres. | Data → Cluster assignments | Clustering, initialization for RBF networks.
W13 | Self-Organizing Map (SOM)-U* | Grid of neurons with lateral connections: input layer → competitive layer (with weight adaptation) → 2D map. | Data → Clusters | Data visualization (non-linear), clustering.
W14 | Hopfield Network-U | Fully connected recurrent network: all neurons are connected to each other, storing patterns as attractor states. | Binary inputs → Stored patterns | Associative memory, pattern retrieval. |
W1 M-P NEURON & HEBB LEARNING
McCulloch-Pitts Neuron (MP Neuron)
A binary discrete-time element with excitatory and inhibitory inputs and an excitation threshold.
The input values a_i^t from the i-th presynaptic neuron at any instant t may be equal either to 0 or 1 only.
There is an excitation threshold θ associated with the neuron.
The weights of connections wi are
+1 for excitatory type connection and
-1 for inhibitory type connection
Weight (w):
In a neural network, weights regulate the influence of the input data. Each input feature has a corresponding weight, which determines how important that input is to the neuron's output.
Mathematically, each input to a neuron is multiplied by its weight and all the products are summed to form a weighted sum, which determines the output computation.
Activation (a):
The activation is the neuron's output after the activation function is applied. The neuron's total weighted input (the weighted sum) is passed through an activation function, which may be linear or non-linear (e.g., ReLU, sigmoid or tanh).
The activation function introduces non-linearity, enabling the neural network to fit complex functions and to handle non-linear data and complex relationships.
Output x^{t+1} of the neuron at the following instant t+1 is defined according to the rule
In the MP neuron, we call the instantaneous total input S^t the instantaneous state of the neuron.
The state S^t of the MP neuron does not depend on the previous state of the neuron itself, but is simply determined by its current inputs.
Activation function
The neuron output x^{t+1} is a function of its state S^t, so the output can also be written as a function of discrete time,
where g is the threshold activation function
Here H is the Heaviside (unit step) function:
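As a concrete illustration, here is a minimal sketch of an MP neuron in Python/NumPy, assuming +1/−1 weights for excitatory/inhibitory inputs and the Heaviside step activation described above; the function name and the AND-gate example are illustrative, not from the slides.

```python
import numpy as np

def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: binary inputs, fixed +1/-1 weights, threshold theta."""
    s = np.dot(weights, inputs)          # instantaneous state S^t
    return 1 if s >= theta else 0        # Heaviside step activation

# Example: a 2-input AND gate (both connections excitatory, threshold 2)
for a in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, mp_neuron(np.array(a), np.array([1, 1]), theta=2))
```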
ANN Learning Rules
Learning is the ability to acquire new knowledge through experience.
Learning means to change in response to experience.
The ideal free parameters to adjust, and so to resolve learning without changing patterns of connections, are the weights of connections wji
ANN learning rule: how to adjust the weights of connections to get desirable output.
Much work in artificial neural networks focuses on the learning rules that define
how to change the weights of connections between neurons to better adapt a network to serve some overall function.
Hebb’s Rule
Hebb proposed that a particular type of use-dependent modification of the connection strength of synapses might underlie learning in the nervous system (a theory of neural plasticity explaining how neurons strengthen their connections through correlated activity, supporting learning and memory).
Hebb proposed a neurophysiological postulate:
“ …when an axon of a cell A
1) is near enough to excite a cell B and
2) repeatedly and persistently takes part in firing it
then the weight of the connection is increased at every next instant in the way:
Δw_ji = C · a_i · x_j, where a is the input value, x is the output value and C is the learning rate; the weight changes if and only if both the input and the output equal 1.
Equations emphasize the correlation nature of a Hebbian synapse
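A minimal sketch of this Hebbian update for the connections into one output unit, assuming the Δw = C·a·x form just described with binary activities; the variable names are illustrative.

```python
import numpy as np

def hebb_update(w, a, x, C=0.1):
    """Hebb's rule: strengthen w only where input a and output x are both active."""
    return w + C * a * x

w = np.zeros(3)                       # weights from 3 input units to one output unit
a = np.array([1, 0, 1])               # presynaptic activities
x = 1                                 # postsynaptic activity
for _ in range(5):                    # repeated co-activation keeps increasing the weights
    w = hebb_update(w, a, x)
print(w)                              # -> [0.5, 0.0, 0.5]; note the unbounded growth
```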
Hebb's Rule in practice?
W2 Supervised learning model: Perceptron
Learning
data
A set of data records (also called examples, instances or cases) described by
k attributes: A1, A2, … Ak.
a class: Each example is labelled with a pre-defined class.
goal
To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Supervised learning: classification is seen as supervised learning from examples.
Unsupervised learning (e.g., clustering): class labels of the data are unknown.
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy
Fundamental assumption of learning: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).
*To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
ANN for Pattern Recognition
Perceptron
Perceptrons are neural networks that change with "experience" using an error-correcting rule.
According to the rule, the weight of a response unit changes when it makes an erroneous response to stimuli presented to the network.
Units: The simplest perceptron architecture comprises two layers of idealised "neurons": one layer of input units and one layer of output units. The two layers are fully interconnected.
abstract neurons
processing elements of the perceptron
Input
The total input to the output unit j is
The sum is taken over all n+1 input units connected to the output unit j.
Bias input
z = w1x1 + w2x2 + ... + wnxn + b
There is a special bias input unit number 0 in the input layer.
The bias unit always produces the input a_0 with the fixed value of +1.
The input a_0 of the bias unit functions as a constant term in the sum.
The bias unit connection to output unit j has a weight adjusted in the same way as all the other weights
Output
The output value x_j of the output unit j depends on whether the weighted sum is above or below the unit's threshold value. In short, the output is only 0 (not activated) or 1 (activated).
output vector
the ordered set of instant outputs of all units in the output layer constitutes an output vector of the network
Perceptron Training
Weights w_ji of the connections between the two layers are changed according to the perceptron learning/training rule (the process of weight adjustment), so that the network is more likely to produce the desired output in response to certain inputs.
The errors are computed and used to re-adjust the values of the weights of connections.
The weights re-adjustment is done in such a way that the network is, on the whole, more likely to give the desired response next time.
Here C is the learning rate!
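A minimal sketch of the perceptron training rule described above, assuming a thresholded output, the update Δw = C·(target − output)·input, and a bias treated as input a_0 = +1; the OR-function data are illustrative.

```python
import numpy as np

def train_perceptron(X, t, C=0.1, epochs=20):
    """Error-correction rule: w <- w + C * (target - output) * input."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input a_0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for a, target in zip(X, t):
            y = 1 if np.dot(w, a) >= 0 else 0      # thresholded output
            w += C * (target - y) * a              # weights change only when the error is non-zero
    return w

# Example: learn the (linearly separable) OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
print(train_perceptron(X, t))
```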
W3
Convergence of Perceptron
Delta rule
The slides then spend quite a lot of space working through this calculation:
Perceptron Rule
A weight of connection changes only if both the input value and the error of the output unit are not equal to 0.
The algorithm converges to the correct classification, if
• the training data is linearly separable;
• and the learning rate is sufficiently small (usually set below 1); it determines the amount of correction made in a single iteration.
For any data set that’s linearly separable, the learning rule is guaranteed to find a solution in a finite number of steps.
Assumption:
• At least one such set of weights, w*, exists
• There are a finite number of training patterns.
• The threshold function is uni-polar (output is 0 or 1).
Network Performance
The network performance during training session can be measured by a root-mean-square (RMS) error value.
x^i is a constant.
The best performance of the network corresponds to the minimum of the RMS error
The network's performance depends on the setting of the connection weights, in particular when the root-mean-square (RMS) error is used as the evaluation metric. In other words, the output error of the network is directly related to the weight adjustment; other factors that may affect the RMS error are not discussed here.
Fail: eventually performance stops improving, and the RMS error does not get smaller regardless of number of iterations.
Success: able to classify patterns similar to those of the training set.
Perceptron as network for classification
If we group the weights as a vector w , the net output y can be expressed as:
The set of points with w·x = 0 is a hyperplane, which in 2D is a straight line.
For 2 classes, view the net output as a discriminant function y(x, w), where:
y(x, w) = 1, if x in class 1 (C1)
y(x, w) = -1, if x in class 2 (C2)
Decision boundary
For m classes, a classifier should partition the feature space into m decision regions
The line or curve separating the classes is the decision boundary.
decision surface
A perceptron represents a hyperplane decision surface in d-dimensional space, for example, a line in 2D, a plane in 3D, etc.
↑ This is the equation for points in x-space that are on the boundary
Linear Separability Problem
If two classes of patterns can be separated by a decision boundary, represented by the linear equation, then they are said to be linearly separable and the perceptron can correctly classify any such patterns.
Without the bias term, the hyperplane is forced to pass through the origin.
The decision boundary (i.e., W, b) of linearly separable classes can be determined either by some learning procedure, or by solving linear equation systems based on representative patterns of each class.
If such a decision boundary does not exist, then the two classes are said to be linearly inseparable.
Linearly inseparable problems cannot be solved by the simple perceptron network; a more sophisticated architecture is needed.
Linearly inseparable example
Linearly separable example
ANN Building Tips
1. Understand and specify the problem in terms of inputs and required outputs
2. Take the simplest form of network you think might be able to solve your problem
3. Try to find the appropriate connection weights (including neuron thresholds) so that the network produces the right outputs for each input in its training data
4. Make sure that the network works on its training data and test its generalization by checking its performance on new testing data
5. If the network doesn’t perform well enough, go back to stage 3 and try harder
6. If the network still doesn’t perform well enough, go back to stage 2 and try harder
7. If the network still doesn’t perform well enough, go back to stage 1 and try harder
8. Problem solved – or not
W4 MORE ON PERCEPTRON LEARNING
Gradient Descent Rule
Perceptron rule fails if data is not linearly separable
Solution: uses gradient descent to search the hypothesis space
Use an unthresholded linear unit and an appropriate error measure:
The training is a process of minimizing the error E(w) in the steepest direction (most rapid decrease), that is in direction opposite to the gradient:
which leads to the gradient descent training rule:
Error Surface
the axes w0, w1 represent possible values for the two weights of a simple linear unit
the error surface is parabolic with a single global minimum
Weight update derivation
where x_ie denotes the i-th component of the example e. The gradient descent training rule becomes:
Gradient Descent Learning Algorithm (batch gradient descent)
- Update frequency: batch gradient descent updates the weights only once all samples of the training set have passed through the network (i.e., after the loss over the whole dataset has been computed).
- Computation: in each training round it computes the gradient from the error over the entire training set and adjusts the weights according to this global gradient.
- Pros and cons: because the whole dataset is used for each update, the update direction is stable and convergence is usually smooth, but each update is expensive for large datasets and training is slower.
Incremental gradient descent (stochastic gradient descent)
- Update frequency: the weights are updated immediately after each sample is processed, rather than after the whole dataset has been processed.
- Computation: the gradient is computed from the error of a single sample and the weights are updated at once, so the update direction may vary considerably between iterations because each sample's error is different.
- Pros and cons: the frequent updates reflect new sample information more quickly and training is faster, which is especially suitable for large datasets; however, the noise from per-sample updates makes convergence less stable, with more fluctuation along the path.
The gradient descent rule updates the weights after calculating the total error accumulated over all examples, whereas the incremental rule approximates this gradient-descent error decrease by updating the weights after each training example.
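A minimal sketch contrasting the two update schedules for a linear unit with squared error, as described above; both functions and the toy data are illustrative, not the lecture's notation.

```python
import numpy as np

def batch_gd(X, t, lr=0.01, epochs=100):
    """Accumulate the gradient over the whole training set, then update once per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        err = t - X @ w                 # errors of the unthresholded linear unit
        w += lr * X.T @ err             # one update from the total accumulated error
    return w

def incremental_gd(X, t, lr=0.01, epochs=100):
    """Stochastic/incremental version: update immediately after each example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            err = target - x @ w
            w += lr * err * x           # noisy per-example update
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column acts as the bias input
t = np.array([1.0, 2.0, 3.0])
print(batch_gd(X, t), incremental_gd(X, t))
```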
Steps
Gradient Descent vs Perceptron
Gradient descent finds the decision boundary which minimizes the sum squared error of the (target - net) value rather than the (target - output) value
its decision boundary may leave more instances misclassified than the perceptron rule would, i.e., it could have a higher misclassification rate than with the perceptron rule
uses unthresholded linear unit
converges asymptotically toward a minimum error hypothesis
termination is not guaranteed
linear separability not necessary
Perceptron rule (target - thresholded output) guaranteed to converge to a separating hyperplane if the problem is linearly separable.
Perceptron rule will find the decision boundary which minimizes the classification error – if the problem is linearly separable
uses thresholded unit
converges after a finite number of iterations
output hypothesis classifies training data perfectly
linear separability necessary
Sigmoidal Perceptrons
Simple single-layer perceptrons with threshold or linear activation functions cannot be generalized to more powerful learning machines such as multilayer neural networks. This is the reason for developing single-layer perceptrons with sigmoid activation functions.
training process
Gradient Descent Learning Algorithm for Sigmoidal Perceptrons
eg
Incremental Gradient Descent Learning Algorithm for Sigmoidal Perceptrons
Steps
W5 MULTI-LAYER PERCEPTRON
The multilayer perceptron (MLP) is a hierarchical structure of several perceptrons, which overcomes the shortcomings of single-layer networks (it is able to learn nonlinear function mappings).
power:
learning arbitrary functions
learning continuous functions
learning Boolean functions
Differentiable Activation Functions
sigmoid
hyperbolic tangent
Multilayer Network Structure
The hidden units enable the multilayer network to learn complex tasks by extracting progressively more meaningful information from the input examples.
Backpropagation
MLP became applicable on practical tasks after the discovery of a supervised training algorithm, the error backpropagation learning algorithm.
Backpropagation is an algorithm for training neural networks. It updates the network weights in the following steps:
- Forward pass: the input data are propagated through the network to compute the output.
- Error computation: the loss (e.g., mean squared error) is computed from the output and the target value.
- Backward pass: the error is propagated back from the output layer, and the gradient of each layer is computed via the chain rule.
- Weight update: gradient descent updates the weights of each layer according to the error gradients.
On-line Training
Batch Training
Revision by epoch is called batch learning. From a mathematical point of view, the error derivatives should be computed after each epoch, i.e., after all examples in the training set have been processed.
Properties
* Presenting the training samples in random order works better.
* Initialization is usually random; the initial random weights lead to different final results.
* Hidden layers: although it may be convenient to specify more than one hidden layer (a two-hidden-layer network is more powerful, but for many tasks met in practice one hidden layer may be accurate enough), extra layers do not increase the representational power for recognition and instead slow down network training.
* Stopping falls into three cases: 1. the error falls below a target value; 2. the preset maximum number of epochs is reached; 3. an early-stopping strategy based on a validation set.
learning rate
Although it is possible to fit the training data well, applying backpropagation to predict performance on independent test data is full of difficulties and pitfalls. The choice of learning rate is crucial for finding the true global minimum of the error distance. One wants to use the largest learning rate that still converges to the minimum solution.
Momentum
Momentum stabilizes the weight changes by combining the gradient-descent term with a small fraction of the previous weight change (this gives the system a certain amount of inertia, since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term). Its benefits (see the sketch after this list):
1) it smooths the weight changes and suppresses cross-stitching, that is, it cancels side-to-side oscillations across the error valley;
2) when the weight changes are all in the same direction, the momentum amplifies the learning rate, causing a faster convergence;
3) it enables the search to escape from small local minima on the error surface.
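A minimal sketch of the momentum update, assuming the common form Δw(t) = −η∇E + α·Δw(t−1); the elongated quadratic error surface is an illustrative stand-in for the error valley mentioned above.

```python
import numpy as np

def gd_with_momentum(grad_fn, w0, lr=0.1, alpha=0.9, steps=50):
    """Gradient descent where a fraction alpha of the previous step is added to the new one."""
    w = np.array(w0, dtype=float)
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta = -lr * grad_fn(w) + alpha * delta   # inertia term smooths oscillations
        w = w + delta
    return w

# Illustrative elongated quadratic error surface E(w) = 0.5*(10*w0^2 + w1^2)
grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
print(gd_with_momentum(grad, [1.0, 1.0]))          # moves towards the minimum at (0, 0)
```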
Generalization & Overfitting
Generalization is a model's ability to perform well on unseen data, while overfitting means the model performs too well on the training data but poorly on new data, because it has fitted noise or details of the training set.
Causes:
Preventing overfitting:
1.Use network that is just large enough to provide an adequate fit
Don’t use a bigger network when a smaller one will work
The network should not have more free parameters than there are training examples
2. Overfitting may be prevented by early stopping, network pruning, and applying regularization techniques. (With sufficient nodes an MLP can classify any training set exactly, but it may have poor generalisation ability.)
3. Weight Decay: decrease each weight by some small factor during each iteration.
The motivation: to keep weight values small.
Add penalty term to the error function
Penalizes large weights to reduce variance
Standard weight decay equation
Weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would.
Large weights can hurt generalization in two different ways.
Excessively large weights leading to hidden units can cause the output function to be too rough, possibly with near discontinuities.
Excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data.
The main risk with large weights is that the non-linear node outputs could be in one of the flat parts of the transfer function, where the derivative is zero. In such case the learning is irreversibly stopped.
4. Cross-Validation:
a set of validation data in addition to the training data.
The algorithm monitors the error w.r.t. this validation data while using the training set to drive the gradient descent search.
Two copies of the weights are kept: one copy for training and a separate copy of the best weights thus far, measured by their error over the validation set. Once the trained weights reach a higher error over the validation set than the stored weights, training is terminated and the stored weights are returned.
What if the dataset is too small? ↓
K-fold cross validation
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, one is retained as the validation data for testing the model, and the remaining K−1 subsamples are used as training data.
- Split the dataset into K equal parts (typically 5 or 10 folds).
- Each time, use one part as the validation set and the other K−1 parts as the training set; repeat K times with a different validation fold each time.
- Finally, average the K results to evaluate the model's performance.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as the validation data, with the remaining observations as the training data.
- With n samples in the dataset, perform n rounds of cross-validation.
- Each round leaves out exactly one sample as the validation set and uses the other n−1 samples as the training set; repeat n times.
- Finally, average the n results to evaluate the model.
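A minimal K-fold cross-validation sketch following the steps above (LOOCV is the special case K = n); the train(...) and evaluate(...) callables are placeholders of my own, not functions from the lectures.

```python
import numpy as np

def k_fold_cv(X, y, train, evaluate, K=5, seed=0):
    """Split data into K folds; each fold is used exactly once as the validation set."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]                                    # current validation fold
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train(X[tr], y[tr])                       # fit on the other K-1 folds
        scores.append(evaluate(model, X[val], y[val]))    # score on the held-out fold
    return np.mean(scores)                                # average of the K results

# LOOCV: k_fold_cv(X, y, train, evaluate, K=len(X))
```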
Limitations & Capabilities of MLP
W6 INTRODUCTION TO CONVOLUTIONAL NEURAL NETWORK
Convolutional Network (CNN)
1.1 卷积神经网络基础_哔哩哔哩_bilibili
For getting started, or for going a little deeper, I strongly recommend this uploader's videos (although getting into CNNs now is rather late to the party).
Convolutional Neural Networks is extension of traditional Multi-layer Perceptron, based on 3 ideas:
1. Local receptive fields
2.Shared weights
3.Spatial / temporal sub-sampling
Sparse Connectivity
CNNs exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers, i.e., the inputs of hidden units in layer m are from a subset of units in layer m-1, units that have spatially contiguous receptive fields
Sparse connectivity in Convolutional Neural Networks (CNNs) refers to the idea that not every neuron in one layer is connected to every neuron in the previous layer. Instead, neurons are selectively connected to certain parts of the input. In contrast to fully connected networks, where every neuron in a layer connects to all neurons in the previous layer, CNNs use sparse connections to focus only on local regions of the input data.
In the context of convolutional layers, sparse connectivity is implemented through the use of convolutional filters (or kernels). Each filter only connects to a small local region of the input feature map, known as the receptive field, rather than the entire input image. This approach reduces the number of parameters and computational complexity compared to fully connected networks.
Imagine that layer m−1 is the input retina. In the figure above, units in layer m have receptive fields of width 3 in the input retina and are thus connected only to 3 adjacent neurons in the retina layer. Units in layer m+1 have similar connectivity with the layer below: their receptive field with respect to the layer below is also 3, but their receptive field with respect to the input is larger (5). Each unit is unresponsive to variations outside of its receptive field with respect to the retina. The architecture thus ensures that the learnt "filters" produce the strongest response to a spatially local input pattern.
Shared Weights
In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map
In the figure above, we show 3 hidden units belonging to the same feature map. Weights of the same colour are constrained to be identical. Gradient descent can still be used to learn such shared parameters, with only a small change to the original algorithm. Replicating units in this way allows features to be detected regardless of their position in the visual field. In addition, weight sharing greatly reduces the number of free parameters to learn and thus increases learning efficiency. These constraints on the model enable CNNs to achieve better generalization on vision problems.
Convolution layer
A feature map is obtained by repeated application of a function across sub-regions of the entire image, in other words, by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function.
Here we only consider the computation.
If we denote the k-th feature map at a given layer as h^k, whose filters are determined by the weights W^k and bias b_k, then the feature map is obtained using convolution as follows (for tanh non-linearities): h^k_{ij} = tanh((W^k * x)_{ij} + b_k).
Suppose that we have some NxN square neuron layer which is followed by our convolutional layer. If we use an mxm filter ω, our convolutional layer output will be of size (N−m+1)x(N−m+1).
Convolution Kernel
The convolution kernel computation multiplies each element of the kernel by the corresponding element of the input at every position and sums the products.
Padding
Before the convolution layer processes the data, fixed values such as 0 are sometimes padded around the input data.
Stride
The interval between positions at which the filter is applied.
calculation*
Very important!
input size(H,W)
Filter size (FH, FW)
Output size (OH, OW)
padding P
stride S
channel C
The number of filters (also the number of output channels) K
output size
OH = (H+2P-FH)/S +1
OW = (W+2P-FW)/S +1
In general, though, the input, output and filter are all square,
so if H =W, FH = FW = F
O = (H + 2P -F)/S +1
parameters
(FH × FW × Cin + 1) × K (the +1 is the bias term, if present)
The number of parameters is determined by the filter size and the number of filters, regardless of the stride
e.g
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Output volume size:
(32+2*2-5)/1+1=32 spatially, so 32x32x10
Number of parameters in this layer
Each filter has 5*5*3 +1 = 76 params => 76*10=760
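A short sketch reproducing the worked example above with the two formulas; the helper names are mine.

```python
def conv_output_size(H, F, P, S):
    """O = (H + 2P - F) / S + 1 for a square input and filter."""
    return (H + 2 * P - F) // S + 1

def conv_params(F, C_in, K, bias=True):
    """(F*F*C_in + 1) * K when a bias is present; the stride does not matter."""
    return (F * F * C_in + (1 if bias else 0)) * K

# The example above: 32x32x3 input, ten 5x5 filters, stride 1, pad 2
print(conv_output_size(32, 5, 2, 1))   # -> 32, so the output volume is 32x32x10
print(conv_params(5, 3, 10))           # -> 760 parameters
```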
Fully connected layer
The concept of fully connected layers was covered earlier; here we only cover the calculation.
parameter
parameters = (Nin × Nout) + Nout (bias)
= (Nin + 1) × Nout
Non-linearity Layer
Not only ReLU: sigmoid and tanh are also commonly used.
RELU (Rectified Linear Units)
ReLU(x)=max(0,x)
That is, the output stays equal to the input when the input is positive, and the output is zero when the input is negative. ReLU has the following characteristics:
• ReLUs are much simpler computationally
– The forward and backward passes through an ReLU are both just a simple if statement
– The sigmoid activation requires computing an exponent
– This advantage is huge when dealing with big networks with many neurons, and can significantly reduce both training and evaluation times
• Sigmoid activations are easier to saturate
– There is a comparatively narrow interval of inputs for which the sigmoid's derivative is sufficiently nonzero
– In other words, once a sigmoid reaches either the left or right plateau, it is almost meaningless to make a backward pass through it, since the derivative is very close to 0
• ReLUs only saturate when the input is less than 0
– Even this saturation can be eliminated using leaky ReLUs
• For very deep networks, saturation hampers learning, and so ReLUs provide a nice workaround
However, ReLU also has drawbacks, such as the "dying ReLU" phenomenon: if a neuron's input is always below 0, its weights will never be updated.
Sigmoid
1/(1+e^-x)
Sigmoid maps any real-valued input to a value between 0 and 1. The function squashes large positive values towards 1 and large negative values towards 0.
Computation: Sigmoid requires calculating an exponential function for each input, which is more computationally expensive compared to ReLU.
Frequently used in the output layer for binary classification tasks, where the output needs to be a probability (between 0 and 1). For example, in logistic regression and binary classification neural networks.
calculation
In short, the non-linearity is applied element-wise, so the output has the same size as the input and every value must be computed:
OH×OW×K
Pooling Layer
The pooling layer is commonly used in convolutional neural networks (CNNs); its main function is to reduce the size of the feature maps, which reduces computation and memory usage and helps control overfitting. The pooling layer downsamples local regions of the input feature map, preserving important information while discarding irrelevant detail. Achieving invariance to changes in position or lighting conditions, robustness to clutter, and compactness of representation are all common goals of pooling.
Subsampling (pooling) Mechanism
The exact positions of the extracted features are not important
Only relative position of a feature to another feature is relevant
Reduce spatial resolution – Reduce sensitivity to shift and distortion
Common pooling operations:
- Max pooling: select the maximum value within each local region as the output; mainly used to retain the most salient features while reducing the size of the feature map.
- Average pooling: compute the mean of each local region as the output; tends to smooth the feature map and reduce local noise.
- Global pooling: pool over the entire feature map, usually to produce a one-dimensional vector (e.g., in the last layer of a classification network).
Pooling layers do not have learnable parameters (no weights or biases). They perform a fixed operation (max or average) on the input feature map and reduce its spatial size.
Normalization Layer
normalization layer is used to adjust and scale the activations of neurons in a network to ensure that the input to each layer of the network has a consistent distribution, which helps the model train more efficiently and converge faster.
CNN training
This is the typical processing at test time.
At training time, we need to compute an error measure and tune the parameters to decrease the error.
Tune the parameters to decrease the loss: if the loss is differentiable we can compute gradients, and we can use back-propagation to compute the gradients w.r.t. the parameters at the lower layers.
Transfer Learning
The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks (in new domains)
Lecture8 RADIAL-BASIS FUNCTION NETWORKS
Radial-Basis Function Networks (RBFNs) are feedforward neural networks typically used for classification, regression and function approximation. An RBFN is a special type of neural network whose core idea is to use radial basis functions (RBFs), usually Gaussians, as activation functions.
An RBFN usually consists of three layers:
- Input layer: receives the input data.
- Hidden layer: each neuron's activation is computed from the distance between the input and a centre point (usually via a Gaussian function).
- Output layer: the final result is obtained as a linear combination of the hidden-layer outputs.
The biggest differences from traditional neural networks (such as the multilayer perceptron) lie in the hidden-layer activation function and the training method. Because of their local nature, RBF networks perform very well on data with local structure (e.g., function approximation and interpolation tasks) and are well suited to non-linear problems, especially when the data distribution is uneven. Traditional networks such as the MLP learn complex global patterns through fully connected layers and are applicable to a wide range of tasks, but may require longer training, especially for deep networks.
To solve:
Curve-fitting and interpolation are two mathematical techniques for modelling the relationship between data points; both aim to find a function f(x) that describes or approximates the distribution of the data points.
Radial-Basis Function
The radial-basis functions technique suggests constructing of interpolation functions F of the following form:
Commonly used functions:
Gaussian Function
Structure
- input layer: which passes the example vectors to the next layers
- hidden layer: applies a non-linear transformation to the inputs and expands them into the usually very high-dimensional hidden space. The neurons of this layer use a radial basis function as the activation function, most commonly a Gaussian; each hidden neuron has a specific centre (prototype), and its output is computed from the distance between the input and that centre.
- output layer: applies a linear transformation from the hidden space to the output space. It is usually a linear layer that maps the hidden-layer outputs to the final prediction; the number of output neurons depends on the task (e.g., the number of classes for classification or the output dimension for regression).
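A minimal sketch of the RBF forward pass just described, assuming Gaussian hidden units with given centres and a shared width, and a linear output layer; all names and values are illustrative.

```python
import numpy as np

def rbf_forward(x, centers, sigma, W, b):
    """Hidden layer: Gaussian of the distance to each centre; output layer: linear combination."""
    d2 = np.sum((centers - x) ** 2, axis=1)        # squared distances to the centres
    phi = np.exp(-d2 / (2.0 * sigma ** 2))         # Gaussian radial-basis activations
    return W @ phi + b                             # linear output layer

centers = np.array([[0.0, 0.0], [1.0, 1.0]])       # e.g. found by k-means (see below)
W = np.array([[0.5, -0.5]])                        # output weights for one output unit
print(rbf_forward(np.array([0.2, 0.1]), centers, sigma=1.0, W=W, b=np.array([0.0])))
```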
Function Mapping
RBF Learning
Training Algorithm
Finding Centers - Clustering
Cluster Analysis
Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Cluster analysis
- Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Distance
Distances are normally used to measure the similarity or dissimilarity between two data objects
K-means
- The algorithm partitions data points into K disjoint subsets.
- The clustering criteria:
  - the cluster centers are set in the high density regions of data
  - a data point is assigned to the cluster with which it has the minimum distance to the center
issues
- The numbers of clusters must be specified in advance
- not suitable for clusters with non-convex shapes
- sensitive to noise and outlier elements
improvement
Repeated k-means
- Try several random initializations and take the best.
Better initialization
- Use some better heuristic to allocate the initial distribution of code vectors.
- Designing a good initialization is not any easier than designing a good clustering algorithm in the first place!
Regularization Networks
Regularization networks are a machine-learning approach to interpolation and curve-fitting problems; they control model complexity through an explicit regularization term and are closely related to kernel methods. Compared with ordinary neural networks, they put more emphasis on balancing fitting accuracy against model smoothness, which makes them particularly suitable for sparse-data modelling and for high-dimensional or noisy data.
The goal of a regularization network is to find a function f(x) such that:
- it fits the training data well (minimizing the data error);
- it has a certain smoothness or complexity constraint (controlled by the regularization term).
The regularization network models the interpolation function F as a linear superposition of multivariate Gaussian functions, whose number is equal to the number of the given input examples N:
An important property of regularization networks is that they are universal approximators: with a sufficient number of kernel functions and the regularization term, a regularization network can approximate any multivariate continuous function on a compact set to arbitrary accuracy.
problem
computationally inefficient; the network topology should be reduced.
This means that we will attempt to find a solution that approximates the solution produced by the regularization network.
Comparison
Lecture 9 TIME-SERIES PREDICTION AND ELMAN NETWORK
A time series is a series of time-ordered, comparable observations y_t recorded at equidistant time intervals.
In statistics and machine learning, a time-series is often described by a sequence of vectors (or scalars) which depend on time t:
{x(t0),x(t1)...x(ti),x(ti+1)...}
It is the output of some process P that we are interested in: prediction goes from the observed time series to the unknown dynamics (process).
- The continuous signal x(t) is sampled at discrete points to get a series.
- In uniform sampling, if the sampling period is Δt:
{x[t]} = {x(0),x(Δt), x(2Δt),x(3Δt)...}
Extending backward from time t, we have time series {x[t], x[t −1], · · ·}. From this, we now want to estimate x at some future time
solution:
1. Assume a generative model.
2. For every point x[ti] in the past, train the generative model with what preceded ti as the inputs and what followed ti as the desired output.
3. Now run the model to predict x̂[t+s] from (x[t], x[t-1], ...)
difficulties
Limited quantity of data
Noise
Non-stationarity
Forecasting method selection
ANN prediction
- A number of adjoining data points of the time series (the input window Xt-s, Xt-s+1, …, Xt) are mapped to the interval [0,1] and used as activations for the units of the input layer.
- The size s of the input window corresponds to the number of input units of the neural network.
- In the forward path, the activations are propagated over the hidden layers to the output units. The error used for BP training is computed by comparing the value of the output unit with the value of the time series at time t+1.
- Training an MLP network with the BP learning algorithm usually requires that all representations of the input set (called one epoch) are presented many times.
Dynamic Network
Sometimes we require our model to be sensitive to inputs that were presented some time ago. This requirement is not met by a function (no matter how complex it may be) and demands a model which maintains a state, or memory, over several presentations.
Sequence Learning
Sequence learning is not a specific network architecture but a kind of learning task: processing and predicting sequential data. Such tasks include language modelling, time-series prediction, machine translation and speech recognition. To handle them, ANNs usually use special network structures that capture the temporal dependencies in the data, such as RNNs, LSTMs and GRUs.
• MLP & RBF networks are static networks, i.e., they learn a mapping from a single input signal to a single output response for an arbitrary large number of pairs.
• Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of pairs (signal, sequence).
• Typically the input signal to a dynamic network is an element of the sequence and then the network produces the rest of the sequence as a response.
• To learn sequences we need to include some form of memory (short term memory) to the network.
• We can introduce memory effects with two principal ways:
• Implicit: e.g., Time lagged signal as input to a static network or as recurrent connections
• Explicit: e.g., Temporal Backpropagation Method
• In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. For the explicit form the environment could be non-stationary, i.e. the network can track the changes in the structure of the signal.
Dynamical Neural Networks
This is the broadest category: neural networks that can model dynamical systems and time series. DNNs include many types of network, such as RNNs, TDNNs and the Elman network. In other words, DNN is an umbrella term covering many specific dynamical neural network models.
- Recurrent networks can explicitly deal with inputs that are presented sequentially, as they almost always would be in real problems.
- Recurrent networks are fundamentally different from feedforward networks in that they operate not only upon an input but also upon a state.
- The ability of the net to reverberate and sustain activity can serve as a working memory.
- Such nets are Dynamical Neural Networks.
Structural relationship ↓
DNNs (Dynamical Neural Networks)
├── RNNs (Recurrent Neural Networks)
│ ├── Elman Network
└── TDNs (Time Delayed Networks)
Time Delayed Networks
Time Delayed Networks (TDNNs) capture temporal information by introducing a delay structure.
Signals can be buffered externally and presented at additional input nodes (input banks). Time Delayed Networks illustrate the implicit representation approach for sequence learning, which combine a static network (e.g. MLP / RBF) with a memory structure.
Elman Network**
The Elman network is a special kind of recurrent neural network (RNN) whose hidden layer includes a short-term memory unit, enabling it to learn temporal dependencies.
The Elman network typically distinguishes between a state-output function and an input-state mapping.
In the Elman net, the internal representations are engaged at the next step as an additional input.
Note that all equations are the same as for the feedforward case. The only difference is that we have another set of weights (U) which feed activations from the hidden nodes back to the hidden nodes. Importantly, the feedback is delayed. The Elman network is typically used to model dynamical systems: it keeps an extra context state holding the hidden-layer output of the previous time step, which makes it very effective at capturing temporal dependencies and the dynamic behaviour of a system.
Layers
- Input layer
- Hidden layer forming the internal representation
- Context layer
The value of each context neuron is used as an extra input signal for all the neurons in the hidden layer one time step later. In an Elman network, the weights from the hidden layer to the context layer are set to one and are fixed, because the values of the context neurons have to be copied exactly.
- Output layer: linear combination (a forward-pass sketch follows below)
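A minimal sketch of one Elman forward step in NumPy, assuming a tanh hidden layer, a linear output layer, and a context vector that is an exact copy of the previous hidden state; all sizes and weights are illustrative.

```python
import numpy as np

def elman_step(x, context, W_in, U, W_out, b_h, b_o):
    """One Elman time step: hidden state from input + context (previous hidden state),
    linear output, and the new hidden state returned as the next context."""
    h = np.tanh(W_in @ x + U @ context + b_h)   # hidden layer with recurrent (context) input
    y = W_out @ h + b_o                         # linear output layer
    return y, h                                 # h is copied into the context for the next step

# Illustrative sizes: 2 inputs, 3 hidden/context units, 1 output
rng = np.random.default_rng(0)
W_in, U = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
W_out, b_h, b_o = rng.normal(size=(1, 3)), np.zeros(3), np.zeros(1)
context = np.zeros(3)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    y, context = elman_step(x, context, W_in, U, W_out, b_h, b_o)
    print(y)
```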
learning
Lec 10 RECURRENT NEURAL NETWORK
The RNN is a neural network model designed for processing sequential data; it has a notable ability to capture temporal dependencies in time-series data.
vs Feed forward
feed forward
- accept a fixed-sized vector as input
- produce a fixed-sized vector as output
- usually a fixed amount of computational steps
rnn
- the idea behind RNNs is to make use of sequential information; in a traditional neural network all inputs (and outputs) are assumed independent of each other
- but for many tasks that is a very bad idea: if you want to predict the next word in a sentence you had better know which words came before it
RNN Architecture
The core feature of an RNN is its recurrent connections: the output of the hidden layer is fed back into the network's input, forming a loop. This lets the RNN retain information from previous time steps and provide context for processing later ones.
- RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations (memory).
- inputs x(t)
- outputs y(t)
- hidden state s(t): memory of the network
- a delay unit is introduced to hold activations until they are processed at the next step
- The decision a recurrent net reaches at time step t-1 affects the decision it will reach one moment later at time step t (in short, data from earlier time steps influence later ones).
- two sources of input: the present and the recent past, which combine to determine how the network responds to new data
RNN topologies range from partly recurrent to fully recurrent:
- partly recurrent is a layered network with distinct input and output layers where the recurrence is limited to the hidden layer
- in fully recurrent networks each node gets inputs from all other nodes
RNN Forward Pass*
If a network training sequence starts at time t0 and ends at t1, one possible choice of the total loss function is the sum over time of the square error function
unfolding*
- The recurrent network can be converted into a feedforward network by unfolding it over time. The diagram above shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence.
- For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.
- This means all the earlier theory about learning in feedforward networks carries over.
- In an RNN, errors can be propagated through more than 2 layers in order to capture longer history information.
- This process is usually called unfolding.
- In an unfolded RNN, the recurrent weights are duplicated for an arbitrary number of time steps.
- The diagram above has outputs at each time step, but depending on the task this may not be necessary.
- For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.
Back Propagation Through Time (BPTT)
BPTT is a training method for RNNs that optimizes the network parameters (such as weights and biases) through time. It is an extension of the standard backpropagation algorithm designed for the time-series nature of RNNs.
- The BPTT learning algorithm is an extension of standard backpropagation that performs gradient descent on an unfolded network.
- The gradient-descent weight updates have contributions from each time step.
- The errors have to be back-propagated through time as well as through the network.
- For recurrent networks, the loss function depends on the activation of the hidden layer both through its influence on the output layer and through its influence on the hidden layer at the next step.
But beware of vanishing gradients! Vanishing gradients can become too small for computers to work with or for networks to learn from.
LSTM
LSTM was designed to solve the vanishing- and exploding-gradient problems of standard RNNs. By introducing gating mechanisms, it allows the network to capture long-range dependencies more effectively.
Architecture
- The basic unit in the hidden layer of an LSTM network is a memory block; it replaces the hidden unit of a traditional RNN.
- A memory block contains one or more memory cells and a pair of adaptive multiplicative gating units which gate the input and output to all cells in the block.
- Memory blocks allow cells to share the same gates, thus reducing the number of parameters.
input
- The layer decides which information from the input should be forwarded into the LSTM cell.
- By multiplying y(t−1) with the recurrent weight matrix Rz, the layer decides which information from the previous time step should be forwarded into the LSTM cell.
input gate
- The input gate controls write accesses to memory cells.
- This is realized by using the result of the squashing function σ as a factor which will later be multiplied by the squashed cell input z(t).
- The range of the function σ is (0,1) and can be interpreted as write access denied in case of 0 and write access granted in case of 1.
- Note that all values in between are also possible.
CEC
The CEC (constant error carousel) is the key part of the LSTM that stores and passes on long-term information through the cell state, acting like the network's "memory". Controlled by the forget, input and output gates, it can update information selectively and avoids the vanishing-gradient problem, so it can capture long-term dependencies effectively.
- Each memory cell contains a node with a self-connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing; this is called the CEC (constant error carousel).
- The CEC solves the vanishing error problem. In the absence of new input or error signals to the cell, the CEC local error back-flow remains constant, neither growing nor decaying.
output gate
- The output gate controls read accesses from memory cells.
- This is realized by using the result of the squashing function σ as a factor which will later be multiplied by the squashed cell content h(c(t)).
- The range of the function σ is (0,1) and can be interpreted as read access denied in case of 0 and read access granted in case of 1.
- Note that all values in between are also possible.
output
The value stored in the cell c(t) is squashed by function h, and whether this information gets output is decided by the output gate via o(t).
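A minimal single-cell LSTM step in NumPy, assuming the common input/forget/output-gate formulation that matches the gates described above (the lecture's block/gate notation is simplified to one cell, and biases are omitted); all weights and names are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, y_prev, c_prev, Wz, Wi, Wf, Wo, Rz, Ri, Rf, Ro):
    """One LSTM time step for a single memory cell."""
    z = np.tanh(Wz @ x + Rz @ y_prev)          # squashed cell input
    i = sigmoid(Wi @ x + Ri @ y_prev)          # input gate: write access
    f = sigmoid(Wf @ x + Rf @ y_prev)          # forget gate: resets the CEC
    o = sigmoid(Wo @ x + Ro @ y_prev)          # output gate: read access
    c = f * c_prev + i * z                     # cell state (CEC) update
    y = o * np.tanh(c)                         # squashed cell content, gated output
    return y, c

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(1, 2)) for _ in range(4)]   # input weights (1 cell, 2 inputs)
Rs = [rng.normal(size=(1, 1)) for _ in range(4)]   # recurrent weights
y, c = np.zeros(1), np.zeros(1)
for x in [np.array([1.0, 0.0]), np.array([0.5, 0.5])]:
    y, c = lstm_step(x, y, c, *Ws, *Rs)
    print(y, c)
```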
Training Process
Forward Pass
The cell state c is updated based on its current state and 3 inputs: a_z, a_in, a_out.
Backward Pass
- Errors arriving at cell outputs are propagated to the CEC.
- Errors can stay for a long time inside the CEC.
- This ensures non-decaying error.
- It can bridge time lags between input events and target signals.
pros& cons
Advantages
Non-decaying error backpropagation.
For long time lag problems, LSTM can handle noise and continuous values.
No parameter fine tuning.
Memory for long time periods
Disadvantage
the cell states c(t) often tend to grow linearly during the presentation of a time series.
- LSTM allows information to be stored across arbitrary time lags and error signals to be carried far back in time.
- This potential strength, however, can contribute to a weakness: the cell states c(t) often tend to grow linearly during the presentation of a time series.
- If we present a continuous input stream, the cell state may grow in an unbounded fashion, causing saturation of the output squashing function h(c(t)).
Lec11 INTRODUCTION TO PRINCIPAL COMPONENT ANALYSIS (PCA)**
Eigenvalues and Eigenvectors
In the context of matrices, eigenvalues and eigenvectors are closely tied to the behaviour of the matrix. For a square matrix A they are defined as follows:
An eigenvector is a non-zero vector v whose direction is unchanged under the transformation A (it is only stretched or compressed, but its direction does not change). In other words, when the matrix acts on the vector, the result is still a multiple of that vector: Av = λv.
An eigenvalue is the scalar λ associated with an eigenvector; it is the factor by which the vector is stretched or compressed when the matrix acts on it.
Diagonal Decomposition
The point of eigendecomposition is to break the matrix's complicated action into simple geometric transformations (a change of basis, a scaling, and the inverse change of basis). It reveals the core properties of the matrix and provides a powerful tool for practical computation and applications; many complicated matrix problems become analysable and computable through eigendecomposition.
Symmetric Diagonal Decomposition
Singular Value Decomposition
SVD is a very important matrix decomposition in linear algebra. It factorizes a matrix into the product of three matrices and plays a key role in many applications, such as signal processing, image compression, recommender systems and principal component analysis (PCA).
Dimensionality Reduction
One approach to deal with high dimensional data is by reducing their dimensionality. Project high dimensional data onto a lower dimensional subspace using linear or non-linear transformations.
Find a basis in a low dimensional sub-space
Principal Component Analysis (PCA)
The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variation present in the dataset.
Motivation
Find bases which have high variance in the data
Encode the data with a small number of bases with low MSE
How do we find them?
- The first PC is the direction of maximum variance.
- Subsequent PCs are orthogonal to the 1st PC and describe the maximum residual variance.
step
(My own supplementary understanding; not in the slides.)
1. Prepare the input data, e.g., mean-centre it.
2. Compute the covariance matrix.
3. The core step of PCA is the eigendecomposition of the covariance matrix C:
C = QΛQ^T
Λ is a diagonal matrix whose diagonal entries are the eigenvalues of the covariance matrix, sorted from largest to smallest.
Q is the orthogonal matrix of eigenvectors (its columns are orthonormal eigenvectors).
4. Select the principal components.
Select the eigenvectors corresponding to the k largest eigenvalues to form a new matrix Q_k:
Q_k = [v1 v2 ⋯ vk]
We choose k principal components so that the variance along these directions is maximal, while the information in the other directions is discarded.
5. Transform the data into the principal-component space.
Project the original data onto the selected principal directions to obtain the reduced data matrix:
Z = X′Q_k
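A minimal sketch of the steps above using NumPy's eigendecomposition; the variable names mirror the notation, and the random data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Steps 1-5 above: centre, covariance, eigendecompose, pick top-k, project."""
    Xc = X - X.mean(axis=0)                    # 1. mean-centre the data
    C = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    eigvals, Q = np.linalg.eigh(C)             # 3. eigendecomposition (C is symmetric)
    order = np.argsort(eigvals)[::-1]          #    sort eigenvalues, largest first
    Qk = Q[:, order[:k]]                       # 4. top-k eigenvectors as columns
    return Xc @ Qk, eigvals[order]             # 5. projected data Z and sorted eigenvalues

X = np.random.default_rng(0).normal(size=(100, 3))
Z, eigvals = pca(X, k=2)
print(Z.shape, eigvals)                        # (100, 2) and the 3 eigenvalues in decreasing order
```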
pro&con
Advantage
Reduce the dimension of the original data
Discard some information of the original data
Limitation
Discard some information of the original data
The meaning of the principal component
Linear model of PCA
Assume first PC has higher importance
Lec12 UNSUPERVISED LEARNING
Hebbian Learning
When a neuron repeatedly excites another neuron, the threshold of the latter neuron is decreased, or the synaptic weight between the neurons is increased, in effect increasing the likelihood of the second neuron firing.
There is no desired or target signal required in the Hebb rule, hence it is unsupervised learning
Oja's Rule
Oja's rule keeps the essence of Hebbian learning: the weights are adjusted according to the correlation between input and output.
It limits unbounded weight growth by normalizing the weights so that their total energy (Euclidean norm) stays bounded.
Oja’s rule creates a principal component in the input space as the weight vector when applied to a single neuron, similar to PCA
Under plain Hebbian learning the weights increase without bound: if the initial weight is negative it keeps growing in the negative range, and if it is positive it keeps growing in the positive range.
Hebbian learning is intrinsically unstable, unlike error-correction learning with BP algorithm
The weights need to be normalized to one, as
Element-wise expansion:
Steps of the deflation procedure for computing the other eigenvectors:
First compute the projection of the first eigenvector onto the input.
Then subtract this projection to generate a new input.
Apply Oja's Rule to the new data x̂ to extract the next eigenvector.
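A minimal sketch of Oja's rule for a single linear neuron, assuming the standard form Δw = η·y·(x − y·w); as stated above, the weight vector converges towards the first principal direction. The data are illustrative.

```python
import numpy as np

def oja_train(X, lr=0.01, epochs=200, seed=0):
    """Oja's rule: Hebbian term y*x plus the -y^2*w decay that keeps ||w|| bounded."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(epochs):
        for x in X:
            y = w @ x                          # neuron output
            w += lr * y * (x - y * w)          # Hebbian growth with built-in normalization
    return w

# Data stretched along the direction (1, 1): w should align with it (up to sign)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1)) * np.array([1.0, 1.0]) + 0.1 * rng.normal(size=(200, 2))
print(oja_train(X))
```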
Auto-encoders
The backpropagation algorithm can be used for unsupervised learning to discover important features that characterize the input patterns.
It learns an efficient representation of the input data using a neural network with a central bottleneck layer. This is usually implemented as an autoencoder. The core idea is that the bottleneck forces the network to learn the essential structure of the input data rather than simply memorizing it.
Auto-encoders are an unsupervised learning model used mainly for feature learning and data compression. Their goal is to extract useful features by compressing and reconstructing the input data, enabling feature extraction, dimensionality reduction, denoising, etc.
An autoencoder is a feedforward neural network composed of an encoder and a decoder. Its main task is to compress the input data into a low-dimensional representation (usually called the latent space) and then reconstruct the original input as accurately as possible (predict the input itself in the output). Structurally it is like a self-supervised network in which the output equals the input, with no explicit labels guiding the learning.
Encoder: maps the input data through a series of layers into the latent (low-dimensional) space.
Decoder: maps the latent representation back to an output with the same dimensionality as the input.
U-Net is an autoencoder variant!
An Autoencoder is a feedforward neural network that learns to predict the input itself in the output.
y(i) = x(i)
- The input-to-hidden part corresponds to an encoder.
- The hidden-to-output part corresponds to a decoder.
Auto-encoders (Rumelhart 86)
To reproduce the input patterns at output layer
Number of hidden layers and the sizes of the layers can vary
The auto-encoder tends to find a data description which resembles PCA, while the small number of neurons in the bottleneck layer of the diabolo network acts as an information compressor (code).
Deep Auto-encoder(Hinton 06)
A deep auto-encoder is constructed by extending the encoder and decoder of autoencoder with multiple hidden layers.
Denoising Auto-encoder (Vincent 08)
By adding stochastic noise, it can force auto-encoder to learn more robust features.
Auto-encoders Network
- The network tries to reproduce the input in the output, inducing a short encoding in the hidden layer.
- This encoding retains the maximum amount of information about the input in a smaller dimensional space such that the input can be reconstructed.
- Auto-encoder networks can be used for dimensionality reduction, compression, etc.
Clustering
Cluster analysis organizes data by abstracting the underlying structure either as a grouping of individuals, or as a hierarchy of groups.
These groupings are based on measured or perceived similarities among the patterns.
Applications: stand-alone tool, preprocessing step
K-means*
The k-means algorithm partitions the data into k mutually exclusive clusters
Objective: minimize the sum of squared distance to its “representative object” in each cluster
steps
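A minimal sketch of the k-means steps (assign each point to the nearest centre, then recompute the centres), assuming random initial centres drawn from the data; this is my own sketch, not the slides' pseudocode.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # initialize centres from the data
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # assign to the nearest centre
        for j in range(k):                                    # recompute each centre
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(3, 0.3, (50, 2))])
print(kmeans(X, k=2)[0])                                      # two centres near (0,0) and (3,3)
```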
Competitive Learning*
Competitive learning is an unsupervised learning method. Its core idea is that several neurons (nodes) in the network learn the features of the input data by competing with one another. Each neuron adjusts itself to the input, but only the neuron that best represents the current input (the winner, hence winner-takes-all (WTA)) is activated and updates its weights. The method is mainly used for self-organizing feature learning in neural networks, commonly for clustering, pattern recognition and dimensionality reduction.
Competitive learning means that only a single neuron from each group fires at each time step
Output units compete with one another.
These are winner-takes-all (WTA) units
Competition is important for neural networks
Competition between neurons has been observed in biological nerve systems
Competition is important in solving many problems
- Units (active or inactive) are represented in the diagram as dots
  - active units are represented by filled dots
  - inactive ones by open dots
- A unit in a given layer can
  - receive inputs from all of the units in the next lower layer
  - project outputs to all of the units in the next higher layer
Winner-takes-all (WTA)
an external, central arbitrator (a program) to decide the winner by comparing the current outputs of the competitors (break the tie arbitrarily)
Connections between layers are excitatory
Connections within layers are inhibitory
- each layer consists of a set of clusters of mutually inhibitory units
- the units within a cluster inhibit one another in such a way that only one unit per cluster may be active, and all others will lose
Simple Competitive Learning
So why is this taught together with k-means? As the great GPT teacher also puts it:
Unsupervised Competitive Learning is related to K-means clustering in the sense that both methods involve clustering and rely on the concept of a "winner" being selected based on a competitive process.
Each output unit moves to the center of mass of a cluster of input vectors → clustering!
attention! The clusters formed are highly sensitive to the initial weight vectors.
Process
Intuitively clear, plausible procedure
• places prototypes in areas with high density of data
• identifies the most relevant combinations of features
Enforcing fairer competition
- The initial position of the weight vector of an output unit may be in a region with few (if any) patterns.
- Some units may never or rarely become a winner, so their weight vector may never be updated, preventing them from finding a richer part of the pattern space → DEAD UNIT
- It is more efficient to ensure a fairer competition, where each unit has an equal chance of representing some part of the training data.
Leaky learning
Modify weights of both winning and losing units but at different learning rates
Maxnet
Lateral inhibition
A form of inhibition between neurons in which an excited neuron reduces the activity of its neighbours.
output of each node feeds to others through inhibitory connections (with negative weights)
Lateral inhibition between competitors
Mexican Hat Network
We mainly deal with single-winner WTA, but multiple-winner WTA is possible
- The lateral connections produce excitatory or inhibitory effects, depending on the distance from the winning neuron
- This is achieved by the use of a Mexican Hat function, which describes the synaptic weights between neurons in the output layer.
Architecture
For a given node,
- close neighbors: cooperative (mutually excitatory, w > 0)
- distant neighbors: competitive (mutually inhibitory, w < 0)
- too far away neighbors: irrelevant (w = 0)
Need a definition of distance (neighborhood):
- one dimensional: ordering by index (1,2,…n)
- two dimensional: lattice
e.g
Vector Quantization
This probably won't be examined; it is not in the revision slides.
K-means can be used to construct the codebook
Lec13 SELF-ORGANIZING FEATURE MAP**
Topographic Maps
Topographic maps extend the idea of competitive learning by adding a neighborhood relation between the inputs and the neurons.
The goal of a topographic map is to transform the input pattern space non-linearly into the output feature space while preserving the neighborhood relations between the inputs.
The core idea of topographic maps is the feature map: neurons in the output space should respond to similar inputs.
Neurons selectively tune their responses to particular input patterns, so that the relative relations between neurons become ordered. This ordering means that different input features form a meaningful coordinate system in the output space.
Self-Organizing Map (SOM)
The SOM is an unsupervised learning algorithm, usually used for clustering and dimensionality reduction. Its main purpose is to adaptively map high-dimensional data onto a low-dimensional (usually two-dimensional) grid while preserving the topological structure of the data. It is a neural network model based on competitive learning.
Kohonen’s self-organizing map (SOM) algorithm
- more general, as it can perform dimensionality reduction
- SOM can be viewed as a vector quantization type algorithm
The idea in an SOM is to transform an input of arbitrary dimension into a 1- or 2-dimensional discrete map
Property
- Approximate the input space
- Topological ordering
- Density matching
- Feature selection (features of the underlying distribution)
process
Initialization
Grid: size and structure fixed a priori (most of the times, 2-dimensional grid are used)
Competition
Given an input pattern, outputs compete to see who is winner based on a discriminant function (e.g. similarity of input vector and weight vector) A continuous input space of activation patterns is mapped onto a discrete output space of neurons by a process of competition among the neurons in the network.
Cooperation
Winning neuron determines spatial location of a topological neighborhood within which output neurons excited
Neighborhood
A large neighborhood is good for global ordering; a small one is good for local refinement.
- Ordering phase (large neighborhood)
- Convergence phase (small neighborhood)
Synaptic Adaptation
Excited neurons adapt their weights so that the value of the discriminant function increases (a similar input would then result in an enhanced response from the winner).
The topological neighborhood h_{j,i} is
symmetric around the winning neuron and achieves its maximum value at the winning neuron;
the amplitude decreases monotonically with increasing lateral distance.
Neighbors of the winning node are also allowed to update, even if they are not close to winning! (See the sketch below.)
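A minimal sketch of one SOM adaptation step on a 2-D grid, assuming a Gaussian neighbourhood h_{j,i} around the winner as described above; the grid size, learning rate and width are illustrative.

```python
import numpy as np

def som_step(W, grid, x, lr=0.5, sigma=1.0):
    """One step: find the winner, then pull it and its grid neighbours towards x."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))      # competition: best-matching unit
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)        # lateral distance on the grid
    h = np.exp(-d2 / (2.0 * sigma ** 2))                   # Gaussian neighbourhood h_{j,i}
    return W + lr * h[:, None] * (x - W)                   # cooperation + synaptic adaptation

# 3x3 output grid, 2-D inputs
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
W = np.random.default_rng(0).random((9, 2))
for x in np.random.default_rng(1).random((100, 2)):
    W = som_step(W, grid, x)
print(W.round(2))
```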
Learning Rate
Vector Quantization
In VQ techniques, a number of local Voronoi centers are formed to represent the input vectors.
- For a set of M reference vectors, {w1,…,wM}, an input vector x is considered best matched by wk in the sense that an appropriately defined distortion measure, such as the squared Euclidean distance ||x-wk||^2, is minimal.
- The reference vectors partition the input space R^L into the Voronoi cells/polygons defined as
Learning Vector Quantizer
LVQ is a supervised learning technique that uses class information to move the Voronoi vectors slightly, so as to improve the quality of the classifier decision regions.
Vector Quantization is a technique that partitions the input space into Voronoi regions, each represented by a Voronoi center.
The core of VQ is:
- determining the positions of the Voronoi centers;
- assigning and quantizing input vectors using the nearest-neighbor rule.
Stopping criteria
- Codebook vectors have stabilized
- Or the maximum number of epochs has been reached
Advantages
- Appears plausible, intuitive, flexible
- Fast and easy to implement
- Frequently applied in a variety of problems involving the classification of structured data
Disadvantages
- Not stable for overlapping classes
- Very sensitive to initialization
It looks like clustering, but in fact LVQ is a supervised learning technique, whereas clustering algorithms are unsupervised.
LVQ1
LVQ1 is a relatively simple supervised method: it updates only the closest prototype vector, based on whether its class label agrees with that of the input vector.
- An input vector x is picked at random from the input space.
- If the class labels of the input vector x and a Voronoi vector w agree, the Voronoi vector w is moved in the direction of the input vector x.
- If the class labels of the input vector x and the Voronoi vector w disagree, the Voronoi vector w is moved away from the input vector x.
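A minimal sketch of one LVQ1 update, assuming labelled prototypes and the move-towards/move-away rule above; names and data are illustrative.

```python
import numpy as np

def lvq1_step(prototypes, proto_labels, x, x_label, lr=0.1):
    """Move the closest prototype towards x if the labels agree, away if they disagree."""
    k = np.argmin(np.linalg.norm(prototypes - x, axis=1))   # closest Voronoi vector
    direction = 1.0 if proto_labels[k] == x_label else -1.0
    prototypes[k] += direction * lr * (x - prototypes[k])
    return prototypes

prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
proto_labels = np.array([0, 1])
print(lvq1_step(prototypes, proto_labels, np.array([0.2, 0.1]), x_label=1))
```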
LVQ2
LVQ2 refines this process further: it updates not only the prototype of the correct class but also the closest prototype of a wrong class, making the updates finer and more precise; this is particularly advantageous for multi-class and complex classification tasks.
• Initialize prototype vectors (for different classes)
• Present a single example
• Identify closest correct and closest wrong prototypes
• Move the corresponding winner towards / away from the example
Lec14 HOPFIELD NETWORK**
ASSOCIATIVE MEMORIES
An associative memory is a content-addressable structure that maps a set of input patterns to a set of output patterns.
Associative memory is often linked to pattern association
It is a way of storing information as key-value pairs: the key is the cue used for retrieval, and the value is the associated output.
classification
Auto-association (closely resembles)
- The input and output are the same.
- The goal is to reconstruct the complete memory from a partial or corrupted input.
- Example: the Hopfield network, used to correct a damaged image or signal.
Hetero-association (different)
- The input and output are different.
- Used to map one pattern or signal to another, related pattern.
- Example: associating a particular input signal (such as a word) with its corresponding meaning or translation.
Simple AM
Network structure: single layer
- one output layer of non-linear units and one input layer
- similar to the simple network for classification
Goal of learning:
- to obtain a set of weights w_ij
- from a set of training pattern pairs {s : t}
- such that when s is applied to the input layer, t is computed at the output layer
Similar to Hebbian learning for classification
Algorithm: (bipolar or binary patterns)
For each training sample s:t, Δw_ij = s_i * t_j
Δw_ij increases if both input and output are ON (binary) or have the same sign (bipolar)
Instead of obtaining W by iterative updates, it can be computed from the training set by calculating the outer product of s and t (see the sketch below).
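A minimal sketch of building W as the sum of outer products s tᵀ over the training pairs, as stated above, and recalling with a sign threshold; the bipolar patterns are illustrative.

```python
import numpy as np

def train_am(pairs):
    """W = sum over training pairs of the outer product s t^T (hetero-associative memory)."""
    return sum(np.outer(s, t) for s, t in pairs)

def recall(W, s):
    """Apply the stored weights and threshold to bipolar outputs."""
    return np.sign(s @ W)

s1, t1 = np.array([1, -1, 1]), np.array([1, -1])
s2, t2 = np.array([-1, 1, 1]), np.array([-1, 1])
W = train_am([(s1, t1), (s2, t2)])
print(recall(W, s1), recall(W, s2))     # recovers t1 and t2
```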
HOPFIELD NETWORK
The Hopfield network is used to implement auto-associative memory and to solve optimization problems.
Its goal is to store a set of patterns and, given a partial or corrupted pattern, to recover the original pattern through the network's dynamical evolution.
Structure:
- The network consists of a set of neurons (nodes), each connected to all other neurons (fully connected).
- There are no self-connections, i.e., no neuron connects to itself.
State updates:
- Each neuron's state is binary, usually represented as +1 or −1 (or as 0 and 1).
- The network state is updated iteratively until it reaches a stable state (a state of minimal energy).
Discrete Hopfield Network
- The stable state is determined by the weight matrix W, the current input vector X, and the threshold matrix θ. If the input vector is partially incorrect or incomplete, the initial state will converge to the stable state after a few iterations. (A recall sketch follows below.)
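A minimal sketch of Hopfield storage (Hebbian outer products with zero diagonal) and asynchronous recall, assuming bipolar ±1 states and zero thresholds; the stored pattern and the flipped bit are illustrative.

```python
import numpy as np

def hopfield_train(patterns):
    """W = sum of outer products of the stored bipolar patterns, with no self-connections."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)
    return W

def hopfield_recall(W, x, iters=5):
    """Asynchronous updates: each neuron takes the sign of its weighted input."""
    x = x.copy()
    for _ in range(iters):
        for i in range(len(x)):
            x[i] = 1 if W[i] @ x >= 0 else -1
    return x

p = np.array([[1, -1, 1, -1, 1, -1]])
W = hopfield_train(p)
noisy = np.array([1, 1, 1, -1, 1, -1])          # one bit flipped
print(hopfield_recall(W, noisy))                # converges back to the stored pattern
```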
My friend and I have double-checked: the calculation in the slides really is wrong, so stay calm and don't doubt yourselves; just grumble at the lecturer and move on.