
CS231n Computer Vision

slides: http://cs231n.stanford.edu/slides/2017/

Lecture 2 —— Image Classification

Nearest Neighbor Classifier

Find the most similar example in the training set.

  • Memorize all data and labels
  • Predict the label of the most similar training image

Time complexity: train O(1), predict O(N). This is the wrong way around: we can afford slow training, but we want prediction to be fast at test time.

Distance Metric to compare images

L1 distance (Manhattan distance)

Use vectorized numpy operations to compute it. The L1 distance depends on the choice of coordinate axes: if the coordinate frame is changed, the distances have to be recomputed.

\[d_1(I_1, I_2) = \sum_p \mid I_1^p - I_2^p \mid\]

L2 distance (Euclidean distance)

\[d_2(I_1, I_2) = \sqrt{\sum _p( I_1^p - I_2^p)^2}\]
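
As a minimal sketch (the array names are my own, not from the course code), both distances can be computed with vectorized numpy operations on flattened images:

import numpy as np

# two hypothetical flattened images, e.g. 32*32*3 = 3072 values each
I1 = np.random.rand(3072)
I2 = np.random.rand(3072)

d1 = np.sum(np.abs(I1 - I2))          # L1 (Manhattan) distance
d2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2 (Euclidean) distance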

K-Nearest Neighbors Algorithm —— KNN

Instead of copying label from nearest neighbor, take majority vote from K closest points.

Benefits: smooths out decision boundaries, is more robust to outliers, and generally gives better results.
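
A minimal sketch of the prediction step (my own simplified version, not the assignment code), using the L1 distance and a majority vote:

import numpy as np

def knn_predict(Xtr, ytr, x_test, k=5):
    # Xtr: (N, D) training images, ytr: (N,) integer labels, x_test: (D,) test image
    dists = np.sum(np.abs(Xtr - x_test), axis=1)  # L1 distance to every training image
    nearest = np.argsort(dists)[:k]               # indices of the k closest images
    return np.bincount(ytr[nearest]).argmax()     # majority vote over their labels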

Hyperparameter

A choice about the training algorithm that cannot be learned directly from the data and must be set before training.

Cross Validation

Suitable for small datasets.

Benefits: more robust, and gives a more trustworthy estimate of how well each hyperparameter setting generalizes.

k-fold means splitting the training set into k folds and training k rounds, each round holding out a different fold as the validation set.
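
A rough k-fold cross-validation sketch for choosing k in KNN (it reuses the hypothetical knn_predict above; real code would be vectorized or use a library routine):

import numpy as np

def cross_validate(Xtr, ytr, candidate_k, num_folds=5):
    X_folds = np.array_split(Xtr, num_folds)
    y_folds = np.array_split(ytr, num_folds)
    accuracies = []
    for i in range(num_folds):
        # fold i is the validation set, the remaining folds form the training set
        X_val, y_val = X_folds[i], y_folds[i]
        X_train = np.concatenate(X_folds[:i] + X_folds[i+1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i+1:])
        preds = np.array([knn_predict(X_train, y_train, x, k=candidate_k) for x in X_val])
        accuracies.append(np.mean(preds == y_val))
    return np.mean(accuracies)  # average validation accuracy over the k rounds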

Linear Classifier

$y = f(x, W)$; see the next lecture

Softmax Classifier (Multinomial Logistic Regression)

Loss function: $L_i = -\log(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}})$; see the next lecture

Lecture 3 —— Loss Function and Optimization

Loss Function

A loss function quantifies how good our current classifier is:

  • Given a dataset of examples $\lbrace(x_i, y_i)\rbrace^N_{i=1}$, where $x_i$ is an image and $y_i$ is an (integer) label
  • The loss over the dataset is the average of the per-example losses:
\[L(W) = \frac{1}{N}\sum_{i=1}^N L_i(f(x_i, W), y_i)\]

Multiclass SVM loss

The ordinary SVM handles two classes (labels 0 and 1); here it is extended to n classes. The intuition: to predict a sample's label, use the trained weights to compute the sample's score for every class. If the correct class's score is larger than every other class's score (usually by a safety margin, i.e. it has to be larger by enough), the loss does not increase; otherwise the loss increases by the amount by which the other class's score (plus the margin) exceeds the correct class's score.

Given an example ($x_i$, $y_i$), where $x_i$ is the image and $y_i$ is the (integer) label, and using the shorthand for the scores vector:

$s = f(x_i, W)$, and we write the safety margin as $\alpha$. The SVM loss has the form:

\[L_i = \sum_{j\neq y_i} \begin{cases} 0, & \text{if $s_{y_i}\geq s_j+\alpha$} \\ s_j-s_{y_i}+\alpha, & \text{otherwise} \end{cases} = \sum_{j\neq y_i} \max(0, s_j - s_{y_i}+\alpha)\]

Hinge loss plot:

  • The range of the loss $L_i$ is $[0, +\infty)$

  • If we jiggle the data a little bit (change the scores slightly), $L_i$ often does not change at all

  • At the start of training the scores $s$ are all small and close to 0, so we expect $L_i \approx \alpha \times (\text{number of classes} - 1)$, a useful sanity check

Squared hinge loss: $L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i}+\alpha)^2$; it penalizes very bad cases much more heavily, so choose it depending on the situation.

numpy example code for $L_i =\sum_{j\neq y_i} \max(0, s_j - s_{y_i}+\alpha)$, with $\alpha = 1$:

import numpy as np

def L_i_vectorized(x, y, W):
    # unvectorized over examples, vectorized over classes
    scores = W.dot(x)                                # class scores for one example x
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge loss with margin alpha = 1
    margins[y] = 0                                   # do not count the correct class
    loss_i = np.sum(margins)
    return loss_i

Regularization

Occam’s Razor: among competing hypotheses, the simplest is the best.

Many different weight matrices W achieve the same loss; regularization expresses a preference among them. To keep the model from overfitting (fitting an overly wiggly, high-degree curve), we add a regularization term $\lambda R(W)$, which can be read as a regularization penalty: it pushes the model toward simpler hypotheses, so if the model wants a higher-degree expression it pays a higher penalty.

Another way to think about it: when fitting data with a polynomial, there are two ways to suppress overfitting. One is to directly cap the degree of the polynomial; the other is to leave the degree unrestricted but add a degree-dependent term to the loss function, which biases the algorithm toward low-degree polynomials. Regularization is the latter, and it helps combat overfitting: the model should be “simple”, so it works on test data.

Full loss function:

\[L(W) = \frac{1}{N}\sum_{i=1}^N L_i(f(x_i, W), y_i)+\lambda R(W)\]
  • Data loss: $L_i(f(x_i, W), y_i)$
  • Regularization loss: $\lambda R(W)$
  • $\lambda$: regularization strength (hyper parameter)

In common use

L2 Regularization: $R(W) = \sum_k\sum_lW^2_{k,l}$, the squared Euclidean norm of the weight matrix $W$; it is often multiplied by $\frac{1}{2}$ so the derivative is simpler

L1 Regularization: $R(W)=\sum_k\sum_l\mid W_{k, l} \mid$, which encourages sparsity

Elastic net (L1 + L2): $R(W) = \sum_k\sum_l\left(\beta W^2_{k,l} + \mid W_{k,l} \mid\right)$
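
As a quick numpy illustration (a sketch with a made-up weight matrix, not the course code), the three penalties above are one-liners:

import numpy as np

W = 0.01 * np.random.randn(10, 3073)            # hypothetical weight matrix
beta = 0.5                                      # hypothetical elastic-net weighting

R_l2 = np.sum(W ** 2)                           # L2: sum of squared weights
R_l1 = np.sum(np.abs(W))                        # L1: sum of absolute weights, encourages sparsity
R_elastic = np.sum(beta * W ** 2 + np.abs(W))   # elastic net combines both

# full objective: data loss + lambda * R(W)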

Softmax Loss

Common in deep learning

scores: unnormalized log probabilities of the classes

Softmax function: $\frac{e^{s_k}}{\sum_je^{s_j}}$; the exponential $e^x$ guarantees positive values, which are then normalized:

\[P(Y=k \mid X=x_i)=\frac{e^{s_k}}{\sum_je^{s_j}}, \quad \text{where}\quad s = f(x_i, W)\]

so we obtain the loss function (via maximum likelihood estimation):

\[L_i = -\log P(Y=y_i\mid X=x_i)\]

in summary:

\[L_i = -\log(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}})\]

Range of the loss: $[0, +\infty)$

In practice the loss is never exactly 0, because that would require the softmax probability of the correct class to be exactly 1, i.e. the correct class score would have to be infinitely larger than all the incorrect scores. Likewise, the correct class’s softmax probability never reaches 0, since that would require its score to be $-\infty$; finite-precision arithmetic cannot represent such infinities.
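
A minimal numpy sketch of the softmax loss for a single example; shifting the scores by their maximum before exponentiating is a standard numerical-stability trick, not something the notes above require:

import numpy as np

def softmax_loss_i(scores, y_i):
    shifted = scores - np.max(scores)                 # avoid overflow in exp()
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y_i])                        # L_i = -log P(Y = y_i | X = x_i)

scores = np.array([3.2, 5.1, -1.7])                   # hypothetical class scores
loss = softmax_loss_i(scores, y_i=0)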

Difference between SVM and Softmax

The SVM loss only cares about getting the correct class score to exceed the incorrect scores by the margin; once that holds, it is satisfied. The softmax loss, in contrast, always wants to drive the probability of the correct class toward 1: even when the correct class already has a high score and the incorrect classes low scores, softmax keeps piling probability mass onto the correct class, pushing the correct score toward $+\infty$ and the incorrect scores toward $-\infty$.

Optimization

A first strategy, random search over the weights, is essentially useless.

Follow the slope

In 1-dimension, the derivative of a function:

\[\frac{df(x)}{dx} = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}\]

In multiple dimensions, the gradient(梯度) is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of the direction with the gradient.

The direction of steepest descent is the negative gradient.

Numerical gradient:

Evaluate the limit definition above with finite differences.

approximate, slow when the parameter vector becomes large (as in deep learning), but easy to write

Analytic gradient:

Derive the expression for the partial derivatives with calculus ahead of time; derived once, exact, fast, but error-prone (!!!)

In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check
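
A sketch of such a gradient check, under my own simplifying assumptions: f is any scalar loss of the weights, analytic_grad is whatever backprop produced, and a few random coordinates are probed with centered finite differences:

import numpy as np

def grad_check(f, W, analytic_grad, h=1e-5, num_checks=10):
    for _ in range(num_checks):
        idx = tuple(np.random.randint(d) for d in W.shape)   # random coordinate of W
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                                         # restore the weight
        numeric = (f_plus - f_minus) / (2 * h)               # centered difference
        analytic = analytic_grad[idx]
        rel_error = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
        print(idx, numeric, analytic, rel_error)             # rel_error should be tiny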

Gradient Descent

# Vanilla Gradient Descent
while True:
	weights_grad = evaluate_gradient(loss_fun, data, weights)
	weights += -step_size * weights_grad  # perform parameter update

The gradient points in the direction of steepest ascent, so we take the minus sign and step in the opposite direction.

step_size: hyperparameter, often called the learning rate

If the dataset is too large, we estimate the loss on a sampled subset at each step (a Monte Carlo estimate); minibatches of 32 / 64 / 128 examples are common.

# Vanilla Minibatch Gradient Descent
while True:
	data_batch = sample_training_data(data, 256)  # sample 256 examples
	weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
	weights += -step_size * weights_grad  # perform parameter update

Feature Transform

Bag of Words

# quicksort on the global list arr over the index range [l, r]
def kp(l, r):
    if l >= r:
        return
    temp = arr[l]                        # pivot value
    L, R = l, r
    while L < R:
        while L < R and arr[R] >= temp:
            R -= 1
        arr[L] = arr[R]                  # fill the hole on the left
        while L < R and arr[L] <= temp:
            L += 1
        arr[R] = arr[L]                  # fill the hole on the right
    arr[L] = temp                        # place the pivot
    kp(l, L - 1)
    kp(L + 1, r)

Lecture 4 —— Backpropagation and Neural Networks

A way to compute gradients, using the chain rule.

\[\frac{\partial f}{\partial x} = \sum_i\frac{\partial f}{\partial q_i}\times \frac{\partial q_i}{\partial x}\]
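
As a tiny worked example of the chain rule on the lecture's toy graph $f(x, y, z) = (x + y)z$: cache the intermediate $q = x + y$ on the forward pass, then multiply local gradients backwards:

# forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass (chain rule)
df_dz = q            # df/dz = q = 3
df_dq = z            # df/dq = z = -4
df_dx = df_dq * 1.0  # df/dx = df/dq * dq/dx = -4
df_dy = df_dq * 1.0  # df/dy = df/dq * dq/dy = -4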

Jacobian Matrix

Neural Networks

Linear score function: $f = Wx$

2-layer neural network: $f=W_2\max(0, W_1x)$, where $h=\max(0, W_1x)$ is the hidden layer

or a 3-layer neural network: $f=W_3\max(0, W_2\max(0, W_1x))$

The max function makes the model nonlinear, so it can fit more complex functions.
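
A minimal forward-pass sketch of the 2-layer network above (the dimensions are placeholders of my choosing):

import numpy as np

D, H, C = 3072, 100, 10              # input dim, hidden units, classes (hypothetical)
W1 = 0.01 * np.random.randn(H, D)
W2 = 0.01 * np.random.randn(C, H)

x = np.random.randn(D)               # one input example
h = np.maximum(0, W1.dot(x))         # hidden layer: h = max(0, W1 x)
f = W2.dot(h)                        # class scores: f = W2 max(0, W1 x)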

Lecture 5 —— Convolutional Neural Networks

Related blog post

A $1\times1$ convolution kernel is used to change the depth (number of channels).

Pooling Layer

Downsampling.

(1) Retains the main features while reducing the number of parameters and the amount of computation, which helps prevent overfitting.

(2) Provides invariance, including invariance to translation, rotation, and scale.

(3) Makes the representations smaller and more manageable.

(4) Operates over each activation map independently.

Pooling is generally not applied along the depth dimension.

For pooling layers, the window usually does not overlap as it slides (the stride is set accordingly), and zero padding is usually not used.

Common Setting:

  • F = 2×2, S = 2
  • F = 3×3, S = 2

Max Pooling

The window takes the maximum value within each block; it usually works better than average pooling, because for detecting features we mostly care about the most strongly activated unit.
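
A sketch of 2×2 max pooling with stride 2 on a single activation map, using a reshape trick (assumes both spatial dimensions are even; this is my own illustration, not framework code):

import numpy as np

def max_pool_2x2(a):
    # a: one activation map of shape (H, W) with H and W even
    H, W = a.shape
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))  # max over each 2x2 block

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))   # -> [[5 7] [13 15]]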

Fully Connected Layer (FC layer)

Contains neurons that connect to the entire input volume, as in ordinary neural networks.

Details of training Neural Networks

Mini-batch SGD

Loop

  1. Sample a batch of data
  2. Forward prop it through the graph (network), get the loss
  3. Backprop to calculate the gradients
  4. Update the parameters using the gradient

Activation Functions

Sigmoid

expression: $\frac{1}{1+e^{-x}}$, range (0, 1)

problem:

  • Saturated neurons kill the gradient: if the input x is very large or very small, almost no gradient flows back (vanishing gradient)
  • Not zero-centered: the gradients on all weights w then share the same sign (all positive or all negative), which makes updates inefficient
  • exp() is relatively expensive to compute

tanh

expression: $\tanh(x)$, range (-1, 1)

Zero-centered, but still suffers from vanishing gradients when it saturates.

ReLU —— Rectified Linear Unit

expression: $\max(0, x)$

Does not saturate in the positive region, is cheap to compute, and converges quickly.

problem:

  • Not zero-centered
  • The gradient is zero over half of the input range, so a neuron can die (dead ReLU) and never update again

Leaky ReLU

expression: $\max(0.01x, x)$, with a small fixed slope for $x<0$

PReLU —— Parametric Rectifier

expression: $\max(\alpha x, x)$, where the slope $\alpha$ is a learned parameter

ELU —— Exponential Linear Unit

expression: \[f(x) = \begin{cases} x, & \text{if $x\geq0$} \\ \alpha(e^x-1), & \text{otherwise} \end{cases}\]

Maxout

expression: $\max(w^T_1x+b_1, w^T_2x+b_2)$

  1. Good generalization performance
  2. Generalizes ReLU and Leaky ReLU
  3. Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters per neuron :(

In practice

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don’t expect much
  • Don’t use sigmoid

In the vision domain, usually only zero-mean preprocessing is applied.

If the weights are all initialized to 0, every neuron computes the same output and receives the same gradient, so they all stay identical.

First idea: Small random numbers

W = 0.01 * np.random.randn(D, H)

okay for small networks, but problematic for deeper networks: small weights give small activations, and multiplied across many layers the gradients become tiny, so the weights barely update

If the initial weights are too large the network saturates; if they are too small the activations collapse.

Xavier initialization

We want the variance of the output to equal the variance of the input.

See additional references for the derivation.
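
A sketch of the idea (the layer dimensions are placeholders): scale the random weights by $1/\sqrt{\text{fan\_in}}$ so that the output variance roughly matches the input variance:

import numpy as np

fan_in, fan_out = 3072, 100                             # hypothetical layer sizes
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)  # Xavier initialization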

Batch Normalization —— 批量归一化

Compute the empirical mean and variance independently for each dimension, and then normalize: \(\hat x^{(k)} = \frac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}\) Usually inserted after fully connected or convolutional layers, and before the nonlinearity.

And then allow the network to squash the range if it wants to: \(y^{(k)}=\gamma^{(k)} \hat x^{(k)}+\beta^{(k)}\)

Note: the network can learn $\gamma^{(k)} = \sqrt {Var[x^{(k)}]}$ and $\beta^{(k)}=E[x^{(k)}]$ to recover the identity mapping

Improves gradient flow through the network and makes training more robust: it works across a wider range of learning rates and initializations, and reduces saturation.

Unit-Gaussian normalization: zero-centered with standard deviation 1.
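
A sketch of the batch-norm forward pass at training time (gamma, beta, and eps are my own names for the learnable scale/shift and the small constant for numerical stability):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch; normalize each dimension independently
    mu = x.mean(axis=0)                    # empirical mean per dimension
    var = x.var(axis=0)                    # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)  # unit-Gaussian normalization
    return gamma * x_hat + beta            # learnable squash and shift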

Hyperparameter Optimization

  1. Run only a few epochs first to get a rough idea of which hyperparameters work
  2. Then run longer with a finer search

Random sampling and grid search (random sampling is usually preferred).
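
For example, the learning rate and regularization strength are usually sampled randomly on a log scale (a sketch; the ranges below are arbitrary choices):

import numpy as np

for _ in range(10):                        # try 10 random settings
    lr = 10 ** np.random.uniform(-6, -3)   # learning rate, log-uniform
    reg = 10 ** np.random.uniform(-5, 5)   # regularization strength, log-uniform
    # train for a few epochs with (lr, reg) and record validation accuracy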
