## 1. Linear Regression with Multiple Variables

n: number of features
m: number of training examples
$x^{(i)}$:

• the vector of features of the i-th training example
• i is an index into the training set
• so $x^{(i)}$ is an n-dimensional feature vector
• $x^{(3)}$, for example, is the 3rd training example

$x^{(i)}_j$: the value of feature j in the i-th training example

$$h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4$$

For convenience of notation, define $x_0 = 1$, so the final feature vector has n+1 entries, indexed from 0; call it $x$. Then

$$h_θ(x)=θ^Tx$$
$θ^T$: a [1 × (n+1)] row vector
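
As a quick NumPy sketch of this vectorized hypothesis (the data values below are made up for illustration), we prepend the $x_0 = 1$ column and compute $θ^Tx$ for every training example with one matrix-vector product:

```python
import numpy as np

# Hypothetical training data: m = 4 examples, n = 2 features, with targets y.
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [1.5, 2.0],
                  [2.0, 1.1]])
y = np.array([1.0, 0.8, 2.2, 1.9])
m, n = X_raw.shape

# Prepend the x_0 = 1 column so each row is an (n+1)-dimensional feature vector.
X = np.hstack([np.ones((m, 1)), X_raw])

theta = np.zeros(n + 1)   # θ_0 ... θ_n
h = X @ theta             # h_θ(x^{(i)}) for all m examples at once
```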

### Cost Function

$$J(θ_0, θ_1, …,θ_n) = \frac1{2m}\sum_{i=1}^{m}{(h_θ(x^{(i)}) - y^{(i)})^2}$$

Repeat {
$$θ_j = θ_j - α\frac{\partial}{\partial θ_j}J(θ_0, θ_1, …,θ_n)$$
}

On every iteration:

• $θ_j$ = $θ_j$ minus the learning rate (α) times the partial derivative of J(θ) with respect to $θ_j$
• We do this through a simultaneous update of every $θ_j$ value

$$\frac{\partial}{\partial θ_j}J(θ_0, θ_1, …,θ_n) = \frac1m\sum_{i=1}^{m}{(h_θ(x^{(i)}) - y^{(i)})\,x_j^{(i)}}$$
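
As a minimal sketch (NumPy; the names `alpha` and `num_iters` are illustrative, not from the notes), the cost function and the batch gradient descent loop above can be written as:

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(θ) = 1/(2m) · Σ (h_θ(x) - y)², with X already containing the x_0 = 1 column."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeatedly apply the simultaneous update of every θ_j, recording J(θ) per iteration."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) Σ (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} for every j
        theta = theta - alpha * gradient       # simultaneous update of all θ_j
        J_history.append(compute_cost(X, y, theta))
    return theta, J_history
```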

## 2. Gradient Descent in practice

### Learning Rate α

• working correctly: if gradient descent is working, J(θ) should decrease after every iteration
• convergence: convergence means that J(θ) changes very little from one iteration to the next
• choosing α
  1. When to use a smaller α
     • If J(θ) is increasing (see the picture below)
     • If J(θ) looks like a series of waves, decreasing and then increasing again
     • But if α is too small, convergence is too slow
  2. Try a range of α values (see the sketch after this list)
     • Plot J(θ) vs. the number of iterations for each value of α
     • Go for roughly threefold increases: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
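
A sketch of this procedure, assuming the hypothetical `X`, `y`, and `gradient_descent` from the earlier sketches and Matplotlib for the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]   # roughly threefold increases
num_iters = 100

for alpha in alphas:
    theta = np.zeros(X.shape[1])                 # restart from θ = 0 for each α
    _, J_history = gradient_descent(X, y, theta, alpha, num_iters)
    plt.plot(J_history, label=f"alpha = {alpha}")

plt.xlabel("number of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```

A curve that decreases smoothly and flattens out suggests that value of α is working; one that blows up or oscillates suggests α is too large.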

## 3. Solving multivariate linear regression with the Normal equation

### 3.1 Normal equation

$$J(θ_0, θ_1, …,θ_n) = \frac1{2m}\sum_{i=1}^{m}{(h_θ(x^{(i)}) - y^{(i)})^2}$$

$$\frac{\partial}{\partial θ_j}J(θ_0, θ_1, …,θ_n) = … = 0, \quad j = 0, 1, 2, …, n$$

Solving this system of equations for θ in closed form gives

$$θ = (X^TX)^{-1}X^Ty$$
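
In NumPy this is a one-liner (reusing the hypothetical `X` and `y` from the earlier sketches; `pinv` is used so the code also tolerates a non-invertible $X^TX$):

```python
import numpy as np

# θ = (XᵀX)⁻¹ Xᵀy; the pseudo-inverse handles the case where XᵀX is singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```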

## 4. Gradient descent Vs Normal equation

• Need to chose learning rate
• Needs many iterations - could make it slower
• Works well even when n is massive (millions)
• Better suited to big data
• What is a big n though: 100 or even a 1000 is still (relativity) small, If n is 10000 then look at using gradient descent
• 适用于线性回归会逻辑回归

### Normal equation

• No need to choose a learning rate
• No need to iterate, check for convergence, etc.
• Needs to compute $(X^TX)^{-1}$, the inverse of an n × n matrix
• With most implementations, computing a matrix inverse grows as O(n³), so not great
• Slow if n is large; can be much slower
• Only applies to linear regression

## 5. Locally weighted linear regression

$$J(\theta) = \sum_{i=1}^{m} w^{(i)}( y^{(i)}-\theta^Tx^{(i)} )^2$$

$$w^{(i)} = \exp\left(-\frac{(x^{(i)}-x)^2}{\tau^2}\right)$$

The form of $w^{(i)}$ looks a lot like a normal density, but the two have no real connection; the form is simply convenient to compute with. Note that when a training example $x^{(i)}$ is very close to the query point $x$, $w^{(i)}$ is close to 1 and that example contributes a lot to the fit; when $x^{(i)}$ is very far from $x$, $w^{(i)}$ is close to 0 and that example contributes very little.

$\tau$ is the bandwidth parameter; it controls how quickly the weight falls off with distance. The smaller τ is, the faster $w^{(i)}$ decays as $x^{(i)}$ moves away from the query point $x$.
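
A minimal NumPy sketch of a locally weighted prediction at a single query point (function and variable names are illustrative; θ comes from the weighted version of the normal equation, $θ = (X^TWX)^{-1}X^TWy$, which minimizes the weighted cost above):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Predict at one query point: fit θ with weights w^(i), then return θᵀ·x_query."""
    # w^(i) = exp(-||x^(i) - x_query||² / τ²), one weight per training example
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / tau ** 2)
    W = np.diag(w)
    # Weighted normal equation: θ = (XᵀWX)⁻¹ XᵀWy
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

# Example: predict at the first training point with a hypothetical bandwidth τ = 0.5.
# y_hat = lwr_predict(X, y, X[0], tau=0.5)
```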

• Every prediction at a query point uses the entire training set, so it is inefficient when the training set is large and many points must be predicted. For ways to speed this up, see Andrew Moore's KD-tree work.
• It does not extrapolate: predictions for points outside the region covered by the training samples are poor; in fact, this is also a weakness of ordinary linear regression.

## Reference

http://www.holehouse.org/mlclass/04_Linear_Regression_with_multiple_variables.html