## 2. Linear Regression with One Variable

$$h_{\theta}(x) = \theta_{0} + \theta_{1}x$$

## 3. Cost Function

Goal: find the values of $\theta_0$ and $\theta_1$ that minimize

$$\frac1{2m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}$$

$$J(\theta_0, \theta_1) = \frac1{2m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}$$

#### The Role of the Cost Function

With the simplified hypothesis $h_{\theta}(x) = \theta_1x$ (i.e. fixing $\theta_0 = 0$), the cost becomes a function of $\theta_1$ alone, which makes it easy to see how the cost changes as $\theta_1$ varies:

$$J(\theta_1)= \frac1{2m}\sum_{i=1}^{m}{(\theta_1x^{(i)} - y^{(i)})^2}$$
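
A minimal sketch of computing this cost in code (NumPy-based; the function name `compute_cost` and the tiny dataset are my own illustration, not from the notes):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Tiny example: points lying exactly on y = 2x give zero cost at theta1 = 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))   # 0.0  (perfect fit)
print(compute_cost(0.0, 1.5, x, y))   # > 0  (worse fit, higher cost)
```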

## 4. Gradient Descent

• Start at (0, 0) (or any other initial values)
• Keep changing $\theta_0$ and $\theta_1$ a little bit to try to reduce J($\theta_0$, $\theta_1$)
• Each time you change the parameters, you move in the direction of steepest descent, i.e. the direction that reduces J($\theta_0$, $\theta_1$) the most
• Repeat
• Do so until you converge to a local minimum

Gradient descent has an interesting property:
• Where you start can determine which minimum you end up in
• Here we can see one initialization point led to one local minimum
• The other led to a different one

### 4.1 The Update Procedure

$$\theta_j := \theta_j - \alpha \frac\partial{\partial\theta_j}J(\theta_0, \theta_1)$$
(for j = 0 and j = 1)

### 4.2 Notation

$\alpha$

• Is a number called the learning rate
• Controls how big a step you take
  • If α is big, gradient descent is aggressive
  • If α is small, it takes tiny steps
• Too small
  • Takes baby steps
  • Takes too long to converge
• Too large
  • Can overshoot the minimum and fail to converge (see the sketch after this list)
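
A small sketch of both failure modes (my own illustration, not from the notes), running gradient descent on $f(x) = x^2$, whose gradient is $2x$ and whose minimum is at $x = 0$:

```python
def descend(alpha, steps=10, x=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x); returns the final iterate."""
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

print(descend(alpha=0.01))   # ~0.82: baby steps, still far from the minimum
print(descend(alpha=0.1))    # ~0.11: reasonable progress toward 0
print(descend(alpha=1.1))    # ~6.19: overshoots every step and diverges
```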

### 4.3 Computing the Update

$$temp0:= \theta_0 - \alpha \frac\partial{\partial\theta_0}J(\theta_0, \theta_1)$$
$$temp1:= \theta_1 - \alpha \frac\partial{\partial\theta_1}J(\theta_0, \theta_1)$$
$$\theta_0 := temp0$$
$$\theta_1 := temp1$$
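
The temporary variables matter because both parameters must be updated from the same old values. A minimal sketch of one simultaneous update step (the gradient functions `dJ_dtheta0` and `dJ_dtheta1` are placeholders here; their concrete form for linear regression is derived in 4.4):

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous gradient-descent update of (theta0, theta1)."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)  # still sees the OLD theta0
    return temp0, temp1
```

Assigning `theta0 = temp0` before computing `temp1` would mix old and new parameter values within a single step, which is not the update defined above.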

### 4.4 Applying Gradient Descent to Linear Regression

$$\frac\partial{\partial\theta_j}J(\theta_0, \theta_1)$$

$$=\frac\partial{\partial\theta_j} \frac1{2m}\sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})^2}$$

$$=\frac\partial{\partial\theta_j} \frac1{2m}\sum_{i=1}^{m}{(\theta_0 +\theta_1x^{(i)} - y^{(i)})^2}$$

j = 0:
$$\frac\partial{\partial\theta_0}J(\theta_0, \theta_1) = \frac1{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})$$
j = 1:
$$\frac\partial{\partial\theta_1}J(\theta_0, \theta_1) = \frac1{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})\cdot x^{(i)}$$
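
Plugging these two partial derivatives into the update rule from 4.1 gives batch gradient descent for linear regression. A sketch under the same NumPy assumptions as earlier (function name and sample data are my own):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y     # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(errors) / m           # partial derivative w.r.t. theta0
        grad1 = np.sum(errors * x) / m       # partial derivative w.r.t. theta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])           # generated from y = 2x + 1
print(gradient_descent(x, y))                # approximately (1.0, 2.0)
```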

### 4.5 Justification for Gradient Descent

1. If the optimization objective has an analytical (closed-form) solution: to find an extremum we usually differentiate the objective function and look for points where the derivative is zero. If the cost function is easy to differentiate and setting its derivative to zero yields a closed-form solution, we can obtain the optimal parameters directly.

2. If the expression is hard to differentiate (for example, the function contains hidden variables, or the variables are coupled and depend on one another), or if setting the derivative to zero yields no analytical solution, or if the number of unknown parameters exceeds the number of equations, then an iterative algorithm is used to approach the optimal solution step by step.

• When the objective function is convex, the solution found by gradient descent is the global optimum
• In general, the solution is not guaranteed to be the global optimum

#### Convex Functions

A function $f$ is convex if, for any $x_1, x_2$ in its domain,

$$f(\frac{x_1+x_2}{2}) \le \frac{f(x_1)+f(x_2)}{2}$$

and, more generally, for any $p \in [0, 1]$,

$$f(px_1+(1-p)x_2) \le pf(x_1)+(1-p)f(x_2)$$
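
As a quick check that this applies here (my own worked step, not part of the original notes): the single-parameter cost $J(\theta_1)$ from Section 3 is convex, because its second derivative is non-negative, so gradient descent on it converges to the global minimum.

$$\frac{d}{d\theta_1}J(\theta_1) = \frac1{m}\sum_{i=1}^{m}(\theta_1x^{(i)} - y^{(i)})\,x^{(i)}, \qquad \frac{d^2}{d\theta_1^2}J(\theta_1) = \frac1{m}\sum_{i=1}^{m}(x^{(i)})^2 \ge 0$$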