# What is the gradient algorithm?

Linear regression, also called straight-line regression, is regression represented by a straight line, as opposed to curvilinear regression. If the regression equation of the dependent variable Y on the independent variables X1, X2, ..., Xm is a linear equation, namely μy = β0 + β1X1 + β2X2 + … + βmXm, where β0 is a constant term, βi is the regression coefficient of the independent variable Xi, and m is any natural number, then the regression of Y on X1, X2, ..., Xm is called a linear regression.

Simple regression:

Linear regression with only one independent variable is called simple regression, as shown in the following example:

X stands for the quantity of a certain product, and Y for the total price at each quantity:

x = [0, 1, 2, 3, 4, 5]

y = [0, 17, 45, 55, 85, 99]

Plotted in two-dimensional coordinates, the data looks like this:

Now if the number of goods X = 6, what is the total estimated price of the goods?

We can clearly see that the total price of goods increases as the number of goods increases, which is a typical linear regression.

Since there is only one independent variable X, we assume a linear regression model: Y = a * X + b

We need to find the most appropriate values for a and b so that the straight line Y = a * X + b fits the trend in the figure above; we can then predict the total price Y for any quantity of goods X.

Least Squares Method:

To find the most suitable a and b, we introduce the least squares method.

The least squares method, also known as least squares estimation, is a common method for estimating population parameters from sample observations. Given n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), it determines the best estimate y = f(x) of the correspondence between x and y by minimizing H, the sum of the squared differences (i.e., deviations) between the observed and estimated values:

H = (y1 − f(x1))² + (y2 − f(x2))² + … + (yn − f(xn))²

The least squares method eliminates the influence of random errors as far as possible, yielding the most reliable and most probable result from a set of observations.

From the figure above we can see that the straight line Y = a * X + b passes through the origin, i.e. b = 0.

We tried different values of a and obtained the following results:

when a = 19, H = 154

when a = 20, H = 85

when a = 21, H = 126

The pictures are as follows:

We can roughly conclude that when a = 20 and b = 0, the linear model Y = 20 * X fits the sample data better.

So if the quantity of goods is X = 6, we can roughly estimate the total price Y = 20 * 6 = 120
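As a quick check, the trial-and-error above can be reproduced in a few lines of Python (using the sample data from this section; the helper `H` exists only for this sketch):

```python
# Try candidate slopes a (with b = 0) and compute H, the sum of
# squared deviations between observed and estimated total prices.
x = [0, 1, 2, 3, 4, 5]
y = [0, 17, 45, 55, 85, 99]

def H(a):
    return sum((yi - a * xi) ** 2 for xi, yi in zip(x, y))

for a in (19, 20, 21):
    print(a, H(a))  # 19 -> 154, 20 -> 85, 21 -> 126
```

a = 20 gives the smallest H of the three candidates, matching the conclusion above.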

Multiple regression:

Linear regression with more than one independent variable is called multiple regression.

The example above has only one independent variable and is relatively easy to handle. But suppose there are many independent variables, say m of them: [X1, X2, X3, X4, ..., Xm].

In that case the regression coefficients (i.e. weights) must also number m, i.e. the linear model we assume is Y = W0 + X1*W1 + X2*W2 + X3*W3 + … + Xm*Wm

To simplify the calculation, we set X0 = 1.

In this way: Y = X0*W0 + X1*W1 + X2*W2 + X3*W3 + … + Xm*Wm

Written in vector form:

W = [W0, W1, W2, W3, ..., Wm]

X = [X0, X1, X2, X3, ..., Xm]

Y = WT * X (where WT is the transpose of the vector W)
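As a small illustration (the weight and feature values below are made up for the example), the prediction Y = WT * X is a single dot product in NumPy:

```python
import numpy as np

# With X0 fixed at 1, the intercept W0 is absorbed into the dot product.
W = np.array([1.5, 4.0, 5.0])   # illustrative weights [W0, W1, W2]
X = np.array([1.0, 2.0, 3.0])   # one sample [X0 = 1, X1, X2]
Y = W @ X                        # for 1-D arrays, @ computes W^T * X
print(Y)  # 1.5*1 + 4*2 + 5*3 = 24.5
```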

The sum of the squares of the differences (i.e., deviations) between the observed and estimated values is:

H = Σ (j = 1 to n) ( Yj − (W0*X0(j) + W1*X1(j) + … + Wm*Xm(j)) )²

To make the later calculation easier, we multiply H by one half:

H = (1/2) * Σ (j = 1 to n) ( Yj − Σ (i = 0 to m) Wi*Xi(j) )²

In the formulas above, n is the number of training samples and m is the number of features (independent variables) of each sample. The superscript (j) indicates that a value belongs to the jth sample, the subscript i indicates the ith feature, and Yj is the observed total price of the jth sample.

Now H is a function of W0, W1, W2, ..., Wm, and we need a suitable method to find the values of W that give a good linear regression equation. Unlike simple regression, it is hard to eyeball and try out different values of W here, so we must use an optimization algorithm.

Common optimization algorithms include gradient descent, Newton's method and quasi-Newton methods, the conjugate gradient method, and heuristic optimization methods. This article presents the gradient algorithm in detail.

To clarify our current goal: using the gradient algorithm, we must find the values of W0, W1, W2, W3, ..., Wm that minimize H, and then write out the regression equation.

The gradient algorithm is divided into the gradient ascent algorithm and the gradient descent algorithm. The basic idea of gradient descent is: to find the minimum value of a function, search along the direction opposite to the function's gradient; for gradient ascent, the opposite holds. For a function f(x, y) of two variables x and y, the gradient is:

∇f = ( ∂f/∂x, ∂f/∂y )

For Z = f(x, y), applying gradient descent means moving by −∂f/∂x along the x-axis and by −∂f/∂y along the y-axis at each step. The function f(x, y) must be defined and differentiable at the point being evaluated.

It can be understood as follows:

The gradient is the direction in which the function value changes fastest. For example, if you are standing on a mountainside, the direction indicated by the gradient is the direction in which the elevation changes fastest; walking along it changes the height of your position most quickly. If you instead walk in an arbitrary direction, your elevation may change very little even after a long walk. In other words, the quickest way to reach a peak or a valley is to keep walking along the gradient direction (or its opposite). In fact, the gradient algorithm looks for local minima or maxima, and in practice it is an efficient, fast, and reliable method.
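As a minimal sketch of this idea, here is gradient descent on the simple function f(x, y) = x² + y², whose gradient is (2x, 2y). The function, starting point, and step size are chosen purely for illustration:

```python
# Gradient of f(x, y) = x**2 + y**2 is (2x, 2y); its minimum is at (0, 0).
def grad_f(x, y):
    return 2 * x, 2 * y

x, y = 3.0, 4.0   # arbitrary starting point
alpha = 0.1       # step size
for _ in range(100):
    gx, gy = grad_f(x, y)
    # step against the gradient to decrease f
    x, y = x - alpha * gx, y - alpha * gy
print(round(x, 6), round(y, 6))  # both values approach 0, the minimum
```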

Using gradient descent to find the minimum of H:

We saw earlier:

H is a function of W = [W0, W1, W2, W3, ..., Wm], and the gradient of H is:

∇H = ( ∂H/∂W0, ∂H/∂W1, …, ∂H/∂Wm )

The gradient component for each Wi is:

∂H/∂Wi = Σ (j = 1 to n) ( Σ (k = 0 to m) Wk*Xk(j) − Yj ) * Xi(j)

We assume the step size of each update along the (negative) gradient direction is α, so the update formula for W can be written as:

Wi := Wi − α * ∂H/∂Wi

So the pseudocode of the gradient descent algorithm is as follows:

Initialize every regression coefficient (every W value) to 1

Repeat R times:

    Compute the gradient of the entire data set

    Update the regression coefficients with W := W − α * gradient

Example:

Use the gradient descent algorithm to find the linear regression equation for the following commodity data:

We assume the linear regression model is: total price Y = a + b * X1 + c * X2 (where X1 and X2 are the quantities of goods 1 and 2).

We need to solve for the regression coefficients W = [a, b, c].

The gradient descent algorithm is as follows:

    import numpy as np

    def grad_desc(train_data, train_labels):
        """Gradient descent."""
        data_mat = np.matrix(train_data)
        label_mat = np.matrix(train_labels).transpose()
        n = np.shape(data_mat)[1]
        # step size
        alpha = 0.001
        # maximum number of cycles
        max_cycles = 100
        # initialize regression coefficients to 1
        weights = np.ones((n, 1))
        for index in range(max_cycles):
            h = data_mat * weights - label_mat
            weights = weights - alpha * data_mat.transpose() * h
        # return the flattened array of coefficients
        return np.asarray(weights).flatten()

The regression coefficients we get with the above algorithm are:

[ 1.7218815 4.24881047 5.28838946]
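Since the original data table is not reproduced above, here is a self-contained sanity check of the same batch gradient descent on synthetic data generated from hypothetical coefficients Y = 2 + 4*X1 + 5*X2 (a larger step size and more iterations than the article's are used so the fit converges tightly):

```python
import numpy as np

def grad_desc(train_data, train_labels, alpha=0.01, max_cycles=3000):
    """Batch gradient descent, mirroring the article's grad_desc."""
    data_mat = np.asarray(train_data, dtype=float)
    label_mat = np.asarray(train_labels, dtype=float)
    weights = np.ones(data_mat.shape[1])
    for _ in range(max_cycles):
        h = data_mat @ weights - label_mat          # predictions minus labels
        weights = weights - alpha * data_mat.T @ h  # W := W - alpha * gradient
    return weights

# Hypothetical data from Y = 2 + 4*X1 + 5*X2 (coefficients chosen for the demo).
samples = [(x1, x2) for x1 in range(3) for x2 in range(3)]
X = [[1.0, x1, x2] for x1, x2 in samples]  # X0 = 1 absorbs the intercept
y = [2 + 4 * x1 + 5 * x2 for x1, x2 in samples]

w = grad_desc(X, y)
print(np.round(w, 2))  # approximately [2. 4. 5.]
```

The recovered weights match the coefficients the data was generated from, which is the behavior we expect from the routine above on the article's data as well.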
In the gradient algorithm above, the loop runs R = 100 times, and every update of the regression coefficients passes over the entire data set. If the data set is large, the computational cost is very high.
Therefore, the regression coefficients are usually updated with one sample point at a time instead, which is called the stochastic gradient algorithm.
The pseudocode of the stochastic gradient descent algorithm is as follows:
Initialize all regression coefficients to 1

For each sample in the data set:

    Compute the gradient of that sample

    Update the regression coefficients with W := W − α * gradient

The revised algorithm is as follows:

    import numpy as np

    def advanced_random_grad_desc(train_data, train_labels):
        """Improved stochastic gradient descent."""
        data_mat = np.asarray(train_data)
        label_mat = np.asarray(train_labels)
        m, n = np.shape(data_mat)
        # step size
        alpha = 0.001
        # initialize regression coefficients to 1
        weights = np.ones(n)
        max_cycles = 500
        for j in range(max_cycles):
            data_index = list(range(m))
            for i in range(m):
                # pick a random sample from those not yet used in this pass
                random_index = int(np.random.uniform(0, len(data_index)))
                sample = data_index[random_index]
                h = sum(data_mat[sample] * weights) - label_mat[sample]
                weights = weights - alpha * h * data_mat[sample]
                del data_index[random_index]
        return weights

The calculated regression coefficients are:

[ 1.27137416 4.31393524 5.2757683 ]

This gives the linear regression equation:

Y = 1.27 + 4.31 * X1 + 5.28 * X2
