4. Regression Analysis

1. The principle of least squares

Let u be a function of the variables x, y, …, containing m parameters a_1, a_2, …, a_m, namely

u = f(a_1, a_2, …, a_m; x, y, …)

Now make n observations of u and of x, y, …, say (x_i, y_i, …; u_i) (i = 1, 2, …, n). The absolute error between the theoretical value of u and the observed value u_i is then

ε_i = f(a_1, a_2, …, a_m; x_i, y_i, …) − u_i   (i = 1, 2, …, n)

The least squares method requires that these n errors have the smallest sum of squares, so that the function u = f(a_1, a_2, …, a_m; x, y, …) best fits the observed values u_1, u_2, …, u_n. That is, the parameters a_1, a_2, …, a_m should make

Q(a_1, a_2, …, a_m) = Σ_{i=1}^n [f(a_1, a_2, …, a_m; x_i, y_i, …) − u_i]² = minimum

By the extreme-value method of differential calculus, a_1, a_2, …, a_m must satisfy the equations

∂Q/∂a_j = 0   (j = 1, 2, …, m)
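For a model that is linear in its parameters, the conditions ∂Q/∂a_j = 0 reduce to a system of linear "normal equations". A minimal Python sketch for the simplest case u = a1 + a2·x follows; the function and variable names are illustrative, not from the text.

```python
# Least-squares sketch for the model u = a1 + a2*x, which is linear
# in its parameters, so dQ/da1 = 0 and dQ/da2 = 0 reduce to two
# linear normal equations.

def fit_line(xs, us):
    n = len(xs)
    sx = sum(xs)
    su = sum(us)
    sxx = sum(x * x for x in xs)
    sxu = sum(x * u for x, u in zip(xs, us))
    # Normal equations:
    #   a1*n  + a2*sx  = su
    #   a1*sx + a2*sxx = sxu
    a2 = (n * sxu - sx * su) / (n * sxx - sx * sx)
    a1 = (su - a2 * sx) / n
    return a1, a2
```

For data lying exactly on u = 1 + 2x, fit_line recovers a1 = 1 and a2 = 2.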

2. Univariate Linear Regression

[Univariate regression equation]   Suppose the observed values of the variable y corresponding to the independent variable x are

(x_i, y_i)   (i = 1, 2, …, n)

If there is a linear relationship between the variables, the straight line

ŷ = a + bx

can be used to fit the relationship between them. By the least squares method, a and b should make

Q(a, b) = Σ_{i=1}^n (y_i − a − b x_i)² = minimum

which gives

b = l_xy / l_xx,   a = ȳ − b x̄

in the formula

x̄ = (1/n) Σ x_i,   ȳ = (1/n) Σ y_i
l_xx = Σ (x_i − x̄)²,   l_xy = Σ (x_i − x̄)(y_i − ȳ),   l_yy = Σ (y_i − ȳ)²

The equation ŷ = a + bx is called the regression equation (or regression line), and b is called the regression coefficient.
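The formulas b = l_xy/l_xx and a = ȳ − b x̄ can be computed directly; a small sketch (names illustrative):

```python
# Computes the regression coefficient b and constant term a of the
# regression line y-hat = a + b*x from the centered sums l_xx, l_xy.

def regression_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    lxx = sum((x - xbar) ** 2 for x in xs)
    lxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = lxy / lxx        # regression coefficient
    a = ybar - b * xbar  # constant term
    return a, b
```

For points lying exactly on y = 1 + 2x this returns a = 1, b = 2.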

[Correlation coefficient and its test table]   The correlation coefficient r_xy reflects the closeness of the linear relationship between the variables x and y. It is defined by

r_xy = l_xy / √(l_xx l_yy)

where l_xx, l_yy, l_xy are as above. (When no confusion can arise, r_xy is abbreviated as r.) Obviously |r| ≤ 1. When |r| = 1, x and y are said to be completely linearly correlated; when r = 0, they are completely linearly uncorrelated; the closer |r| is to 1, the closer the linear correlation.

The following table gives the minimum value of |r| for which the fitted straight line is meaningful. It depends on the number of observations n and the given significance level α: when |r| is greater than the corresponding value in the table, the fitted straight line is meaningful.

n−2   α=5%   α=1%  |  n−2   α=5%   α=1%  |  n−2    α=5%   α=1%
 1    0.997  1.000 |  16    0.468  0.590 |   35    0.325  0.418
 2    0.950  0.990 |  17    0.456  0.575 |   40    0.304  0.393
 3    0.878  0.959 |  18    0.444  0.561 |   45    0.288  0.372
 4    0.811  0.917 |  19    0.433  0.549 |   50    0.273  0.354
 5    0.754  0.874 |  20    0.423  0.537 |   60    0.250  0.325
 6    0.707  0.834 |  21    0.413  0.526 |   70    0.232  0.302
 7    0.666  0.798 |  22    0.404  0.515 |   80    0.217  0.283
 8    0.632  0.765 |  23    0.396  0.506 |   90    0.205  0.267
 9    0.602  0.735 |  24    0.388  0.496 |  100    0.195  0.254
10    0.576  0.708 |  25    0.381  0.487 |  125    0.174  0.228
11    0.553  0.684 |  26    0.374  0.478 |  150    0.159  0.208
12    0.532  0.661 |  27    0.367  0.470 |  200    0.138  0.181
13    0.514  0.641 |  28    0.361  0.463 |  300    0.113  0.148
14    0.497  0.623 |  29    0.355  0.456 |  400    0.098  0.128
15    0.482  0.606 |  30    0.349  0.449 | 1000    0.062  0.081

Note that when the number of observations n is large, the correlation coefficient can be approximated as follows. Plot the observation pairs (x_i, y_i) (i = 1, 2, …, n) on graph paper. First draw a horizontal line so that equally many points lie above and below it, then draw a vertical line so that equally many points lie to its left and right (try to let no point fall exactly on either line). These two lines divide the plane into four parts (Figure 16.5). Let the numbers of points in the upper-right, upper-left, lower-left, and lower-right parts be n_1, n_2, n_3, n_4 respectively, and let

n₊ = n_1 + n_3,   n₋ = n_2 + n_4

Then the correlation coefficient is approximately

r ≈ sin[ (π/2) · (n₊ − n₋)/(n₊ + n₋) ]
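The quadrant-count approximation can be sketched in Python, assuming the sine form of the approximation given above; the function name is illustrative.

```python
import math

# Quadrant-count approximation of r: n1..n4 are the numbers of points
# in the upper-right, upper-left, lower-left and lower-right regions
# cut out by the two median lines.

def approx_r(n1, n2, n3, n4):
    n_plus = n1 + n3
    n_minus = n2 + n4
    return math.sin(math.pi / 2 * (n_plus - n_minus) / (n_plus + n_minus))
```

When all points lie in the upper-right and lower-left parts the approximation gives r = 1; when the four counts are equal it gives r = 0.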

[Residual standard deviation]

s = √[ (l_yy − b l_xy) / (n − 2) ] = √[ Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) ]

is called the residual standard deviation. It describes the precision of the regression line: for each x in the experimental range, about 95.4% of the y values fall between the two parallel lines

y = a + bx − 2s   and   y = a + bx + 2s   (Fig. 16.6)

and about 99.7% of the y values fall between the two parallel lines

y = a + bx − 3s   and   y = a + bx + 3s.
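The correlation coefficient and the residual standard deviation can be computed together; a sketch with illustrative names and sample data:

```python
import math

# Computes the correlation coefficient r and the residual standard
# deviation s of the fitted line, following the formulas above.

def r_and_s(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    lxx = sum((x - xbar) ** 2 for x in xs)
    lyy = sum((y - ybar) ** 2 for y in ys)
    lxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = lxy / lxx
    r = lxy / math.sqrt(lxx * lyy)
    s = math.sqrt((lyy - b * lxy) / (n - 2))  # residual standard deviation
    return r, s
```

For the sample data xs = [1, 2, 3, 4], ys = [2.1, 3.9, 6.2, 7.8] this gives r ≈ 0.998 and s ≈ 0.202.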

[Calculation steps of univariate regression]   For convenience of calculation, rewrite l_xx, l_yy, l_xy as

l_xx = Σ x_i² − (1/n)(Σ x_i)²
l_yy = Σ y_i² − (1/n)(Σ y_i)²
l_xy = Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)

and integerize the data, that is, replace the observations by

x′_i = (x_i − c_1)/d_1,   y′_i = (y_i − c_2)/d_2

where c_1, c_2 and d_1, d_2 are constants chosen so that the x′_i, y′_i become small integers. After integerization we have

b = (d_2/d_1) b′,   a = ȳ − b x̄,   r = r′

where b′ and r′ are computed from the integerized data.

The calculation is then laid out in the following table:

serial number |  x_i |  y_i |  x_i² |  y_i² |  x_i y_i
      1       |      |      |       |       |
      2       |      |      |       |       |
      ⋮       |      |      |       |       |
      n       |      |      |       |       |
     sums     | Σx_i | Σy_i | Σx_i² | Σy_i² | Σx_i y_i

Calculated results: regression coefficient b = l_xy/l_xx; constant term a = ȳ − b x̄; regression equation ŷ = a + bx; correlation coefficient r = l_xy/√(l_xx l_yy); residual standard deviation s.

[Analysis of variance for univariate linear regression]   Regard the independent variable x as a single factor with the n levels x_1, x_2, …, x_n, and suppose k observations y_ij (i = 1, 2, …, n; j = 1, 2, …, k) are made at each level, recorded as follows:

x_1 :  y_11  y_12  …  y_1k
x_2 :  y_21  y_22  …  y_2k
 ⋮
x_n :  y_n1  y_n2  …  y_nk

Fit the regression equation

ŷ = a + bx

to the pairs (x_i, y_ij). The total sum of squares of y is

S_total = Σ_{i=1}^n Σ_{j=1}^k (y_ij − ȳ)²,   ȳ = (1/nk) Σ_i Σ_j y_ij

and it can be decomposed as

S_total = S_reg + S_rem + S_err

The term S_reg on the right is called the regression sum of squares; it is the part of the variation of y caused by the change of x. S_rem, the remaining sum of squares, is caused by other factors or by an inappropriate fit of the regression line, and S_err, the error sum of squares, by random factors such as experimental error.

Similar to one-way ANOVA, the ANOVA table for univariate linear regression is as follows:

source of variance | sum of squares | degrees of freedom | mean square | statistic | confidence limit | statistical inference
regression | S_reg | 1 | S_reg/1 | F_reg = (S_reg/1)/(S_rem/(n−2)) | F_α(1, n−2) | when F_reg > F_α the regression is considered significant; otherwise not significant
remaining | S_rem | n−2 | S_rem/(n−2) | F_rem = (S_rem/(n−2))/(S_err/(n(k−1))) | F_α(n−2, n(k−1)) | when F_rem > F_α the remaining influence is considered significant; otherwise not significant
error | S_err | n(k−1) | S_err/(n(k−1)) |  |  |
total sum of squares | S_total | nk−1 |  |  |  |

When testing F_rem: if the influence is not significant, the remaining sum of squares is essentially caused by random factors such as experimental error. If it is significant, there may be other factors that cannot be ignored, or x and y may not be linearly related, or may not be related at all; the regression line obtained then cannot describe the relationship between x and y, and it is necessary to identify the cause and refit the line.

When testing F_reg: if the influence is significant, there is a linear relationship between x and y; if it is not significant, the line must be refitted.

S_total, S_reg, S_rem, and S_err are calculated according to the following formulas (the data may be integerized first):

S_total = Σ_i Σ_j y_ij² − (1/nk)(Σ_i Σ_j y_ij)² = l_yy

S_reg = l_xy² / l_xx = b l_xy

S_rem = k Σ_{i=1}^n (ȳ_i − ŷ_i)²

S_err = S_total − S_reg − S_rem

in the formulas

ȳ_i = (1/k) Σ_{j=1}^k y_ij,   ŷ_i = a + b x_i
l_xx = k Σ_i (x_i − x̄)²,   l_xy = k Σ_i (x_i − x̄)(ȳ_i − ȳ),   l_yy = Σ_i Σ_j (y_ij − ȳ)²
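The decomposition above can be sketched in Python; the layout y[i][j] (k repeats at each of n levels) and all names are illustrative.

```python
# Sums-of-squares decomposition for repeated observations y[i][j]
# at levels x[i] (n levels, k repeats each), following the formulas
# above; S_err is obtained by subtraction.

def anova_sums(x, y):
    n, k = len(x), len(y[0])
    ybar = sum(sum(row) for row in y) / (n * k)
    xbar = sum(x) / n
    row_means = [sum(row) / k for row in y]
    lxx = k * sum((xi - xbar) ** 2 for xi in x)
    lxy = k * sum((xi - xbar) * (mi - ybar) for xi, mi in zip(x, row_means))
    b = lxy / lxx
    a = ybar - b * xbar
    s_total = sum((yij - ybar) ** 2 for row in y for yij in row)
    s_reg = b * lxy
    s_rem = k * sum((mi - (a + b * xi)) ** 2 for xi, mi in zip(x, row_means))
    s_err = s_total - s_reg - s_rem
    return s_total, s_reg, s_rem, s_err
```

With this decomposition, S_err coincides with the within-level scatter Σ_i Σ_j (y_ij − ȳ_i)², which can serve as a numerical check.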

3. Parabolic regression

Given a set of observations (x_i, y_i) (i = 1, 2, …, n), if they show a parabolic relationship, a polynomial of degree m (m ≥ 2)

p(x) = a_0 + a_1 x + … + a_m x^m

can be used to fit them. According to the principle of least squares, the coefficients should make

Q(a_0, a_1, …, a_m) = Σ_{i=1}^n [p(x_i) − y_i]² = minimum

In particular, if p(x) is taken as the quadratic polynomial

p(x) = a + bx + cx²

then the coefficients a, b, c satisfy the equations

a·n     + b·Σx_i  + c·Σx_i² = Σy_i
a·Σx_i  + b·Σx_i² + c·Σx_i³ = Σx_i y_i
a·Σx_i² + b·Σx_i³ + c·Σx_i⁴ = Σx_i² y_i

in the formulas all sums run over i = 1, 2, …, n.
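The three normal equations can be solved with a small Gaussian elimination; a self-contained sketch (names illustrative):

```python
# Solves the three normal equations above for the quadratic fit
# y = a + b*x + c*x**2, using Gaussian elimination with pivoting.

def fit_quadratic(xs, ys):
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    # Augmented matrix of the three normal equations.
    m = [[n, sx, sx2, sy],
         [sx, sx2, sx3, sxy],
         [sx2, sx3, sx4, sx2y]]
    for col in range(3):  # forward elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    abc = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        abc[r] = (m[r][3] - sum(m[r][c] * abc[c] for c in range(r + 1, 3))) / m[r][r]
    return abc  # [a, b, c]
```

For data lying exactly on y = 1 + 2x + 3x² the coefficients are recovered exactly (up to rounding).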

4. Curve regression that can be transformed into linear regression

If the observations, plotted against each other on graph paper, fall along a curve, an appropriate change of variables can be made so that a linear regression can be performed on the two new variables; the result is then transformed back to the original variables.

Common curve types that can be straightened

Curve type | Linearizing variable substitution
1° Power function y = a x^b; the points (x_i, y_i) lie on a straight line on double-logarithmic paper | Let X = lg x, Y = lg y; then Y = lg a + bX
2° Exponential function y = a e^{bx}; the points lie on a straight line on semi-logarithmic paper | Let X = x, Y = ln y; then Y = ln a + bX
3° Exponential function y = a e^{b/x} | Let X = 1/x, Y = ln y; then Y = ln a + bX
4° Logarithmic function y = a + b lg x | Let X = lg x, Y = y
5° Hyperbola 1/y = a + b/x | Let X = 1/x, Y = 1/y
6° S-curve y = 1/(a + b e^{−x}) | Let X = e^{−x}, Y = 1/y

For curves of these types shifted along a coordinate axis, the shift constant c is first determined from three suitably chosen points on the given curve, after which the substitution of the unshifted type applies. When the given x values form an arithmetic progression with common difference h, a constant term can be eliminated by working with the y values at x + h, x + 2h before the linearizing substitution.
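The substitution idea can be illustrated for the power-function type: with X = ln x, Y = ln y the model y = a x^b becomes the line Y = ln a + bX. A sketch (names illustrative):

```python
import math

# Linearization sketch for the power-function type y = a * x**b:
# fit a straight line in (ln x, ln y), then back-transform a.

def fit_power(xs, ys):
    X = [math.log(x) for x in xs]
    Y = [math.log(y) for y in ys]
    n = len(X)
    Xbar, Ybar = sum(X) / n, sum(Y) / n
    lxx = sum((u - Xbar) ** 2 for u in X)
    lxy = sum((u - Xbar) * (v - Ybar) for u, v in zip(X, Y))
    b = lxy / lxx
    a = math.exp(Ybar - b * Xbar)  # back-transform the intercept
    return a, b
```

For data lying exactly on y = 2x³ this recovers a = 2, b = 3.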

5. Binary Linear Regression

[Regression equation]   Suppose the values of y corresponding to values x_1i, x_2i of the independent variables x_1 and x_2 are y_i, so that n points (x_1i, x_2i; y_i) (i = 1, 2, …, n) are obtained. The regression equation is

ŷ = b_0 + b_1 x_1 + b_2 x_2      (1)

where b_1, b_2 are the regression coefficients, determined by the following equations:

l_11 b_1 + l_12 b_2 = l_01
l_12 b_1 + l_22 b_2 = l_02

here

l_jk = Σ_i (x_ji − x̄_j)(x_ki − x̄_k) = Σ_i x_ji x_ki − (1/n)(Σ_i x_ji)(Σ_i x_ki)   (j, k = 1, 2)
l_0j = Σ_i (y_i − ȳ)(x_ji − x̄_j) = Σ_i y_i x_ji − (1/n)(Σ_i y_i)(Σ_i x_ji)   (j = 1, 2)      (2)

To simplify the calculation the data may be transformed by a shift (without integerization), that is

x′_ji = x_ji − c_j,   y′_i = y_i − c_0      (3)

which leaves the l's unchanged. The constant term in formula (1) is

b_0 = ȳ − b_1 x̄_1 − b_2 x̄_2

[Multiple correlation coefficient and partial correlation coefficient]

R = √[ (b_1 l_01 + b_2 l_02) / l_00 ]

is called the multiple (complex) correlation coefficient, where

l_00 = Σ_i (y_i − ȳ)² = Σ_i y_i² − (1/n)(Σ_i y_i)²      (4)

and l_01, l_02 are shown in (2). The multiple correlation coefficient satisfies 0 ≤ R ≤ 1; its meaning is similar to that of the correlation coefficient r in univariate linear regression analysis: it measures the closeness of the linear relationship between y and x_1, x_2.

To express the correlation between y and only one of the variables (x_1 or x_2), the influence of the other variable must first be removed; the coefficient then calculated is called the partial correlation coefficient. The correlation coefficient of x_1 and y after removing the influence of x_2 is called the partial correlation coefficient of x_1 and y with respect to x_2, denoted r_{1y·2}. It can be expressed through ordinary correlation coefficients:

r_{1y·2} = (r_{1y} − r_{2y} r_{12}) / √[ (1 − r_{2y}²)(1 − r_{12}²) ]

Similarly, the partial correlation coefficient of x_2 and y with respect to x_1 is

r_{2y·1} = (r_{2y} − r_{1y} r_{12}) / √[ (1 − r_{1y}²)(1 − r_{12}²) ]

[Residual standard deviation]

s = √[ (l_00 − b_1 l_01 − b_2 l_02) / (n − 3) ]

is called the residual standard deviation; its meaning is similar to that of the residual standard deviation s in univariate linear regression analysis.

[Standard regression coefficients and partial regression sums of squares]   When the two factors x_1 and x_2 are not closely related to each other, the following methods can be used to determine which factor is the main one.

1° The quantities

B_1 = b_1 √(l_11 / l_00),   B_2 = b_2 √(l_22 / l_00)

are called the standard regression coefficients, where b_1, b_2 are the regression coefficients, l_11, l_22 are shown in (2), and l_00 in (4). If |B_1| > |B_2|, then of the two factors affecting the variable y, x_1 is the main factor and x_2 the secondary one.

2° The quantities

p_1 = b_1² (l_11 − l_12²/l_22),   p_2 = b_2² (l_22 − l_12²/l_11)

are called the partial regression sums of squares, where b_1, b_2 are the regression coefficients and l_11, l_12, l_22 are shown in (2). If p_1 > p_2, x_1 is the main factor and x_2 the secondary one.

[t-values]

t_1 = √p_1 / s,   t_2 = √p_2 / s

are called the t values of x_1 and x_2 respectively, where s is the residual standard deviation and p_1, p_2 are the partial regression sums of squares. The larger the t value, the more important the factor. According to experience: when t_i > 1, the factor x_i has a certain influence on y; when t_i > 2, x_i is regarded as an important factor; when t_i < 1, x_i has little effect on y and can be ignored, not participating in the regression calculation.

[Binary linear regression calculation table]   The x′_ji, y′_i in the table are the simplified (shifted) data.

serial no. | x′_1i | x′_2i | y′_i | x′_1i² | x′_2i² | x′_1i x′_2i | x′_1i y′_i | x′_2i y′_i | y′_i²
    1      |       |       |      |        |        |             |            |            |
    2      |       |       |      |        |        |             |            |            |
    ⋮      |       |       |      |        |        |             |            |            |
    n      |       |       |      |        |        |             |            |            |
   sums    |       |       |      |        |        |             |            |            |

From the column sums, l_11, l_12, l_22, l_01, l_02, l_00 are calculated according to (2) and (4); solving for b_1, b_2 and b_0 then gives the regression equation

ŷ = b_0 + b_1 x_1 + b_2 x_2

Continue by calculating the multiple correlation coefficient R, the standard regression coefficients B_1 and B_2, the partial regression sums of squares p_1 and p_2, and the t values t_1 and t_2, and perform the binary regression analysis on the basis of these quantities.

For binary nonlinear regression problems, an appropriate change of variables can be made so that the relationship between the new variables is linear, after which the regression analysis proceeds as above.
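The binary-regression quantities of this section can be computed together; a sketch that solves the 2×2 system for b_1, b_2 and then forms b_0, R, s, the standard regression coefficients, the partial regression sums of squares and the t values (all names and the sample data are illustrative):

```python
import math

# Binary linear regression: solve the normal equations for b1, b2,
# then form b0, R, s, (B1, B2), (p1, p2) and the t values.

def binary_regression(x1, x2, y):
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    l11 = sum((u - m1) ** 2 for u in x1)
    l22 = sum((u - m2) ** 2 for u in x2)
    l12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    l01 = sum((v - my) * (u - m1) for u, v in zip(x1, y))
    l02 = sum((v - my) * (u - m2) for u, v in zip(x2, y))
    l00 = sum((v - my) ** 2 for v in y)
    det = l11 * l22 - l12 * l12
    b1 = (l01 * l22 - l02 * l12) / det
    b2 = (l02 * l11 - l01 * l12) / det
    b0 = my - b1 * m1 - b2 * m2
    R = math.sqrt((b1 * l01 + b2 * l02) / l00)
    s = math.sqrt(max(l00 - b1 * l01 - b2 * l02, 0.0) / (n - 3))
    B1, B2 = b1 * math.sqrt(l11 / l00), b2 * math.sqrt(l22 / l00)
    p1 = b1 ** 2 * (l11 - l12 ** 2 / l22)
    p2 = b2 ** 2 * (l22 - l12 ** 2 / l11)
    t1 = math.sqrt(p1) / s if s > 0 else float("inf")
    t2 = math.sqrt(p2) / s if s > 0 else float("inf")
    return b0, b1, b2, R, s, (B1, B2), (p1, p2), (t1, t2)
```

For data generated near the plane y = 1 + 2x_1 + 3x_2 the coefficients come back close to 1, 2, 3 and R is close to 1.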

6. Multiple linear regression

Consider the relationship between the independent variables x_1, x_2, …, x_m and the dependent variable y. Make n experiments, with observed values (x_1i, x_2i, …, x_mi; y_i) (i = 1, 2, …, n). Let

x̄_j = (1/n) Σ_i x_ji,   ȳ = (1/n) Σ_i y_i
l_jk = Σ_i (x_ji − x̄_j)(x_ki − x̄_k)   (j, k = 1, 2, …, m)
l_0j = Σ_i (y_i − ȳ)(x_ji − x̄_j)   (j = 1, 2, …, m),   l_00 = Σ_i (y_i − ȳ)²

Set the matrix

L = (l_jk)_{m×m}

Its inverse matrix is

C = L⁻¹ = (c_jk)_{m×m}

[Regression equation]

ŷ = b_0 + b_1 x_1 + … + b_m x_m

where b_1, …, b_m are the regression coefficients. In vector form,

b = C l_0

in which

b = (b_1, b_2, …, b_m)ᵀ,   l_0 = (l_01, l_02, …, l_0m)ᵀ

The constant term is

b_0 = ȳ − b_1 x̄_1 − b_2 x̄_2 − … − b_m x̄_m

[Multiple correlation coefficient]

R = √[ (b_1 l_01 + b_2 l_02 + … + b_m l_0m) / l_00 ]

[Residual standard deviation]

s = √[ (l_00 − b_1 l_01 − b_2 l_02 − … − b_m l_0m) / (n − m − 1) ]

[ANOVA table for multiple linear regression]

source of variance | sum of squares | degrees of freedom | mean square | statistic | confidence limit | statistical inference
regression | S_reg = Σ_j b_j l_0j | m | S_reg/m | F = (S_reg/m)/(S_rem/(n−m−1)) | F_α(m, n−m−1) | when F > F_α the regression is considered significant and the linear correlation close; when F ≤ F_α the regression is considered not significant and the linear correlation not close
remaining | S_rem = l_00 − Σ_j b_j l_0j | n−m−1 | S_rem/(n−m−1) |  |  |
total sum of squares | l_00 | n−1 |  |  |  |

[Standard regression coefficients and partial regression sums of squares]

Standard regression coefficients:

B_j = b_j √(l_jj / l_00)   (j = 1, 2, …, m)

Partial regression sums of squares:

p_j = b_j² / c_jj   (j = 1, 2, …, m)

[t-values]

t_j = √p_j / s   (j = 1, 2, …, m)

Multiple linear regression analysis is similar to the binary case, but the amount of calculation is larger and is best carried out on a computer.
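The vector formulas b = C l_0 and b_0 = ȳ − Σ_j b_j x̄_j can be sketched in pure Python; the helper solve (which applies L⁻¹ by Gaussian elimination instead of forming C explicitly) and the data layout are illustrative assumptions.

```python
# Matrix-form sketch of multiple linear regression: build L = (l_jk)
# and l0 = (l_0j), solve L b = l0, then form the constant term b0.

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (M[r][m] - sum(M[r][c] * x[c] for c in range(r + 1, m))) / M[r][r]
    return x

def multiple_regression(X, y):
    """X: list of n rows, each holding the m independent-variable values."""
    n, m = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(m)]
    ybar = sum(y) / n
    L = [[sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in X)
          for k in range(m)] for j in range(m)]
    l0 = [sum((yi - ybar) * (row[j] - xbar[j]) for row, yi in zip(X, y))
          for j in range(m)]
    b = solve(L, l0)                                     # regression coefficients
    b0 = ybar - sum(bj * xj for bj, xj in zip(b, xbar))  # constant term
    return b0, b
```

For data lying exactly on y = 1 + 2x_1 + 3x_2 + 4x_3 the coefficients are recovered exactly (up to rounding).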