4. Regression Analysis
1. The principle of least squares
Let u be a function of the variables x, y, …, containing m parameters a_1, a_2, …, a_m, namely

$$u = f(a_1, a_2, \dots, a_m;\ x, y, \dots)$$
Now make n observations of u and x, y, …, obtaining (x_i, y_i, …; u_i) (i = 1, 2, …, n). The deviation between the theoretical value of u and the observed value u_i is

$$\varepsilon_i = f(a_1, a_2, \dots, a_m;\ x_i, y_i, \dots) - u_i \qquad (i = 1, 2, \dots, n)$$
The method of least squares requires that these n deviations have the smallest possible sum of squares, so that the function u = f(a_1, a_2, …, a_m; x, y, …) best fits the observed values u_1, u_2, …, u_n. That is, the parameters a_1, a_2, …, a_m should make

$$Q(a_1, a_2, \dots, a_m) = \sum_{i=1}^{n} \left[ f(a_1, a_2, \dots, a_m;\ x_i, y_i, \dots) - u_i \right]^2$$

a minimum.
By the rule for extrema from differential calculus, a_1, a_2, …, a_m must satisfy the equations

$$\frac{\partial Q}{\partial a_j} = 0 \qquad (j = 1, 2, \dots, m)$$
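Numerically, this minimization can also be handed to a general-purpose least-squares solver instead of solving the system ∂Q/∂a_j = 0 by hand. The following is a minimal sketch in Python, assuming scipy is available; the model f(a_1, a_2; x) = a_1 e^{a_2 x}, the data arrays, and the starting guess are illustrative choices, not part of the handbook's text.

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative model: u = f(a1, a2; x) = a1 * exp(a2 * x).
def residuals(a, x, u):
    # The deviations epsilon_i = f(a; x_i) - u_i whose squares Q sums.
    return a[0] * np.exp(a[1] * x) - u

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
u = np.array([1.0, 1.6, 2.7, 4.5, 7.4])   # observed values u_i
a0 = np.array([1.0, 1.0])                 # starting guess for (a1, a2)

# least_squares minimizes the sum of squared residuals, i.e. Q.
fit = least_squares(residuals, a0, args=(x, u))
print(fit.x)                              # fitted parameters a1, a2
```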
2. Univariate Linear Regression
[Univariate regression equation] Suppose the observed values of the independent variable x and the dependent variable y are

x | x_1 | x_2 | … | x_n
y | y_1 | y_2 | … | y_n
If the variables show a linear relationship, a straight line

$$\hat{y} = a + bx$$

can be used to fit the relationship between them. By the method of least squares, a and b should make

$$Q(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

a minimum. This gives

$$b = \frac{l_{xy}}{l_{xx}}, \qquad a = \bar{y} - b\bar{x}$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$l_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad l_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad l_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

The equation ŷ = a + bx is called the regression equation (or regression line), and b is called the regression coefficient.
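A minimal numpy sketch of these two formulas (the data arrays are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

xbar, ybar = x.mean(), y.mean()
lxx = np.sum((x - xbar) ** 2)
lxy = np.sum((x - xbar) * (y - ybar))

b = lxy / lxx          # regression coefficient
a = ybar - b * xbar    # constant term
print(f"regression equation: y = {a:.3f} + {b:.3f} x")
```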
[Correlation coefficient and its test table] The correlation coefficient r_xy reflects the closeness of the linear relationship between the variables x and y. It is defined by

$$r_{xy} = \frac{l_{xy}}{\sqrt{l_{xx}\, l_{yy}}}$$

where l_xx, l_yy, l_xy are as above. (When no confusion can arise, r_xy is abbreviated to r.) Clearly |r| ≤ 1. When |r| = 1, the variables are completely linearly correlated; when r = 0, they are completely linearly uncorrelated; the closer |r| is to 1, the stronger the linear correlation.
The following table gives the minimum significant value of the correlation coefficient (it depends on the number of observations n and the chosen significance level α); when |r| exceeds the corresponding value in the table, the fitted straight line is meaningful.
n − 2 | α = 5% | α = 1% | n − 2 | α = 5% | α = 1% | n − 2 | α = 5% | α = 1%
1 | 0.997 | 1.000 | 16 | 0.468 | 0.590 | 35 | 0.325 | 0.418
2 | 0.950 | 0.990 | 17 | 0.456 | 0.575 | 40 | 0.304 | 0.393
3 | 0.878 | 0.959 | 18 | 0.444 | 0.561 | 45 | 0.288 | 0.372
4 | 0.811 | 0.917 | 19 | 0.433 | 0.549 | 50 | 0.273 | 0.354
5 | 0.754 | 0.874 | 20 | 0.423 | 0.537 | 60 | 0.250 | 0.325
6 | 0.707 | 0.834 | 21 | 0.413 | 0.526 | 70 | 0.232 | 0.302
7 | 0.666 | 0.798 | 22 | 0.404 | 0.515 | 80 | 0.217 | 0.283
8 | 0.632 | 0.765 | 23 | 0.396 | 0.506 | 90 | 0.205 | 0.267
9 | 0.602 | 0.735 | 24 | 0.388 | 0.496 | 100 | 0.195 | 0.254
10 | 0.576 | 0.708 | 25 | 0.381 | 0.487 | 125 | 0.174 | 0.228
11 | 0.553 | 0.684 | 26 | 0.374 | 0.478 | 150 | 0.159 | 0.208
12 | 0.532 | 0.661 | 27 | 0.367 | 0.470 | 200 | 0.138 | 0.181
13 | 0.514 | 0.641 | 28 | 0.361 | 0.463 | 300 | 0.113 | 0.148
14 | 0.497 | 0.623 | 29 | 0.355 | 0.456 | 400 | 0.098 | 0.128
15 | 0.482 | 0.606 | 30 | 0.349 | 0.449 | 1000 | 0.062 | 0.081
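These thresholds are the usual two-sided critical values of r; they can be reproduced from the t-distribution through the identity r_crit = t_c/√(t_c² + (n − 2)), which follows from the test statistic t = r√(n−2)/√(1−r²). A sketch assuming Python with scipy (the identity is standard; the function name is ours):

```python
import numpy as np
from scipy.stats import t

def r_critical(df, alpha):
    # Two-sided critical quantile of t, inverted through
    # t = r * sqrt(df) / sqrt(1 - r^2)  =>  r = t / sqrt(t^2 + df).
    tc = t.ppf(1.0 - alpha / 2.0, df)
    return tc / np.sqrt(tc ** 2 + df)

print(r_critical(1, 0.05))    # ~0.997, first entry of the table
print(r_critical(10, 0.01))   # ~0.708
```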
Note: when the number of observations n is large, the correlation coefficient can be approximated as follows. Plot the observation pairs (x_i, y_i) (i = 1, 2, …, n) on coordinate paper. First draw a horizontal line so that equally many points lie above and below it, then a vertical line so that equally many points lie to its left and right (try to let no point fall on either line). These two lines divide the plane into four quadrants (Fig. 16.5). Let the numbers of points in the upper-right, upper-left, lower-left, and lower-right quadrants be n_1, n_2, n_3, n_4 respectively, and put

$$n_+ = n_1 + n_3, \qquad n_- = n_2 + n_4$$

Then the correlation coefficient is approximately

$$r \approx \sin\!\left( \frac{\pi}{2} \cdot \frac{n_+ - n_-}{n_+ + n_-} \right)$$
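A short sketch of this counting procedure in numpy (the sine formula is the standard quadrant-count approximation; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)   # true correlation 0.8

# Split at the medians so points balance above/below and left/right.
mx, my = np.median(x), np.median(y)
n1 = np.sum((x > mx) & (y > my))   # upper right
n2 = np.sum((x < mx) & (y > my))   # upper left
n3 = np.sum((x < mx) & (y < my))   # lower left
n4 = np.sum((x > mx) & (y < my))   # lower right

n_plus, n_minus = n1 + n3, n2 + n4
r_approx = np.sin(np.pi / 2 * (n_plus - n_minus) / (n_plus + n_minus))
print(r_approx, np.corrcoef(x, y)[0, 1])   # approximation vs exact r
```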
[Residual standard deviation]

$$s = \sqrt{ \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} } = \sqrt{ \frac{l_{yy} - b\, l_{xy}}{n - 2} }$$

is called the residual standard deviation; it describes the precision of the regression line: for each x in the experimental range, about 95.4% of the y values fall between the two parallel lines ŷ = a + bx ± 2s (Fig. 16.6), and about 99.7% of the y values fall between the two parallel lines ŷ = a + bx ± 3s.
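Continuing the sketch above, r, s, and the two bands are obtained directly (self-contained; data again invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
n = len(x)

xbar, ybar = x.mean(), y.mean()
lxx = np.sum((x - xbar) ** 2)
lyy = np.sum((y - ybar) ** 2)
lxy = np.sum((x - xbar) * (y - ybar))
b = lxy / lxx
a = ybar - b * xbar

r = lxy / np.sqrt(lxx * lyy)              # correlation coefficient
s = np.sqrt((lyy - b * lxy) / (n - 2))    # residual standard deviation
print(f"r = {r:.3f}, s = {s:.3f}")
print(f"95.4% band: a + b*x ± {2*s:.3f}; 99.7% band: ± {3*s:.3f}")
```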
[Calculation steps of univariate regression] For convenience of calculation, rewrite l_xx, l_yy, l_xy as

$$l_{xx} = \sum x_i^2 - \frac{1}{n}\Big(\sum x_i\Big)^2, \qquad l_{yy} = \sum y_i^2 - \frac{1}{n}\Big(\sum y_i\Big)^2, \qquad l_{xy} = \sum x_i y_i - \frac{1}{n}\Big(\sum x_i\Big)\Big(\sum y_i\Big)$$

and integerize the data, that is, put

$$x_i' = \frac{x_i - c_1}{d_1}, \qquad y_i' = \frac{y_i - c_2}{d_2}$$

with the constants c_1, c_2, d_1, d_2 chosen so that the x_i', y_i' become convenient integers. After integerization we have

$$l_{xx} = d_1^2\, l_{x'x'}, \qquad l_{yy} = d_2^2\, l_{y'y'}, \qquad l_{xy} = d_1 d_2\, l_{x'y'}$$

The computation is then arranged in the following table:
serial number | x_i' | y_i' | x_i'² | y_i'² | x_i' y_i'
1 | x_1' | y_1' | x_1'² | y_1'² | x_1' y_1'
2 | x_2' | y_2' | x_2'² | y_2'² | x_2' y_2'
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
n | x_n' | y_n' | x_n'² | y_n'² | x_n' y_n'
sum | Σx_i' | Σy_i' | Σx_i'² | Σy_i'² | Σx_i' y_i'
 | x̄' = (1/n)Σx_i', ȳ' = (1/n)Σy_i' | l_{x'x'} = Σx_i'² − (Σx_i')²/n | l_{y'y'} = Σy_i'² − (Σy_i')²/n | l_{x'y'} = Σx_i' y_i' − (Σx_i')(Σy_i')/n

calculation results | regression coefficient b = l_xy/l_xx; constant term a = ȳ − b x̄; regression equation ŷ = a + bx; correlation coefficient r = l_xy/√(l_xx l_yy); residual standard deviation s = √((l_yy − b l_xy)/(n − 2))
[Analysis of variance for univariate linear regression] Regard the independent variable x as a single factor, and suppose k repeated observations y_ij (i = 1, 2, …, n; j = 1, 2, …, k) are taken at each value x_i, recorded as follows:

x_i | y_ij
x_1 | y_11  y_12  …  y_1k
x_2 | y_21  y_22  …  y_2k
⋮ | ⋮
x_n | y_n1  y_n2  …  y_nk
Fit the regression equation ŷ = a + bx to the pairs (x_i, ȳ_i), where ȳ_i = (1/k)Σ_j y_ij is the mean of the observations at x_i. The total sum of squares of y is

$$S_{total} = \sum_{i=1}^{n}\sum_{j=1}^{k} (y_{ij} - \bar{y})^2, \qquad \bar{y} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}$$

and it decomposes as

$$S_{total} = S_{reg} + S_{rem} + S_{err}$$

The term S_reg on the right is called the regression sum of squares; it is caused by the change of x, which in turn changes y. The remainder sum of squares S_rem and the error sum of squares S_err are caused by other random factors or by an inappropriate fit of the regression line.
Similar to one-way ANOVA, the ANOVA table for univariate linear regression is as follows:

source of variance | sum of squares | degrees of freedom | mean square | statistic | critical value | statistical inference
regression | S_reg | 1 | s_reg² = S_reg/1 | F_2 = s_reg²/s_err² | F_α(1, n(k−1)) | when F ≤ F_α, the effect is considered insignificant; when F > F_α, the effect is considered significant
remainder | S_rem | n − 2 | s_rem² = S_rem/(n − 2) | F_1 = s_rem²/s_err² | F_α(n − 2, n(k−1)) | (same rule)
error | S_err | n(k − 1) | s_err² = S_err/(n(k−1)) | | |
total sum of squares | S_total | nk − 1 | | | |
In the test of the remainder: if its effect is not significant, the remainder sum of squares is essentially caused by random factors such as experimental error. If its effect is significant, there may be other factors that cannot be ignored, or x and y may not be linearly related, or may not be related at all; in that case the regression line obtained cannot describe the relationship between x and y, and the cause must be identified and the line refitted.
In the test of the regression: if its effect is significant, there is a linear relationship between x and y; if it is not significant, the line must be refitted.
S_total, S_reg, S_rem, and S_err are calculated by the following formulas (the data may first be integerized):

$$S_{total} = \sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}^2 - \frac{1}{nk}\Big(\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}\Big)^2$$

$$S_{reg} = \frac{l_{xy}^2}{l_{xx}}$$

$$S_{rem} = \frac{1}{k}\sum_{i=1}^{n}\Big(\sum_{j=1}^{k} y_{ij}\Big)^2 - \frac{1}{nk}\Big(\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}\Big)^2 - S_{reg}$$

$$S_{err} = S_{total} - S_{reg} - S_{rem}$$

where

$$l_{xx} = k\Big[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\Big], \qquad l_{xy} = \sum_{i=1}^{n} x_i \sum_{j=1}^{k} y_{ij} - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij}\Big)$$
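A numpy sketch of this decomposition and of the two F ratios from the table above (the data array is invented; rows index the x_i, columns the k repeats):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([[1.0, 1.2],    # k = 2 repeated observations at each x_i
              [2.1, 1.9],
              [2.9, 3.2],
              [4.1, 3.8]])
n, k = y.shape

xbar = x.mean()
row_sums = y.sum(axis=1)

lxx = k * np.sum((x - xbar) ** 2)
lxy = np.sum((x - xbar) * row_sums)   # the (Σx)(ΣΣy)/n term cancels here

S_total = np.sum((y - y.mean()) ** 2)
S_reg = lxy ** 2 / lxx
S_rem = np.sum(row_sums ** 2) / k - y.sum() ** 2 / (n * k) - S_reg
S_err = S_total - S_reg - S_rem

F1 = (S_rem / (n - 2)) / (S_err / (n * (k - 1)))   # test of the remainder
F2 = S_reg / (S_err / (n * (k - 1)))               # test of the regression
print(F1, F2)
```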
3. Parabolic regression
Given a set of observations (x_i, y_i) (i = 1, 2, …, n) that exhibit a parabolic relationship, a polynomial of degree m (m ≥ 2)

$$p(x) = a_0 + a_1 x + \dots + a_m x^m$$

can be used to fit them. According to the principle of least squares, the coefficients should make

$$Q = \sum_{i=1}^{n} \left[ y_i - p(x_i) \right]^2$$

a minimum.
In particular, if p(x) is taken to be the quadratic polynomial

$$\hat{y} = a + bx + cx^2$$

then the coefficients a, b, c satisfy the equations

$$\begin{aligned}
na + b\sum x_i + c\sum x_i^2 &= \sum y_i \\
a\sum x_i + b\sum x_i^2 + c\sum x_i^3 &= \sum x_i y_i \\
a\sum x_i^2 + b\sum x_i^3 + c\sum x_i^4 &= \sum x_i^2 y_i
\end{aligned}$$

where all sums run over i = 1, 2, …, n.
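A numpy sketch that solves exactly this 3×3 system and cross-checks it against numpy's built-in polynomial fit (data invented):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 5.2, 10.1, 17.3])

# Normal equations for y = a + b*x + c*x^2, as written above.
A = np.array([
    [len(x),       x.sum(),      (x**2).sum()],
    [x.sum(),      (x**2).sum(), (x**3).sum()],
    [(x**2).sum(), (x**3).sum(), (x**4).sum()],
])
rhs = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(a, b, c)
print(np.polyfit(x, y, 2))   # same coefficients, highest degree first
```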
4. Curve regression that can be transformed into linear regression
If the observations, plotted on graph paper, form a curved distribution, an appropriate variable substitution can often be made so that a linear regression can be performed on the two new variables; afterwards the result is transformed back to the original variables.
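For instance, for the power curve y = a x^b (type 1° in the list below), taking logarithms gives lg y = lg a + b lg x, a straight line in X = lg x, Y = lg y. A sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.1, 6.3, 11.9, 24.5, 47.8])   # roughly y = 3 * x

X, Y = np.log10(x), np.log10(y)

# Ordinary univariate regression on the transformed variables.
Xbar, Ybar = X.mean(), Y.mean()
b = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
lg_a = Ybar - b * Xbar

a = 10 ** lg_a                     # back to the original variables
print(f"y = {a:.3f} * x^{b:.3f}")
```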
Common curve types that can be straightened (each substitution produces a straight line Y = A + BX whose coefficients are found by the univariate regression method, after which A and B are converted back into the parameters of the curve):

1° Power curve y = a x^b. Let X = lg x, Y = lg y; then Y = lg a + bX, and the points (x, y) lie on a straight line on double logarithmic paper.

2° Exponential curve y = a e^{bx}. Let X = x, Y = ln y; then Y = ln a + bX, and the points (x, y) lie on a straight line on semilogarithmic paper.

3° Exponential curve y = a e^{b/x}. Let X = 1/x, Y = ln y; then Y = ln a + bX, and the points (1/x, y) lie on a straight line on semilogarithmic paper.

4° Hyperbolic curve 1/y = a + b/x. Let X = 1/x, Y = 1/y; then Y = a + bX.

5° Logarithmic curve y = a + b lg x. Let X = lg x, Y = y; then Y = a + bX.

6° S-shaped curve y = 1/(a + b e^{−x}). Let X = e^{−x}, Y = 1/y; then Y = a + bX.

7° Curve of the same type as 1°, but moved in the direction of the y axis: y = c + a x^b. First determine c: take three points (x_1, y_1), (x_2, y_2), (x_3, y_3) on the given curve with x_2² = x_1 x_3; then

$$c = \frac{y_1 y_3 - y_2^2}{y_1 + y_3 - 2 y_2}$$

After c is determined, let X = lg x, Y = lg(y − c); then Y = lg a + bX.

8° Curve of the same type as 2°, but moved in the direction of the y axis: y = c + a e^{bx}. First take three points on the given curve with equally spaced abscissas, x_3 − x_2 = x_2 − x_1; c is then given by the same formula as in 7°. After c is determined, let X = x, Y = ln(y − c); then Y = ln a + bX.

9° If the given x values form an arithmetic progression with common difference h, a curve of the form y = c + a ρ^x can be straightened through successive differences: let u_i = y_{i+1} − y_i; then lg u_i = lg[a(ρ^h − 1)] + (lg ρ) x_i, a straight line in x_i, from which ρ and a are found; c is then obtained from the original equation.

10° If the given x values form an arithmetic progression with common difference h, a curve of the form y = c + a ρ^x + b σ^x can be handled similarly: let v_1 and v_2 denote the y values corresponding to x + h and x + 2h. Then v_2, v_1 and y satisfy a linear relation v_2 = α v_1 + β y + γ whose coefficients are found by the regression-line method; ρ^h and σ^h are the roots of the quadratic equation t² − αt − β = 0, after which a, b and c are determined by regression on ρ^x and σ^x.
5. Binary linear regression
[Regression equation] Let y_i be the observed value of y corresponding to the values x_1i, x_2i of the independent variables x_1 and x_2, so that n data points (x_1i, x_2i; y_i) (i = 1, 2, …, n) are obtained. The regression equation is

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

where b_1, b_2 are the regression coefficients, determined by the equations

$$\begin{aligned} l_{11} b_1 + l_{12} b_2 &= l_{01} \\ l_{12} b_1 + l_{22} b_2 &= l_{02} \end{aligned} \qquad (1)$$

Here

$$l_{jk} = \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k) \quad (j, k = 1, 2), \qquad l_{0j} = \sum_{i=1}^{n} (y_i - \bar{y})(x_{ji} - \bar{x}_j) \quad (j = 1, 2) \qquad (2)$$

where x̄_1, x̄_2, ȳ are the means of the x_1i, x_2i, y_i. A data translation (without integerization) may be used to simplify the calculation, that is,

$$x_{ji}' = x_{ji} - c_j \quad (j = 1, 2), \qquad y_i' = y_i - c_0$$

which leaves the quantities in (2) unchanged. The constant term in the regression equation is

$$b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (3)$$
[Multiple correlation coefficient and partial correlation coefficient]

$$R = \sqrt{ \frac{b_1 l_{01} + b_2 l_{02}}{l_{00}} }$$

is called the multiple (complex) correlation coefficient, where

$$l_{00} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (4)$$

and l_01, l_02 are shown in (2). The multiple correlation coefficient satisfies 0 ≤ R ≤ 1; its meaning is similar to that of the correlation coefficient r in univariate linear regression analysis: it measures the closeness of the linear relationship between y and the pair x_1, x_2.
If one wants to express the correlation between y and only one of the variables (x_1 or x_2), the influence of the other variable must first be removed before their correlation coefficient is computed; the result is called a partial correlation coefficient. The correlation coefficient of x_1 and y after removing the influence of x_2 is called the partial correlation coefficient of x_1, y with respect to x_2, denoted r_{1y·2}; it can be expressed through the ordinary correlation coefficients:

$$r_{1y\cdot 2} = \frac{r_{1y} - r_{12} r_{2y}}{\sqrt{(1 - r_{12}^2)(1 - r_{2y}^2)}}$$

where r_{12} = l_{12}/√(l_{11} l_{22}), r_{1y} = l_{01}/√(l_{11} l_{00}), r_{2y} = l_{02}/√(l_{22} l_{00}). Similarly, the partial correlation coefficient of x_2, y with respect to x_1 is

$$r_{2y\cdot 1} = \frac{r_{2y} - r_{12} r_{1y}}{\sqrt{(1 - r_{12}^2)(1 - r_{1y}^2)}}$$
[Residual standard deviation]

$$s = \sqrt{ \frac{Q}{n - 3} }, \qquad Q = l_{00} - b_1 l_{01} - b_2 l_{02}$$

is called the residual standard deviation; its meaning is similar to that of the residual standard deviation s in univariate linear regression analysis.
[Standard regression coefficients and partial regression sums of squares] When the relationship between the two factors x_1 and x_2 themselves is not close, the following methods can be used to decide which factor is the main one.

1° The quantities

$$B_j = b_j \sqrt{ \frac{l_{jj}}{l_{00}} } \qquad (j = 1, 2)$$

are called standard regression coefficients, where b_1, b_2 are the regression coefficients, l_11, l_22 are shown in (2), and l_00 in (4). If |B_1| > |B_2|, then of the two factors affecting the variable y, x_1 is the main one and x_2 the secondary one.
2° The quantities

$$p_1 = b_1^2 \left( l_{11} - \frac{l_{12}^2}{l_{22}} \right), \qquad p_2 = b_2^2 \left( l_{22} - \frac{l_{12}^2}{l_{11}} \right)$$

are called partial regression sums of squares, where b_1, b_2 are the regression coefficients and l_11, l_12, l_22 are shown in (2). If p_1 > p_2, then x_1 is the main factor and x_2 the secondary one.
[t-values]

$$t_1 = \frac{\sqrt{p_1}}{s}, \qquad t_2 = \frac{\sqrt{p_2}}{s}$$

are called the t values of x_1 and x_2 respectively, where s is the residual standard deviation and p_1, p_2 are the partial regression sums of squares. The larger the t value, the more important the factor. Empirically, when t_i > 1 the factor x_i has some influence on y; when t_i > 2 the factor x_i is regarded as important; when t_i < 1 the factor x_i has little effect on y and can be dropped from the regression calculation.
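The whole binary analysis fits in a few lines of numpy; the following sketch (with invented data) computes the coefficients from the normal equations (1) and the derived quantities defined above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([4.9, 5.1, 9.2, 9.0, 13.3, 12.8])
n = len(y)

d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
l11, l22, l12 = d1 @ d1, d2 @ d2, d1 @ d2
l01, l02, l00 = dy @ d1, dy @ d2, dy @ dy

# Normal equations (1) for the regression coefficients.
b1, b2 = np.linalg.solve([[l11, l12], [l12, l22]], [l01, l02])
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()       # constant term (3)

R = np.sqrt((b1 * l01 + b2 * l02) / l00)              # multiple corr. coeff.
s = np.sqrt((l00 - b1 * l01 - b2 * l02) / (n - 3))    # residual std. dev.

p1 = b1 ** 2 * (l11 - l12 ** 2 / l22)   # partial regression sums of squares
p2 = b2 ** 2 * (l22 - l12 ** 2 / l11)
t1, t2 = np.sqrt(p1) / s, np.sqrt(p2) / s
print(b0, b1, b2, R, s, t1, t2)
```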
[Binary linear regression calculation table] The x_ji in the table are the simplified (translated) data.

serial number | x_1i | x_2i | y_i | x_1i² | x_2i² | x_1i x_2i | x_1i y_i | x_2i y_i | y_i²
1 | x_11 | x_21 | y_1 | x_11² | x_21² | x_11 x_21 | x_11 y_1 | x_21 y_1 | y_1²
2 | x_12 | x_22 | y_2 | x_12² | x_22² | x_12 x_22 | x_12 y_2 | x_22 y_2 | y_2²
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
n | x_1n | x_2n | y_n | x_1n² | x_2n² | x_1n x_2n | x_1n y_n | x_2n y_n | y_n²
sum | Σx_1i | Σx_2i | Σy_i | Σx_1i² | Σx_2i² | Σx_1i x_2i | Σx_1i y_i | Σx_2i y_i | Σy_i²
results | x̄_1 = (1/n)Σx_1i | x̄_2 = (1/n)Σx_2i | ȳ = (1/n)Σy_i | l_11 | l_22 | l_12 | l_01 | l_02 | l_00
From these sums, l_11, l_22, l_12, l_01, l_02, l_00 are calculated according to (2) and (4); then b_1, b_2 from (1) and b_0 from (3), giving the regression equation

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

One then continues by calculating the multiple correlation coefficient R, the standard regression coefficients B_1 and B_2, the partial regression sums of squares p_1, p_2, and the t values t_1 and t_2, and performs the binary regression analysis on the basis of these quantities.
For binary nonlinear regression problems, an appropriate variable substitution can be made so that the new variables are linearly related, after which the above regression analysis can be carried out.
6. Multiple linear regression
Consider the relationship between the independent variables x_1, x_2, …, x_m and the dependent variable y. Perform n experiments with observed values (x_1i, x_2i, …, x_mi; y_i) (i = 1, 2, …, n), and let

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ji}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$l_{jk} = \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k) \quad (j, k = 1, 2, \dots, m)$$

$$l_{0j} = \sum_{i=1}^{n} (y_i - \bar{y})(x_{ji} - \bar{x}_j) \quad (j = 1, 2, \dots, m), \qquad l_{00} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Form the matrix

$$L = (l_{jk})_{m \times m}$$

and its inverse matrix

$$C = L^{-1} = (c_{jk})_{m \times m}$$
[Regression equation]

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$

where b_1, b_2, …, b_m are the regression coefficients; in vector form

$$\boldsymbol{b} = (b_1, b_2, \dots, b_m)^T = C \boldsymbol{l}$$

where

$$\boldsymbol{l} = (l_{01}, l_{02}, \dots, l_{0m})^T$$

The constant term is

$$b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 - \dots - b_m \bar{x}_m$$
[Multiple correlation coefficient]

$$R = \sqrt{ \frac{\sum_{j=1}^{m} b_j l_{0j}}{l_{00}} }$$
[Residual standard deviation]

$$s = \sqrt{ \frac{Q}{n - m - 1} }, \qquad Q = l_{00} - \sum_{j=1}^{m} b_j l_{0j}$$
[ANOVA table for multiple linear regression]

source of variance | sum of squares | degrees of freedom | mean square | statistic | critical value | statistical inference
regression | U = Σ_{j=1}^m b_j l_0j | m | U/m | F = (U/m) / (Q/(n−m−1)) | F_α(m, n−m−1) | when F > F_α, the regression is considered significant and the linear correlation close; when F ≤ F_α, the regression is considered insignificant and the linear correlation not close
residual | Q = l_00 − U | n − m − 1 | Q/(n−m−1) | | |
total sum of squares | l_00 | n − 1 | | | |
[Standard regression coefficients and partial regression sums of squares]

Standard regression coefficients:

$$B_j = b_j \sqrt{ \frac{l_{jj}}{l_{00}} } \qquad (j = 1, 2, \dots, m)$$

Partial regression sums of squares:

$$p_j = \frac{b_j^2}{c_{jj}} \qquad (j = 1, 2, \dots, m)$$
[t-values]

$$t_j = \frac{\sqrt{p_j}}{s} \qquad (j = 1, 2, \dots, m)$$

The multiple linear regression analysis proceeds as in the binary case; the amount of computation is larger, however, and is best carried out on a computer.
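As a closing sketch, the matrix formulation above translates directly into numpy (synthetic data; m = 3 regressors):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 3
X = rng.normal(size=(n, m))                 # columns are x_1 .. x_m
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()
L = Xc.T @ Xc                    # matrix of the l_jk
l0 = Xc.T @ yc                   # vector of the l_0j
l00 = yc @ yc

C = np.linalg.inv(L)             # inverse matrix (c_jk)
b = C @ l0                       # regression coefficients b_1 .. b_m
b0 = y.mean() - X.mean(axis=0) @ b

U = b @ l0                       # regression sum of squares
Q = l00 - U                      # residual sum of squares
R = np.sqrt(U / l00)             # multiple correlation coefficient
s = np.sqrt(Q / (n - m - 1))     # residual standard deviation
F = (U / m) / (Q / (n - m - 1))  # F statistic from the ANOVA table

B = b * np.sqrt(np.diag(L) / l00)   # standard regression coefficients
p = b ** 2 / np.diag(C)             # partial regression sums of squares
t = np.sqrt(p) / s                  # t values
print(b0, b, R, s, F, t)
```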