4. Regression Analysis
1. The principle of least squares
Let u be a function of the variables x, y, …, containing m parameters a_1, a_2, …, a_m, namely

    u = f(a_1, a_2, …, a_m; x, y, …)
Now make n observations of u and x, y, …: (x_i, y_i, …; u_i) (i = 1, 2, …, n). The error between the theoretical value of u and the observed value u_i is

    ε_i = f(a_1, a_2, …, a_m; x_i, y_i, …) − u_i    (i = 1, 2, …, n)
The so-called least squares method requires these n errors to have the smallest possible sum of squares, so that the function u = f(a_1, a_2, …, a_m; x, y, …) best fits the observed values u_1, u_2, …, u_n. That is, the parameters a_1, a_2, …, a_m should make

    Q = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} [f(a_1, a_2, …, a_m; x_i, y_i, …) − u_i]²

a minimum.
According to the method of finding extreme values in differential calculus, a_1, a_2, …, a_m must satisfy the equations

    ∂Q/∂a_j = 0    (j = 1, 2, …, m)
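For a model that is linear in its parameters, the extremum conditions above are the normal equations (GᵀG)a = Gᵀu. A minimal NumPy sketch, with the basis functions and data invented for illustration (here u = a_1 + a_2 x + a_3 x²):

```python
import numpy as np

# Least-squares fit of a model linear in its parameters,
# u = a1*g1(x) + a2*g2(x) + a3*g3(x) with g = (1, x, x^2).
# Minimizing Q = sum_i (f(a; x_i) - u_i)^2 leads to the normal
# equations (G^T G) a = G^T u, solved here via lstsq.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
u = np.array([1.1, 2.9, 9.2, 19.1, 33.0])

# Design matrix: one column per basis function g_j(x)
G = np.column_stack([np.ones_like(x), x, x**2])

# lstsq is the numerically stable way to solve the normal equations
a, *_ = np.linalg.lstsq(G, u, rcond=None)
residual = u - G @ a
print(a)                    # fitted parameters a1, a2, a3
print(np.sum(residual**2))  # minimized sum of squared errors Q
```

The same pattern applies to any choice of basis functions g_j; only the columns of G change.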
2. Univariate Linear Regression
[ Univariate regression equation ] The observed values of the variable y corresponding to the independent variable x are

    x:  x_1, x_2, …, x_n
    y:  y_1, y_2, …, y_n
If the variables are approximately linearly related, a straight line

    ŷ = a + b x

can be used to fit the relationship between them. By the least squares method, a and b should make

    Q(a, b) = Σ_{i=1}^{n} (y_i − a − b x_i)²

a minimum, which gives

    b = l_xy / l_xx,    a = ȳ − b x̄

in the formula

    x̄ = (1/n) Σ x_i,    ȳ = (1/n) Σ y_i
    l_xx = Σ (x_i − x̄)²,    l_yy = Σ (y_i − ȳ)²,    l_xy = Σ (x_i − x̄)(y_i − ȳ)

The equation ŷ = a + b x is called the regression equation (or regression line), and b is called the regression coefficient.
[ Correlation coefficient and its test table ] The correlation coefficient r_xy reflects the closeness of the linear relationship between the variables x and y; it is defined by

    r_xy = l_xy / √(l_xx l_yy)

(when no confusion can arise, r_xy is abbreviated to r). Obviously |r| ≤ 1. When |r| = 1, the variables are completely linearly correlated; when r = 0, there is no linear correlation; the closer |r| is to 1, the stronger the linear correlation.

The following table gives the minimum significant value of the correlation coefficient (it depends on the number of observations n and the given significance level α). When |r| is greater than the corresponding value in the table, the fitted straight line is meaningful.
    n−2   α=5%   α=1%     n−2   α=5%   α=1%     n−2    α=5%   α=1%
    1     0.997  1.000    16    0.468  0.590    35     0.325  0.418
    2     0.950  0.990    17    0.456  0.575    40     0.304  0.393
    3     0.878  0.959    18    0.444  0.561    45     0.288  0.372
    4     0.811  0.917    19    0.433  0.549    50     0.273  0.354
    5     0.754  0.874    20    0.423  0.537    60     0.250  0.325
    6     0.707  0.834    21    0.413  0.526    70     0.232  0.302
    7     0.666  0.798    22    0.404  0.515    80     0.217  0.283
    8     0.632  0.765    23    0.396  0.506    90     0.205  0.267
    9     0.602  0.735    24    0.388  0.496    100    0.195  0.254
    10    0.576  0.708    25    0.381  0.487    125    0.174  0.228
    11    0.553  0.684    26    0.374  0.478    150    0.159  0.208
    12    0.532  0.661    27    0.367  0.470    200    0.138  0.181
    13    0.514  0.641    28    0.361  0.463    300    0.113  0.148
    14    0.497  0.623    29    0.355  0.456    400    0.098  0.128
    15    0.482  0.606    30    0.349  0.449    1000   0.062  0.081
Note: when the number of observations n is large, the correlation coefficient can be approximated as follows. Plot the observation pairs (x_i, y_i) (i = 1, 2, …, n) on graph paper. First draw a horizontal line so that the numbers of points above and below it are equal, then draw a vertical line so that the numbers of points to its left and right are equal (try to make no points fall on either line). These two lines divide the plane into four quadrants (Fig. 16.5). Let the numbers of points in the upper-right, upper-left, lower-left and lower-right quadrants be n_1, n_2, n_3, n_4 respectively, and set

    n₊ = n_1 + n_3,    n₋ = n_2 + n_4

Then the correlation coefficient is approximately

    r ≈ sin( (π/2) · (n₊ − n₋)/(n₊ + n₋) )
[ Residual standard deviation ]

    s = √( Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2) ) = √( (l_yy − b l_xy) / (n − 2) )

is called the residual standard deviation; it describes the precision of the regression line: for each x in the experimental range, 95.4% of the y values fall between the two parallel lines

    y = a + b x ± 2s

(Fig. 16.6), and 99.7% of the y values fall between the two parallel lines

    y = a + b x ± 3s
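The quantities above (b, a, r, s) can be computed directly from the l sums. A NumPy sketch with invented data:

```python
import numpy as np

# Univariate linear regression y ~ a + b*x via the l_xx, l_yy, l_xy sums,
# plus the correlation coefficient r and residual standard deviation s.
# The data are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

n = len(x)
lxx = np.sum((x - x.mean())**2)
lyy = np.sum((y - y.mean())**2)
lxy = np.sum((x - x.mean()) * (y - y.mean()))

b = lxy / lxx                            # regression coefficient
a = y.mean() - b * x.mean()              # constant term
r = lxy / np.sqrt(lxx * lyy)             # correlation coefficient
s = np.sqrt((lyy - b * lxy) / (n - 2))   # residual standard deviation

print(f"y = {a:.3f} + {b:.3f} x, r = {r:.4f}, s = {s:.4f}")
```

Here n − 2 = 4, so by the table above the fit is meaningful at α = 1% whenever |r| > 0.917.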
[ Calculation steps of univariate regression ] For convenience of calculation, rewrite l_xx, l_yy, l_xy as

    l_xx = Σ x_i² − (1/n)(Σ x_i)²
    l_yy = Σ y_i² − (1/n)(Σ y_i)²
    l_xy = Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)

and, if convenient, integerize the data, i.e. substitute

    x_i' = (x_i − c_1)/d_1,    y_i' = (y_i − c_2)/d_2

with suitable constants c_1, c_2, d_1, d_2. After integerization we have

    l_xx = d_1² l_x'x',    l_yy = d_2² l_y'y',    l_xy = d_1 d_2 l_x'y'
    b = (d_2/d_1) b',    a = c_2 + d_2 a' − b c_1,    r = r',    s = d_2 s'

where the primed quantities are computed from the integerized data.
The calculation is then arranged in a table with one row per observation (i = 1, 2, …, n) and columns

    x_i,  y_i,  x_i²,  y_i²,  x_i y_i

whose column totals give Σ x_i, Σ y_i, Σ x_i², Σ y_i², Σ x_i y_i. From these totals one computes l_xx, l_yy, l_xy, and then the results:

    regression coefficient        b = l_xy / l_xx
    constant term                 a = ȳ − b x̄
    regression equation           ŷ = a + b x
    correlation coefficient       r = l_xy / √(l_xx l_yy)
    residual standard deviation   s = √((l_yy − b l_xy)/(n − 2))
[ Analysis of variance for univariate linear regression ] Regard the independent variable x as a single factor with n levels x_1, x_2, …, x_n, and at each level make k replicate observations y_ij (i = 1, 2, …, n; j = 1, 2, …, k), recorded as follows:

    x:     x_1             x_2             …    x_n
    y_ij:  y_11, …, y_1k   y_21, …, y_2k   …    y_n1, …, y_nk

Fit the pairs (x_i, ȳ_i), where ȳ_i = (1/k) Σ_j y_ij, with the regression equation

    ŷ = a + b x

The total sum of squares of y is

    S_total = Σ_{i=1}^{n} Σ_{j=1}^{k} (y_ij − ȳ)²,    ȳ = (1/nk) Σ_i Σ_j y_ij

which decomposes as

    S_total = S_reg + S_rem + S_err

The term S_reg on the right is called the regression sum of squares; it is the part caused by the change of x, which in turn changes y. S_rem, the remainder (lack-of-fit) sum of squares, is caused by other random factors or by an inappropriate fit of the regression line, and S_err, the error sum of squares, reflects the scatter of the replicate observations within each level.
Similar to one-way ANOVA, the ANOVA table for one-way linear regression is as follows:

    Regression:  sum of squares S_reg,  degrees of freedom 1,      mean square S_reg/1
                 statistic F_reg = (S_reg/1) / (S_err/(nk − n)),   critical value F_α(1, nk − n)
    Remainder:   sum of squares S_rem,  degrees of freedom n − 2,  mean square S_rem/(n − 2)
                 statistic F_rem = (S_rem/(n − 2)) / (S_err/(nk − n)),  critical value F_α(n − 2, nk − n)
    Error:       sum of squares S_err,  degrees of freedom nk − n, mean square S_err/(nk − n)
    Total:       sum of squares S_total, degrees of freedom nk − 1

Statistical inference: when a computed statistic does not exceed its critical value, the corresponding effect is considered insignificant; when it exceeds the critical value, the effect is considered significant.
When testing S_rem: if its effect is not significant, the remainder sum of squares is basically caused by random factors such as experimental error; if it is significant, there may be other factors that cannot be ignored, or x and y are not linearly related, or x and y are unrelated. In that case the regression line obtained cannot describe the relationship between x and y, and it is necessary to identify the cause and refit.
When testing S_reg: if its effect is significant, there is a linear relationship between x and y; if it is not significant, the line must be refitted.
S_total, S_reg, S_rem, and S_err are calculated according to the following formulas (the data may be integerized first):

    S_total = Σ_i Σ_j y_ij² − (1/nk)(Σ_i Σ_j y_ij)²
    S_reg = l_xy² / l_xx
    S_err = Σ_i Σ_j y_ij² − (1/k) Σ_i (Σ_j y_ij)²
    S_rem = S_total − S_reg − S_err

in the formula, with each level x_i counted k times,

    l_xx = k [Σ_i x_i² − (1/n)(Σ_i x_i)²]
    l_xy = Σ_i x_i (Σ_j y_ij) − (1/n)(Σ_i x_i)(Σ_i Σ_j y_ij)
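The decomposition above can be checked numerically. A minimal NumPy sketch, with the levels, replicate count, and data all invented:

```python
import numpy as np

# Sum-of-squares decomposition for one-way linear-regression ANOVA:
# n levels of x, k replicate observations y_ij at each level.
x = np.array([1.0, 2.0, 3.0, 4.0])             # n = 4 levels
y = np.array([[2.0, 2.2, 1.9],                 # k = 3 replicates per level
              [4.1, 3.8, 4.0],
              [6.2, 5.9, 6.1],
              [7.9, 8.2, 8.0]])
n, k = y.shape
ybar = y.mean()

# l_xx and l_xy with each x_i counted k times
lxx = k * np.sum((x - x.mean())**2)
lxy = np.sum(k * (x - x.mean()) * (y.mean(axis=1) - ybar))
b = lxy / lxx

S_total = np.sum((y - ybar)**2)                          # df = nk - 1
S_reg = lxy**2 / lxx                                     # df = 1
S_err = np.sum((y - y.mean(axis=1, keepdims=True))**2)   # within-level, df = nk - n
S_rem = S_total - S_reg - S_err                          # lack of fit, df = n - 2
print(b, S_total, S_reg, S_rem, S_err)
```

For these data the regression sum of squares dominates and the lack-of-fit term is tiny, i.e. the straight line fits well.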
3. Parabolic regression
Given a set of observations (x_i, y_i) (i = 1, 2, …, n), if they exhibit a parabolic trend, a polynomial of degree m (m ≥ 2)

    p(x) = a_0 + a_1 x + … + a_m x^m

can be used to fit them. By the principle of least squares, the coefficients should make

    Q = Σ_{i=1}^{n} [y_i − p(x_i)]²

a minimum.
In particular, if p(x) is taken as the quadratic polynomial

    p(x) = a + b x + c x²

then the coefficients a, b, c satisfy the normal equations

    n a     + b Σ x_i  + c Σ x_i² = Σ y_i
    a Σ x_i  + b Σ x_i² + c Σ x_i³ = Σ x_i y_i
    a Σ x_i² + b Σ x_i³ + c Σ x_i⁴ = Σ x_i² y_i

in which every sum runs over i = 1, 2, …, n.
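The 3×3 system above can be solved directly. A NumPy sketch with invented data lying near y = 1 + 0.5x + 2x²:

```python
import numpy as np

# Quadratic regression y ~ a + b*x + c*x^2 by solving the 3x3
# normal equations written above.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 8.1,  2.4, 1.1, 3.6, 9.9, 20.4])
n = len(x)

# Coefficient matrix and right-hand side of the normal equations
A = np.array([[n,            x.sum(),       (x**2).sum()],
              [x.sum(),      (x**2).sum(),  (x**3).sum()],
              [(x**2).sum(), (x**3).sum(),  (x**4).sum()]])
rhs = np.array([y.sum(), (x*y).sum(), (x**2*y).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(f"y = {a:.3f} + {b:.3f} x + {c:.3f} x^2")
```

For higher degrees the same system grows to (m+1)×(m+1); in practice `np.polyfit(x, y, m)` does the equivalent computation more stably.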
4. Curve regression that can be transformed into linear regression
If the observations, when plotted on graph paper, are distributed along a curve, an appropriate change of variables can often be made so that the two new variables are linearly related; a linear regression is then performed on the new variables, and the result is transformed back to the original variables.
Common curve types that can be straightened (curve type, followed by the linearizing substitution):

1°  y = a x^b.  Set X = lg x, Y = lg y; then Y = lg a + b X, and the points (x, y) lie on a straight line on double-logarithmic paper.
2°  y = a e^{bx}.  Set X = x, Y = ln y; then Y = ln a + b x, and the points (x, y) lie on a straight line on semi-logarithmic paper.
3°  … .  Set …; then the points (x, y) again lie on a straight line on logarithmic paper.
4°  … .  Set …; then … .
5°  … .  Set …; then … .
6°  The curve is of the same type as 1° but shifted in the direction of the y axis: y = c + a x^b.  First take three points (x_1, y_1), (x_2, y_2), (x_3, y_3) on the given curve with x_2 = √(x_1 x_3); then c = (y_1 y_3 − y_2²)/(y_1 + y_3 − 2 y_2).  After c is determined, set X = lg x, Y = lg(y − c) and proceed as in 1°.
7°  The curve is of the same type as 2° but shifted in the direction of the y axis: y = c + a e^{bx}.  First take three points (x_1, y_1), (x_2, y_2), (x_3, y_3) on the given curve with x_2 = (x_1 + x_3)/2; then c = (y_1 y_3 − y_2²)/(y_1 + y_3 − 2 y_2).  After c is determined, set X = x, Y = ln(y − c) and proceed as in 2°.
8°  … .  Set …; then … .
9°  … .  Set …; then … .
10° … .  Take a point (x_0, y_0) on the curve and set X = x, Y = …; then, using the regression-line method, A and B can be determined from the given data.
11° … .  Take a point (x_0, y_0) on the curve and set X = x, Y = …; then … .
12° … .  Set X = x, Y = …; the curve is transformed into type 11°.
13° … .  Set X = x, Y = y²; the curve is transformed into type 11°.
14° … .  Set X = x, Y = …; the curve is transformed into type 11°.
15° … .  Set …; the curve is transformed into type 11°.
16° … .  Set X = x, Y = …; the curve is transformed into type 11°.
17° … .  If the given x values form an arithmetic progression with common difference h, set …; the points then lie on a straight line.
18° … .  If the given x values form an arithmetic progression with common difference h, let u_1 = x + h, u_2 = x + 2h, and let v_1, v_2 be the corresponding y values; set …, obtaining …; then determine b and d by the regression-line method, then set …, obtaining … .
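As a worked example of straightening, type 1° (y = a x^b) is fitted by regressing lg y on lg x. A NumPy sketch with data invented near y = 3 x^{1.5} (natural logs work equally well; only the back-transform changes):

```python
import numpy as np

# Straighten a power-law curve y = a * x**b: set X = log x, Y = log y,
# fit the line Y = log a + b*X, then map the intercept back to a.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.0, 8.6, 24.1, 67.5, 193.0])

X, Y = np.log(x), np.log(y)
b, loga = np.polyfit(X, Y, 1)   # slope and intercept of the straightened line
a = np.exp(loga)                # back-transform the intercept
print(f"y = {a:.3f} * x^{b:.3f}")
```

Note that least squares applied to the transformed variables minimizes squared errors in log y, not in y; for most handbook-style fits this distinction is acceptable.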
5. Binary Linear Regression
[ Regression equation ] Let y_i be the value of y corresponding to the values x_1i, x_2i of the independent variables x_1 and x_2, so that n points (x_1i, x_2i; y_i) (i = 1, 2, …, n) are obtained. The regression equation is

    ŷ = b_0 + b_1 x_1 + b_2 x_2                        (1)

where b_1, b_2 are the regression coefficients, determined by the equations

    l_11 b_1 + l_12 b_2 = l_10
    l_21 b_1 + l_22 b_2 = l_20                         (3)

here

    x̄_1 = (1/n) Σ x_1i,  x̄_2 = (1/n) Σ x_2i,  ȳ = (1/n) Σ y_i
    l_11 = Σ (x_1i − x̄_1)²,  l_22 = Σ (x_2i − x̄_2)²
    l_12 = l_21 = Σ (x_1i − x̄_1)(x_2i − x̄_2)
    l_10 = Σ (x_1i − x̄_1)(y_i − ȳ),  l_20 = Σ (x_2i − x̄_2)(y_i − ȳ)    (2)

and

    l_00 = Σ (y_i − ȳ)²                                (4)

To simplify the calculation, the data may first be translated (without integerization), i.e.

    x_1i' = x_1i − c_1,  x_2i' = x_2i − c_2,  y_i' = y_i − c_0

which leaves all the l's, and hence b_1 and b_2, unchanged.

The constant term in formula (1) is

    b_0 = ȳ − b_1 x̄_1 − b_2 x̄_2
[ Multiple correlation coefficient and partial correlation coefficients ]

    R = √( (b_1 l_10 + b_2 l_20) / l_00 )

is called the multiple correlation coefficient, where l_10, l_20 are shown in (2) and l_00 in (4). The multiple correlation coefficient satisfies 0 ≤ R ≤ 1; its meaning is similar to that of the correlation coefficient r in univariate linear regression analysis: it measures the closeness of the linear relationship between y and x_1, x_2 jointly.

If one wants to express the correlation between y and only one of the variables (x_1 or x_2), the influence of the other variable must first be removed before computing their correlation coefficient; the result is called a partial correlation coefficient. The correlation coefficient of x_1 and y after removing the influence of x_2 is called the partial correlation coefficient of x_1 and y with respect to x_2, denoted r_{1y·2}; it can be expressed through the ordinary correlation coefficients:

    r_{1y·2} = (r_{1y} − r_{12} r_{2y}) / √((1 − r_{12}²)(1 − r_{2y}²))

Similarly, the partial correlation coefficient of x_2 and y with respect to x_1 is

    r_{2y·1} = (r_{2y} − r_{12} r_{1y}) / √((1 − r_{12}²)(1 − r_{1y}²))
[ Residual standard deviation ]

    s = √( (l_00 − b_1 l_10 − b_2 l_20) / (n − 3) )

is called the residual standard deviation; its meaning is similar to that of the residual standard deviation s in univariate linear regression analysis.
[ Standard regression coefficients and partial regression sums of squares ] When the two factors x_1 and x_2 are not closely related to each other, the following methods can be used to decide which factor is the main one.

1° The quantities

    B_1 = b_1 √(l_11/l_00),    B_2 = b_2 √(l_22/l_00)

are called the standard regression coefficients, where b_1, b_2 are the regression coefficients, l_11, l_22 are shown in (2), and l_00 in (4). If |B_1| > |B_2|, then of the two factors affecting the variable y, x_1 is the main factor and x_2 the secondary one.

2° The quantities

    p_1 = b_1² (l_11 − l_12²/l_22),    p_2 = b_2² (l_22 − l_12²/l_11)

are called the partial regression sums of squares, where b_1, b_2 are the regression coefficients and l_11, l_12, l_22 are shown in (2). If p_1 > p_2, then x_1 is the main factor and x_2 the secondary one.
[ t value ]

    t_1 = √(p_1)/s,    t_2 = √(p_2)/s

are called the t values of x_1 and x_2 respectively, where s is the residual standard deviation and p_1, p_2 are the partial regression sums of squares. The larger the t value, the more important the factor. By experience: when t_i > 1, the factor x_i has a certain influence on y; when t_i > 2, x_i is regarded as an important factor; when t_i < 1, x_i has little effect on y and may be ignored, i.e. excluded from the regression calculation.
[ Binary linear regression calculation table ] With the simplified data x_ki (k = 1, 2), the calculation is arranged in a table with one row per observation (i = 1, 2, …, n) and columns

    x_1i,  x_2i,  y_i,  x_1i²,  x_2i²,  x_1i x_2i,  x_1i y_i,  x_2i y_i,  y_i²

whose column totals give the sums needed for x̄_1, x̄_2, ȳ and for l_11, l_22, l_12, l_10, l_20, l_00.
From these sums, b_1 and b_2 are calculated according to (2) and (3), and b_0 from the constant-term formula, giving the regression equation

    ŷ = b_0 + b_1 x_1 + b_2 x_2

One then continues to calculate the multiple correlation coefficient R, the standard regression coefficients B_1 and B_2, the partial regression sums of squares p_1, p_2, and the t values t_1 and t_2, and performs the binary regression analysis based on these quantities.

For binary nonlinear regression problems, an appropriate change of variables can be made so that the new variables are linearly related, after which the above regression analysis applies.
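A sketch of the two-variable computation with NumPy; the data are invented, chosen to lie near y = 1 + 2 x_1 + x_2:

```python
import numpy as np

# Binary linear regression y ~ b0 + b1*x1 + b2*x2, solving the
# normal equations in the l_jk sums of products of deviations.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.0, 6.1, 11.0, 11.9, 17.0, 18.1])
n = len(y)

d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
l11, l22, l12 = (d1*d1).sum(), (d2*d2).sum(), (d1*d2).sum()
l10, l20, l00 = (d1*dy).sum(), (d2*dy).sum(), (dy*dy).sum()

# Normal equations (3) and the constant term
b1, b2 = np.linalg.solve([[l11, l12], [l12, l22]], [l10, l20])
b0 = y.mean() - b1*x1.mean() - b2*x2.mean()

R = np.sqrt((b1*l10 + b2*l20) / l00)              # multiple correlation coefficient
s = np.sqrt((l00 - b1*l10 - b2*l20) / (n - 3))    # residual standard deviation
print(f"y = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2, R = {R:.4f}, s = {s:.4f}")
```

From the same quantities one can go on to compute B_1, B_2, p_1, p_2, and the t values by the formulas above.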
6. Multiple linear regression
Consider the relationship between the independent variables x_1, x_2, …, x_m and the dependent variable y. Make n experiments with observed values (x_1i, x_2i, …, x_mi; y_i) (i = 1, 2, …, n), and let

    x̄_j = (1/n) Σ_i x_ji    (j = 1, 2, …, m),    ȳ = (1/n) Σ_i y_i
    l_jk = Σ_i (x_ji − x̄_j)(x_ki − x̄_k)    (j, k = 1, 2, …, m)
    l_jy = Σ_i (x_ji − x̄_j)(y_i − ȳ)    (j = 1, 2, …, m)
    l_yy = Σ_i (y_i − ȳ)²

Set up the matrix

    L = (l_jk)_{m×m}

Its inverse matrix is denoted

    C = L^{−1} = (c_jk)_{m×m}
[ Regression equation ]

    ŷ = b_0 + b_1 x_1 + b_2 x_2 + … + b_m x_m

where b_1, …, b_m are the regression coefficients; written as a vector,

    b = (b_1, b_2, …, b_m)^T = L^{−1} l_y

in which

    l_y = (l_1y, l_2y, …, l_my)^T

The constant term is

    b_0 = ȳ − b_1 x̄_1 − b_2 x̄_2 − … − b_m x̄_m
[ Multiple correlation coefficient ]

    R = √( (Σ_{j=1}^{m} b_j l_jy) / l_yy )

[ Residual standard deviation ]

    s = √( (l_yy − Σ_{j=1}^{m} b_j l_jy) / (n − m − 1) )
[ ANOVA table for multiple linear regression ]

    Regression:  sum of squares U = Σ_{j=1}^{m} b_j l_jy,  degrees of freedom m,        mean square U/m
    Residual:    sum of squares Q = l_yy − U,              degrees of freedom n − m − 1, mean square Q/(n − m − 1)
    Total:       sum of squares l_yy,                      degrees of freedom n − 1

    Statistic: F = (U/m) / (Q/(n − m − 1)),  critical value F_α(m, n − m − 1)

Statistical inference: when F > F_α(m, n − m − 1), the regression is considered significant and the linear correlation close; when F ≤ F_α(m, n − m − 1), the regression is considered insignificant and the linear correlation not close.
[ Standard regression coefficients and partial regression sums of squares ]

    Standard regression coefficients:    B_j = b_j √(l_jj / l_yy)    (j = 1, 2, …, m)
    Partial regression sums of squares:  p_j = b_j² / c_jj           (j = 1, 2, …, m)

[ t value ]

    t_j = √(p_j)/s    (j = 1, 2, …, m)
The multiple linear regression analysis proceeds as in the binary case, but the amount of calculation is larger and is best carried out on a computer.
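The matrix-form computation above can be sketched with NumPy; the data are invented, generated from y = 1 + 2x_1 − x_2 + 0.5x_3 plus a small fixed perturbation:

```python
import numpy as np

# Multiple linear regression in matrix form: b = L^{-1} l_y, with
# U = sum_j b_j*l_jy the regression sum of squares (df = m) and
# F = (U/m) / (Q/(n-m-1)) the test statistic.
X = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 4.0, 1.0],
              [4.0, 3.0, 2.0], [5.0, 6.0, 7.0], [6.0, 5.0, 8.0],
              [7.0, 8.0, 5.0], [8.0, 7.0, 6.0]])
eps = np.array([0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.02, -0.02])
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + eps
n, m = X.shape

Xc = X - X.mean(axis=0)            # centered predictors
yc = y - y.mean()
L = Xc.T @ Xc                      # matrix of the l_jk
ly = Xc.T @ yc                     # vector of the l_jy
b = np.linalg.solve(L, ly)         # regression coefficients b_1..b_m
b0 = y.mean() - X.mean(axis=0) @ b # constant term

lyy = yc @ yc
U = b @ ly                         # regression sum of squares
Q = lyy - U                        # residual sum of squares
F = (U / m) / (Q / (n - m - 1))
print(b0, b, F)
```

The inverse C = L^{−1} (here obtainable via `np.linalg.inv(L)`) supplies the c_jj needed for the partial regression sums of squares p_j = b_j²/c_jj.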