Lecture 12
(Chapter 12 continued)
Regression and Correlation
Techniques:
The population regression equation is written as:
$$Y_i = A + B X_i + e_i$$
where $e_i$ is the error term.
The sample regression equation is written as:
$$\hat{Y}_i = a + b X_i$$
The coefficient estimators a and b are obtained by the method of least squares:
$$b = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$$
$$a = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n}$$
Given data on Y and X, we know how to compute a and b.
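As a minimal sketch of that computation, the two least squares formulas can be coded directly from the sums. The function name `least_squares` and the five data points are hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
def least_squares(xs, ys):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: {n*SumXY - SumX*SumY} / {n*SumX^2 - (SumX)^2}
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of Y minus b times mean of X
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data (not from the textbook exercises).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")  # prints: a = 2.2000, b = 0.6000
```

For these numbers n = 5, ΣX = 15, ΣY = 20, ΣXY = 66 and ΣX² = 55, so b = 30/50 = 0.6 and a = 4 - 0.6(3) = 2.2.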
Characteristics of Least Squares Estimates:
Ideally, the analyst would like to know the population regression line. However, this line cannot be calculated, because the statistician has only the sample observations to work with. The best the statistician can do is to calculate the sample regression line and use it as an estimate of the population regression line.
The statistics – a and b – are estimators of A and B, the constants in the population regression. These least squares estimators have the following desirable properties:
1) Unbiasedness - the sampling distributions of a and b have means A and B respectively.
2) Efficiency - Among all unbiased estimators that are linear functions of the Yi, the Least Squares (LS) estimators have the minimum standard deviation.
3) Consistency - As the sample size becomes larger and larger, the value of a homes in on A and the value of b homes in on B.
Because of these properties, the LS estimators are widely used.
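Unbiasedness and consistency can be illustrated, though not proved, with a small simulation. The sketch below assumes a hypothetical population line with A = 2, B = 0.5, normally distributed errors with standard deviation 1, and X values fixed at 1, 2, ..., n; none of these values come from the textbook. It draws many samples, fits the LS line to each, and compares the slope estimates for a small and a larger sample size. It does not attempt to demonstrate efficiency.

```python
import random
import statistics


def least_squares(xs, ys):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical population parameters (assumptions for illustration only).
A, B, SIGMA = 2.0, 0.5, 1.0
rng = random.Random(12345)  # fixed seed so the run is repeatable


def simulate_slopes(n, reps):
    """Draw `reps` samples of size n and return the fitted slopes b."""
    xs = [float(i) for i in range(1, n + 1)]
    slopes = []
    for _ in range(reps):
        ys = [A + B * x + rng.gauss(0.0, SIGMA) for x in xs]
        slopes.append(least_squares(xs, ys)[1])
    return slopes


small = simulate_slopes(n=10, reps=2000)
large = simulate_slopes(n=40, reps=2000)

# Unbiasedness: the average of b over many samples should be close to B.
print("mean of b, n=10:", round(statistics.mean(small), 3))
# Consistency: the spread of b around B should shrink as n grows.
print("sd of b,   n=10:", round(statistics.stdev(small), 3))
print("sd of b,   n=40:", round(statistics.stdev(large), 3))
```

In a run like this, the average slope should land near 0.5, and the standard deviation of the slopes should be noticeably smaller for n = 40 than for n = 10.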
Recall from Diagram 12.3 that the standard deviation of the conditional probability distribution of $Y_i$ is assumed to be the same regardless of the value of $X_i$. This standard deviation, denoted by $\sigma_e$, is a measure of the amount of scatter about the regression line in the population. If $\sigma_e$ is large, there is much scatter.
The sample statistic used to estimate $\sigma_e$ is the standard error of estimate, $s_e$. It is given by
$$s_e = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}}$$
The above formula can be rewritten in a form that is easier to calculate:
$$s_e = \sqrt{\frac{\sum Y_i^2 - a\sum Y_i - b\sum X_i Y_i}{n-2}}$$
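The two expressions for the standard error of estimate can be checked against each other numerically. This is a sketch on hypothetical data (the five points and the helper `least_squares` are illustrative assumptions): the first calculation uses the residuals Yi - Yihat directly, and the second uses only the sums ΣYi², ΣYi and ΣXiYi.

```python
import math


def least_squares(xs, ys):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data (not from the textbook exercises).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
a, b = least_squares(xs, ys)

# Definition: sum of squared residuals divided by n - 2, then square root.
fitted = [a + b * x for x in xs]
sse_def = sum((y - yhat) ** 2 for y, yhat in zip(ys, fitted))
se_def = math.sqrt(sse_def / (n - 2))

# Shortcut: SumY^2 - a*SumY - b*SumXY gives the same sum of squares.
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sse_short = sum_y2 - a * sum_y - b * sum_xy
se_short = math.sqrt(sse_short / (n - 2))

print(f"definition: {se_def:.4f}")   # prints: definition: 0.8944
print(f"shortcut:   {se_short:.4f}")  # prints: shortcut:   0.8944
```

Here the residuals are -0.8, 0.6, 1.0, -0.6 and -0.2, whose squares sum to 2.4, and the shortcut gives 86 - 44 - 39.6 = 2.4 as well, so both routes yield the square root of 2.4/3.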
Do exercises 12.8 (a-b), 12.9 (a-b), 12.10 (a), 12.11 (a), 12.12 (a), and 12.13 (a – e).
The coefficient of determination
Once the regression line has been fitted, one would like to know how well the line fits the data.
Look at Figure 12.9.
One natural measure is the standard error of estimate $s_e$. But this measure should be judged in relation to the total sum of squares of $Y_i$, which is defined as
$$\sum (Y_i - \bar{Y})^2$$
The total sum of squares can be decomposed into two parts. Each deviation from the mean splits as
$$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$
Squaring both sides and summing over the observations (the cross-product term sums to zero for the least squares line) gives
$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$
That is: total variation in the dependent variable = variation in the dependent variable not explained by the regression + variation in the dependent variable that is explained by the regression.
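The decomposition can be verified numerically on a small data set. The sketch below uses hypothetical data and a helper named `least_squares` (both illustrative assumptions) to compute the three sums of squares separately and confirm that the unexplained and explained parts add up to the total.

```python
def least_squares(xs, ys):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data (not from the textbook exercises).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
y_bar = sum(ys) / len(ys)
fitted = [a + b * x for x in xs]

# Total variation: Sum (Yi - Ybar)^2
total = sum((y - y_bar) ** 2 for y in ys)
# Variation not explained by the regression: Sum (Yi - Yihat)^2
unexplained = sum((y - yhat) ** 2 for y, yhat in zip(ys, fitted))
# Variation explained by the regression: Sum (Yihat - Ybar)^2
explained = sum((yhat - y_bar) ** 2 for yhat in fitted)

print(f"total       = {total:.4f}")        # prints: total       = 6.0000
print(f"unexplained = {unexplained:.4f}")  # prints: unexplained = 2.4000
print(f"explained   = {explained:.4f}")    # prints: explained   = 3.6000
```

The identity holds only when a and b are the least squares values; for an arbitrary line the cross-product term does not vanish and the two parts need not sum to the total.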
The coefficient of determination is defined as:
$$r^2 = \frac{\text{variation explained by regression}}{\text{total variation in } Y}$$
$$r^2 = \frac{\text{total variation in } Y - \text{variation not explained by regression}}{\text{total variation in } Y}$$
$$r^2 = 1 - \frac{\text{variation not explained by regression}}{\text{total variation in } Y}$$
It is the proportion of the variation in Y that is explained by the regression.
$$r^2 = \frac{[n\sum X_i Y_i - (\sum X_i)(\sum Y_i)]^2}{[n\sum X_i^2 - (\sum X_i)^2]\,[n\sum Y_i^2 - (\sum Y_i)^2]}$$
A convenient formula to compute $r^2$ is
$$r^2 = \frac{a\sum Y_i + b\sum X_i Y_i - \frac{1}{n}(\sum Y_i)^2}{\sum Y_i^2 - \frac{1}{n}(\sum Y_i)^2}$$
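As a cross-check, r² can be computed three ways on the same data: from the definition 1 - (unexplained variation / total variation), from the formula in terms of the raw sums, and from the shortcut whose numerator is the explained variation written as aΣYi + bΣXiYi - (1/n)(ΣYi)². The data and the helper `least_squares` are hypothetical, used only for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data (not from the textbook exercises).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
a, b = least_squares(xs, ys)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
y_bar = sum_y / n

# 1) Definition: 1 - unexplained variation / total variation.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
total = sum((y - y_bar) ** 2 for y in ys)
r2_def = 1 - unexplained / total

# 2) Formula in terms of the raw sums of X, Y, XY, X^2 and Y^2.
r2_sums = (n * sum_xy - sum_x * sum_y) ** 2 / (
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# 3) Shortcut using a and b: explained variation over total variation.
r2_short = (a * sum_y + b * sum_xy - sum_y ** 2 / n) / (sum_y2 - sum_y ** 2 / n)

print(f"{r2_def:.4f} {r2_sums:.4f} {r2_short:.4f}")  # prints: 0.6000 0.6000 0.6000
```

For these numbers, 60 percent of the variation in Y is explained by the regression.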