Lecture 12

(Chapter 12 continued)

 

Regression and Correlation Techniques:

The population regression equation is written as:

$Y_i = A + B X_i + e_i$

where $e_i$ is the error term.

The sample regression equation is written as:

$\hat{Y}_i = a + b X_i$.

The coefficient estimators a and b are obtained by the method of least squares:

$b = \dfrac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum X_i^2 - (\sum X_i)^2}$

$a = \dfrac{\sum Y_i}{n} - b \, \dfrac{\sum X_i}{n}$

Given data on Y and X, we know how to compute a and b.
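As a minimal sketch of this computation, the two least squares formulas can be coded directly. The data below are made up for illustration (they are not from the textbook), and the function name least_squares is an assumption of this sketch.

```python
def least_squares(x, y):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: b = [n*SumXY - SumX*SumY] / [n*SumX^2 - (SumX)^2]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = SumY/n - b*SumX/n
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares(x, y)
print(f"a = {a:.4f}, b = {b:.4f}")
```

For these data the sums are 15 for X, 20 for Y, 66 for the products XY, and 55 for the squares of X, so b = 30/50 = 0.6 and a = 4 - 0.6(3) = 2.2 by hand, which is what the sketch should reproduce.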

 

Characteristics of Least Squares Estimates:

 

Ideally, the analyst would like to know the population regression line. However, this line cannot be calculated, because the analyst has only the sample observations to work with. The best the analyst can do is to calculate the sample regression line and use it as an estimate of the population regression line.

The statistics a and b are estimators of A and B, the constants in the population regression equation. These least squares estimators have the following desirable properties:

1)   Unbiasedness - the sampling distributions of a and b have means A and B, respectively.

2)   Efficiency - under the assumptions of the regression model, the Least Squares (LS) estimators have the smallest standard deviation among all unbiased estimators that are linear functions of the Yi.

3)   Consistency - as the sample size becomes larger and larger, the value of a homes in on A and the value of b homes in on B.

Because of these properties, the LS estimators are widely used.
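The unbiasedness property can be illustrated (not proved) with a small simulation. The sketch below assumes a hypothetical population line with A = 2, B = 0.5 and normally distributed errors with standard deviation 1; none of these values come from the text. It draws many samples, fits the least squares line to each, and averages the estimates.

```python
import random


def least_squares(x, y):
    """Return (a, b) for the sample regression line Yhat = a + b*X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


random.seed(0)                  # fixed seed so the run is repeatable
A, B, sigma = 2.0, 0.5, 1.0     # hypothetical population values
x = list(range(1, 11))          # fixed X values 1, 2, ..., 10
reps = 2000                     # number of simulated samples

a_vals, b_vals = [], []
for _ in range(reps):
    # Generate Y from the population equation Y = A + B*X + e
    y = [A + B * xi + random.gauss(0.0, sigma) for xi in x]
    a, b = least_squares(x, y)
    a_vals.append(a)
    b_vals.append(b)

mean_a = sum(a_vals) / reps
mean_b = sum(b_vals) / reps
print(f"average a = {mean_a:.3f} (A = {A})")
print(f"average b = {mean_b:.3f} (B = {B})")
```

Individual estimates differ from sample to sample, but their averages should land close to A and B, which is what unbiasedness says about the means of the sampling distributions.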

 

The Standard Error of Estimate 

Recall from Diagram 12.3 that the standard deviation of the conditional probability distribution of $Y_i$ is assumed to be the same regardless of the value of $X_i$. This standard deviation, denoted by $\sigma_e$, is a measure of the amount of scatter about the regression line in the population. If $\sigma_e$ is large, there is much scatter.

The sample statistic used to estimate $\sigma_e$ is the standard error of estimate, $s_e$. It is given by

$s_e = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2}}$

 

The above formula can be rewritten in an equivalent form that is easier to calculate:

$s_e = \sqrt{\dfrac{\sum Y_i^2 - a \sum Y_i - b \sum X_i Y_i}{n-2}}$
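A short numerical check may help here. The sketch below uses made-up data (not from the textbook) and computes the standard error of estimate twice: once from the residuals, and once from the computational form that uses only the sums of Y squared, Y, and XY. The variable names are assumptions of this sketch.

```python
import math

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Least squares estimates of the slope and intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Definition: square root of the sum of squared residuals over n - 2
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_definition = math.sqrt(residual_ss / (n - 2))

# Computational form: uses only sums already needed for a and b
se_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(f"s_e from residuals = {se_definition:.4f}")
print(f"s_e from sums      = {se_shortcut:.4f}")
```

For these data a = 2.2 and b = 0.6, the sum of squared residuals is 2.4, and both routes give the square root of 2.4/3 = 0.8, up to rounding.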

 

Do exercises 12.8 (a-b), 12.9 (a-b), 12.10 (a), 12.11 (a), 12.12 (a), and 12.13 (a – e).

 

The coefficient of determination       

 

Once the regression line has been fitted, one would like to know how well the line fits the data.

Look at Figure 12.9.

One natural measure is the standard error of estimate $s_e$. But this measure should be judged in relation to the total sum of squares of the $Y_i$, which is defined as

$\sum (Y_i - \bar{Y})^2$

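The total sum of squares can be decomposed into two parts. Each deviation from the mean splits as

$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$

Squaring both sides and summing over the sample (the cross-product term sums to zero for the least squares line) gives

$\Rightarrow \sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$

$\Rightarrow$ Total variation in the dependent variable = variation in the dependent variable not explained by the regression + variation in the dependent variable that is explained by the regression.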
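The decomposition can be verified numerically. The sketch below uses made-up data (not from the textbook), fits the least squares line, and computes the three sums of squares separately; the variable names are assumptions of this sketch.

```python
# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least squares estimates of the slope and intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

y_bar = sum_y / n
y_hat = [a + b * xi for xi in x]   # fitted values on the sample line

total_ss = sum((yi - y_bar) ** 2 for yi in y)                      # total variation
unexplained_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # not explained
explained_ss = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained

print(f"total       = {total_ss:.4f}")
print(f"unexplained = {unexplained_ss:.4f}")
print(f"explained   = {explained_ss:.4f}")
print(f"sum of parts = {unexplained_ss + explained_ss:.4f}")
```

For these data the total variation is 6, of which 2.4 is unexplained and 3.6 is explained. The identity holds only when a and b are the least squares estimates; for an arbitrary line the cross-product term need not vanish.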

 

The coefficient of determination is defined as:

$r^2 = \dfrac{\text{variation explained by regression}}{\text{total variation in } Y}$

$\phantom{r^2} = \dfrac{\text{total variation in } Y - \text{variation not explained by regression}}{\text{total variation in } Y}$

$r^2 = 1 - \dfrac{\text{variation not explained by regression}}{\text{total variation in } Y}$

It is the proportion of the variation in Y that is explained by the regression.

 

$r^2 = \dfrac{\left[ n \sum X_i Y_i - (\sum X_i)(\sum Y_i) \right]^2}{\left[ n \sum X_i^2 - (\sum X_i)^2 \right] \left[ n \sum Y_i^2 - (\sum Y_i)^2 \right]}$

 

A convenient formula to compute $r^2$ is

$r^2 = \dfrac{a \sum Y_i + b \sum X_i Y_i - \frac{1}{n} (\sum Y_i)^2}{\sum Y_i^2 - \frac{1}{n} (\sum Y_i)^2}$
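As a check on the algebra, the sketch below computes the coefficient of determination three ways on made-up data (not from the textbook): from the definition, from the formula in raw sums, and from the convenient form. The convenient form is written here with explained variation equal to a times the sum of Y, plus b times the sum of XY, minus (1/n) times the squared sum of Y, which is what remains when the unexplained variation is subtracted from the total. The variable names are assumptions of this sketch.

```python
# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Least squares estimates of the slope and intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# (1) Definition: 1 - unexplained variation / total variation
y_bar = sum_y / n
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
total = sum((yi - y_bar) ** 2 for yi in y)
r2_definition = 1 - unexplained / total

# (2) Formula in raw sums
r2_sums = (n * sum_xy - sum_x * sum_y) ** 2 / (
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# (3) Convenient form: explained variation over total variation
r2_convenient = (a * sum_y + b * sum_xy - sum_y ** 2 / n) / (
    sum_y2 - sum_y ** 2 / n
)

print(f"r^2 (definition) = {r2_definition:.4f}")
print(f"r^2 (raw sums)   = {r2_sums:.4f}")
print(f"r^2 (convenient) = {r2_convenient:.4f}")
```

For these data all three routes give 0.6, so 60 percent of the variation in Y is explained by the regression.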