Chapter 8: Linear Regression

In the previous chapter we learned how to make a scatterplot of two quantitative variables, to check the conditions for correlation, and to compute the correlation. Now we go a few steps further, to find and interpret the equation of a linear model that will best fit the data, and examine the residuals. As before, we'll concentrate here on using technology to perform the computations.

Regression on the TI-84

Let's continue working with the size and 2007 assessed value data from the property tax data set (in the houses.txt file) that we considered in the Chapter 7 Resources. Follow the previous instructions to use LinReg(a+bx) to compute the correlation and the calculator will also give you the regression equation:

output of LinReg(a+bx) L1,L2 for size vs. assessed value

The calculator displays the equation we want as y = a + bx, but it should be written as ŷ = a + bx. The calculator can't display ŷ, so it uses y instead; you should always write ŷ when working with pencil and paper, or type "y-hat" (or "assess-hat", or whatever your response variable happens to be called followed by "-hat") when typing online. In the case of our house data, our equation should be written as:

\hat{assess} \ = \ 249 \ + \ 0.038\times size

Also note that in the text the regression equation is written not as ŷ = a + bx but rather as ŷ = b₀ + b₁x. We should do this too, so just remember to mentally convert from TI-84 notation to Statistics notation when you are working with the calculator to solve a problem.

What does the regression equation for our house data tell us? The slope is 0.038 with units of $1000/ft², or put more simply, $38 per square foot. We expect that for a house in this neighborhood, each additional square foot in size will be associated with an increase of about $38 in the assessed value, on average. If my neighbor's house is 100 square feet bigger than mine, I would predict that the assessed value of his home would be about $3800 more than the assessed value of my home. It's important to note, however, that this is just a prediction: his house will almost certainly not be valued exactly $3800 more than mine.

The intercept of 249 tells us that we would expect a house in this neighborhood of size 0 square feet to have an assessed value of about $249,000, but of course no house is 0 square feet! (You could argue that the $249,000 would represent the average value of a plot of land with no house, but there are no such properties in our data set, so this would involve extrapolation.)
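The interpretations of the slope and intercept above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the TI-84 or Data Desk workflow. The function name predict_assess and the example house sizes are made up for illustration; only the coefficients 249 and 0.038 come from the regression output.

```python
# Coefficients from the regression output above (assessed value in $1000s).
INTERCEPT = 249.0   # b0, in thousands of dollars
SLOPE = 0.038       # b1, in thousands of dollars per square foot


def predict_assess(size_sqft):
    """Predicted assessed value (in $1000s) for a house of the given size."""
    return INTERCEPT + SLOPE * size_sqft


# Hypothetical sizes, chosen only to illustrate a 100 square foot difference.
my_house = 2000
neighbor_house = my_house + 100

difference = predict_assess(neighbor_house) - predict_assess(my_house)
print(f"Predicted difference: about ${difference * 1000:,.0f}")

# The intercept is the prediction at size 0, which is an extrapolation.
print(f"Prediction at 0 square feet: ${predict_assess(0) * 1000:,.0f}")
```

Whatever starting size is chosen, the predicted difference for 100 extra square feet is the same, because it depends only on the slope.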

Graphing the Regression Line on the TI-84

Once we have the regression equation it is useful to plot this equation along with the scatterplot of the data. Before we do this, it is advisable to clear out any other equations that might be stored on the calculator. So first press the Y= key and then use the CLEAR key to clear out anything you see in the Y= menu:

press Y= then use the CLEAR key to clear any entries

Now run LinReg(a+bx) again, but after typing LinReg(a+bx) L1,L2 and BEFORE you press ENTER, type one more comma:

type an additional comma after LinReg L1,L2

Then press VARS and move the cursor right so that Y-VARS is highlighted:

press VARS and move the cursor to the right then press ENTER

Next press ENTER to get to the FUNCTION menu:

now press ENTER again

Now press ENTER again so that LinReg(a+bx) L1,L2,Y1 is displayed on the screen:

LinReg(a+bx) L1,L2,Y1

Press ENTER one more time to run LinReg(a+bx). The output will be the same as before, but if you check the Y= menu you should now see the regression equation after Y1=:

press Y= to see the regression equation

To see the regression line graphed with the scatterplot, use ZoomStat:

regression line graphed with scatterplot of data

Plotting the Residuals with the TI-84

To graph the residuals, do the following: First press STAT, then ENTER to get to the list editor. Move the cursor up to the name of the first list, then move it to the right past L6, where there should be an empty space for a new list name:

in the list editor, move the cursor up and all the way to the right

Now press 2ND and STAT (in other words, LIST) and scroll down (if necessary) until you see RESID:

press 2ND and STAT

Then press ENTER twice. The residuals should appear in this new list:

press ENTER twice to see the RESID list

Now whenever you run LinReg(a+bx) the residuals for the new data set will automatically be stored in this list.

To create a scatterplot of the residuals, set up a plot in the STAT PLOT menu (you might want to turn off Plot1 and use Plot2 for the residual plot) with L1 (or whatever list contains your explanatory variable) for the Xlist and RESID for Ylist. (Follow the steps above to get to the name of the RESID list in the LIST menu.)

set up Plot2 as shown, after turning off Plot1

Finally, use ZoomStat to see the residual plot:

plot of observed sizes vs. residuals

The plot looks reasonably boring, with no apparent pattern, which confirms our belief that the Straight Enough Condition is satisfied.
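For readers curious about what the calculator stores in RESID, here is a minimal Python sketch that fits a least-squares line and computes the residuals (observed value minus predicted value). The data values below are invented purely for illustration and are not taken from houses.txt; the variable names are likewise assumptions.

```python
# Invented toy data (NOT from houses.txt): sizes in square feet,
# assessed values in thousands of dollars.
sizes = [1200, 1500, 1800, 2100, 2400]
values = [295, 306, 318, 327, 341]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(values) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, values))
sxx = sum((x - mean_x) ** 2 for x in sizes)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed value - predicted value, one per data point.
residuals = [y - (intercept + slope * x) for x, y in zip(sizes, values)]

for x, e in zip(sizes, residuals):
    print(f"size {x}: residual {e:+.2f}")
```

Plotting these residuals against the sizes gives the same kind of picture as the TI-84 residual plot. For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), so what matters in the plot is the pattern, not the average.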

Regression with Data Desk

Regression equations and residuals with Data Desk are even easier. Once you have created a scatterplot of the data (as in the Chapter 7 Resources), click the hyperview menu of the scatterplot window and select Regression of assess vs size:

click the hyperview menu on the scatterplot window

You should then see the Data Desk regression output:

regression output in Data Desk

Most of this information will remain a mystery to us until later in the course, but notice the numbers in the lower-left corner:

closeup of lower-left corner of regression output

These are the intercept and slope of the regression line, giving us the same regression equation we got in the TI-84:

\hat{assess} \ = \ 249 \ + \ 0.038\times size

There are two other quantities that we will use at the present time. We see that R² = 67.2%:

closeup of upper-left corner of regression output

This tells us that about 67% of the variability in assessed value of the house can be explained by the differences in the sizes of the houses. Notice that R² = 67.2% = 0.672 ≈ (0.820)² = r², where r is the correlation that we computed previously. By tradition, we use a lowercase r for the correlation and an uppercase R for R², which is called the coefficient of determination in some texts (we'll usually just call it R² or "R-squared").
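The relationship between the correlation and R-squared is easy to verify by squaring the correlation. A minimal Python sketch follows; the variable names are assumptions, and the value 0.820 is the correlation quoted above.

```python
r = 0.820            # correlation between size and assessed value, as quoted above
r_squared = r ** 2   # for a regression with one explanatory variable, R-squared = r squared

print(f"r^2 = {r_squared:.3f}, or about {r_squared:.1%}")
```

The small gap between this value and the 67.2% reported by Data Desk comes from rounding r to three decimal places.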

We can also see from the Data Desk regression output that s_e = 9.481, or $9,481. This quantity (called just s on the calculator and computer) is the standard deviation of the residuals; the closer the residuals are to 0, the better the model fits the data, and the smaller s_e will be. Since the residuals have the same units as the response (or y) variable (in this case the assessed value), we can compare s_e to s_y, the standard deviation of the assessed values. Using the TI-84 (1-Var Stats L2) or Data Desk you can compute s_y ≈ 15.9, or $15,900. The standard deviation of the residuals is not that big compared with the standard deviation of the assessed values, which indicates that there should be only a moderate amount of scatter on either side of the regression line (of course we can also see this just by looking at the scatterplot!). Hence the regression equation should do a decent job of predicting the assessed value of these houses based on their size.
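One way to make "not that big" concrete is to express the standard deviation of the residuals as a fraction of the standard deviation of the assessed values. The short Python sketch below does this with the two values quoted above; the variable names are assumptions, and the ratio is offered only as an informal comparison, not as a statistic from the text.

```python
s_e = 9.481   # standard deviation of the residuals, in $1000s
s_y = 15.9    # standard deviation of the assessed values, in $1000s

ratio = s_e / s_y
print(f"s_e is about {ratio:.0%} of s_y")
```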

Speaking of the residuals, they are just a click away in Data Desk. Click the hyperview menu of the regression window and select Scatterplot residual vs predicted:

hyperview menu of regression output window

This gives us a slightly different scatterplot than the TI-84 (where we graphed the residuals vs. the explanatory data values) but it reveals the same thing:

scatterplot of predicted values vs. residuals

As before, this is a reasonably boring plot, with no apparent pattern, which confirms our belief that the Straight Enough Condition is satisfied.

Homework

Work the following exercises in Chapter 8: 3, 7, 9, 11, 13, 15, 17, 19, 23, 31, 33, 35, 37 and 41.

Errata

This is more of an omission than an error, but the full data set for the Burger King menu items discussed on pp. 166–170 can be found in the data set folder on the CD and on the Intro Stats textbook Web site (look for the file called Ch08_BK_menu_items.txt).

In the sidebar at the very bottom of page 169, the last line should read r = 0.83 (not r = 0.86).

In exercise #27d, the standard deviation of the math scores should be 98.1 (not 96.1).

ActivStats

Work the activities on pages 8-1 through 8-3 in the ActivStats lesson book, as time permits.

Quiz

The Chapter 8 Quiz offers additional practice with computing and interpreting regression coefficients and correlation.

Additional Resources

Describing Relationships
Episode 8 from Against All Odds features a discussion of the linear regression model, while Episode 9 discusses the meaning of R².
Carnegie Mellon: Introduction to Statistics
Carnegie Mellon's open source statistics course includes a lesson called "Examining Relations" that includes a discussion of linear regression (see Case III of Unit 2, Module 2).
Sofia: Elementary Statistics
Lesson 12.3 of the Sofia Open Content Initiative's Elementary Statistics course includes a discussion of the regression equation and Lesson 12.5 discusses making predictions. (Some of the terminology may be unfamiliar here since this course covers regression far later in the game than we do.)
TI-83 Resource: Linear Regression
Instructions for using the TI-83 for regression analysis.
Scatterplot, correlation and regression on the TI-83/84
Instructions for using LinReg on the TI-84.
LinReg Flash tutorial
A Flash tutorial on using the TI-83 for regression analysis, using data about the Seattle Mariners. (Ignore the discussion of "critical values in Table A-6.")
Least Squares
A Java applet that helps visualize the meaning of least squares regression.
Least Squares Down and Dirty
An exposition of the algebra behind the least squares regression formulas.