Chapter 9: Regression Wisdom
Chapter 9 is largely a review of the things that might go wrong when you compute correlations and perform regression analyses, with some important new warnings about what not to do in certain situations. As such, there is really nothing new with regard to the TI-84 or Data Desk, so let's work through a complete regression analysis problem using everything we've learned so far. I will work through the problem using Data Desk, but you can (and should) follow along making the same plots and computations using the TI-84.
Used Cars
The following data includes the year, make, model, mileage (in thousands of miles) and asking price (in US dollars) for each of 13 used Honda Odyssey minivans. The data was collected from the Web site of the Seattle P-I on April 25, 2005.
| year | make | model | mileage | price |
| 2004 | Honda | Odyssey EXL | 20 | 26900 |
| 2004 | Honda | Odyssey EX | 21 | 23000 |
| 2002 | Honda | Odyssey | 33 | 17500 |
| 2002 | Honda | Odyssey | 41 | 18999 |
| 2001 | Honda | Odyssey EX | 43 | 17200 |
| 2001 | Honda | Odyssey EX | 67 | 18995 |
| 2000 | Honda | Odyssey LX | 46 | 13900 |
| 2000 | Honda | Odyssey EX | 72.4 | 15250 |
| 2000 | Honda | Odyssey EX | 82 | 13200 |
| 2000 | Honda | Odyssey | 99 | 11000 |
| 1999 | Honda | Odyssey | 71 | 13900 |
| 1998 | Honda | Odyssey | 85 | 8350 |
| 1995 | Honda | Odyssey EX | 100 | 5800 |
You can download this data set in a tab-delimited text format by right-clicking this link. Let's start by stating the W's:
Who: 13 used Honda Odyssey minivans.
Cases: Each minivan is a case.
What: Model year, make, model, mileage and asking price.
Variable: Model Year; Type: Quantitative; Units: Years
Variable: Make; Type: Categorical
Variable: Model; Type: Categorical
Variable: Mileage; Type: Quantitative; Units: Thousands of miles
Variable: Asking price; Type: Quantitative; Units: US dollars
When: April 25, 2005
Where: Not specified, but minivans were probably for sale in the greater Seattle area since they were being advertised on the Web site of a Seattle newspaper.
Why: Not specified.
How: The information was collected from the Web site of the Seattle P-I. It is not clear if these were all of the used Honda Odyssey minivans advertised for sale on that date, or if some unspecified method was used to select these 13 minivans.
Before computing the correlation or proceeding with a regression analysis we need to check the three correlation conditions:
Quantitative Variables Condition: There are three quantitative variables (year, mileage and price). Either the year or the mileage might explain the price (at least in part) so it would make sense to create a scatterplot of year vs. price or mileage vs. price. We choose to proceed with mileage vs. price, where mileage is the explanatory variable and price is the response variable.
Straight Enough Condition: We now construct a scatterplot:

The plot looks reasonably straight, so the condition appears to be satisfied.
Outlier Condition: There are no significant outliers, so the condition appears to be satisfied.
The scatterplot reveals a negative, linear association between mileage and price, with only a moderate amount of scatter. It is now safe to compute the correlation (r = -0.873):

and proceed with regression analysis:

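We determine that the regression equation for price on mileage is:
predicted price = 26495 - 180 × mileage
where mileage is measured in thousands of miles and price is in US dollars.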
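As a cross-check on the regression output, the correlation and the least-squares coefficients can be recomputed directly from the table. The following is a minimal sketch in Python (not one of the tools used in this course); the variable names are illustrative, and the data are typed in by hand from the table above.

```python
from math import sqrt

# Data copied from the table: mileage in thousands of miles, asking price in US dollars.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

n = len(mileage)
mean_x = sum(mileage) / n
mean_y = sum(price) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in mileage)
syy = sum((y - mean_y) ** 2 for y in price)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price))

r = sxy / sqrt(sxx * syy)            # correlation coefficient
slope = sxy / sxx                    # dollars per thousand miles
intercept = mean_y - slope * mean_x  # predicted asking price at 0 miles

print(f"r = {r:.3f}")
print(f"predicted price = {intercept:.0f} + ({slope:.0f}) * mileage")
```

Rounded to the nearest dollar, the coefficients should agree with the intercept and slope discussed in this section.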
The intercept is 26495, which means we would predict an asking price of about $26,495 for a used minivan with 0 miles. This doesn't really make sense (unless we interpret it as the asking price for a minivan that is sold right after it is driven off the new-car lot). In any case, interpreting the intercept as a meaningful predicted asking price would involve extrapolation, since all of the minivans in our sample have at least 20,000 miles.
The slope is -180, with units of $/(1000 miles); in other words, the slope is -$0.18/mile. This tells us that for each additional mile we put on a Honda Odyssey minivan, we expect the asking price to decrease by about $0.18, on average. If your used Honda Odyssey minivan has 10,000 more miles on its odometer than your neighbor's used Honda Odyssey minivan, we would predict that you would ask about $1800 less if you decided to sell it than she would.
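The unit conversion and the 10,000-mile comparison above amount to a few lines of arithmetic. Here is a short Python sketch using the rounded slope of -180 dollars per thousand miles; the variable names are illustrative.

```python
# Rounded slope from the regression: dollars per thousand miles.
slope_per_thousand = -180

# Convert to dollars per single mile.
slope_per_mile = slope_per_thousand / 1000

# Predicted difference in asking price for 10,000 extra miles.
extra_miles = 10_000
price_difference = slope_per_thousand * (extra_miles / 1000)

print(f"slope: {slope_per_mile:.2f} dollars per mile")
print(f"predicted difference for {extra_miles} extra miles: {price_difference:.0f} dollars")
```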
The regression output also tells us that R² = 0.762, which means that 76.2% of the variation in price can be explained by the least squares regression on mileage. Since we now have the regression equation, we may wish to graph it along with a scatterplot of the data:

(You can do this in Data Desk by clicking the hyperview menu on the scatterplot window and selecting Add regression line.)
We don't notice any minivans with significant residuals, or any bending of the scatterplot that was not apparent previously. However, it doesn't hurt to check a scatterplot of the residuals:

The residual scatterplot is mostly patternless, which helps confirm that we made the correct decision with regard to the Straight Enough Condition.
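For readers who want to inspect the residuals numerically rather than only in a plot, the sketch below (Python, with illustrative variable names and the data typed in from the table) refits the line, lists each residual, and recovers R² from the residual and total sums of squares.

```python
# Data copied from the table: mileage in thousands of miles, asking price in US dollars.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

n = len(mileage)
mean_x = sum(mileage) / n
mean_y = sum(price) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in mileage)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed asking price minus predicted asking price.
residuals = [y - (intercept + slope * x) for x, y in zip(mileage, price)]

sse = sum(e ** 2 for e in residuals)         # variation left over after the fit
sst = sum((y - mean_y) ** 2 for y in price)  # total variation in asking price
r_squared = 1 - sse / sst

for x, e in zip(mileage, residuals):
    print(f"{x:6.1f} thousand miles: residual {e:9.0f}")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"R^2 = {r_squared:.3f}")
```

The residuals of a least-squares line with an intercept always sum to zero (up to rounding), so the printed sum is a quick sanity check on the fit.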
We do notice from the scatterplot that there is a slight gap in mileage between 46,000 miles and 67,000 miles. It is unclear from the data whether this gap is due to chance, or if it might reveal that there are two groups of minivans represented here. (Perhaps these groups might be newer minivans that were leased and reaching the end of their lease, and older minivans that were purchased outright and driven for a longer period of time; however, we have no way of knowing whether or not this is true without further examination of the minivans in question.)
We also see from the regression output that se = 2,914. Computing the summary statistics for the response variable (asking price):

we see that sy = $5,723. Comparing this with se = $2,914, we note that the standard deviation of the residuals is smaller: there is less variation in the residuals than there is in the asking prices alone.
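Both standard deviations can be recomputed from the table. The Python sketch below uses illustrative variable names and assumes the residual standard deviation divides by n - 2 while the ordinary standard deviation divides by n - 1. That choice of divisors reproduces the two values quoted in this section.

```python
from math import sqrt

# Data copied from the table: mileage in thousands of miles, asking price in US dollars.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

n = len(mileage)
mean_x = sum(mileage) / n
mean_y = sum(price) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in mileage)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Sum of squared residuals and total sum of squares for asking price.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(mileage, price))
sst = sum((y - mean_y) ** 2 for y in price)

s_e = sqrt(sse / (n - 2))  # standard deviation of the residuals
s_y = sqrt(sst / (n - 1))  # standard deviation of the asking prices

print(f"s_e = {s_e:.0f}")
print(f"s_y = {s_y:.0f}")
```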
Since we now know that the linear model is appropriate and reasonably strong, we may wish to use it to make predictions. For example, if we had a used Honda Odyssey minivan with 50,000 miles that we wanted to sell, we would predict that the asking price would be about 26495 - 180 × 50 = 17495, so we might decide to ask $17,495 for our minivan when placing a classified ad in the Seattle P-I.
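The prediction above, along with the earlier warning about extrapolation, can be wrapped in a small helper. This is only a sketch in Python: the function names `predict_price` and `is_extrapolation` are hypothetical, the coefficients are the rounded values from the regression, and the mileage range is the range observed in the 13 minivans.

```python
# Rounded regression coefficients (mileage in thousands of miles, price in US dollars).
INTERCEPT = 26495
SLOPE = -180

# Smallest and largest mileage in the sample, in thousands of miles.
MIN_MILEAGE, MAX_MILEAGE = 20, 100


def predict_price(mileage_thousands):
    """Predicted asking price in dollars for a given mileage in thousands of miles."""
    return INTERCEPT + SLOPE * mileage_thousands


def is_extrapolation(mileage_thousands):
    """True when the mileage lies outside the range covered by the sample."""
    return not (MIN_MILEAGE <= mileage_thousands <= MAX_MILEAGE)


for m in (50, 0):
    note = " (extrapolation)" if is_extrapolation(m) else ""
    print(f"{m} thousand miles: ${predict_price(m):,}{note}")
```

A prediction flagged as an extrapolation, such as the intercept at 0 miles, should be treated with caution because the sample contains no minivans in that mileage range.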
Homework
Work the following exercises in Chapter 9: 1, 3, 5, 7, 9, 11, 13, 15, 19, 23, 25 and 27.
Errata
The scatterplot of the data in exercise #4 shows the years listed as 75, 80, 85, etc., but the data set on the CD (as well as the other information in this exercise and the related exercise #2) lists the years as 1975, 1980, 1985, etc.
ActivStats
Work the activities on pages 9-1 through 9-2 in the ActivStats lesson book, as time permits. We won't compute the statistic related to "leverage," but it is worth working through the activity associated with this term to understand the concept of leverage and which points are likely to be highly influential.
