Chapter 9: Regression Wisdom

Chapter 9 is largely a review of the things that might go wrong when you compute correlations and perform regression analyses, with some important new warnings about what not to do in certain situations. As such, there is really nothing new with regard to the TI-84 or Data Desk, so let's work through a complete regression analysis problem using everything we've learned so far. I will work through the problem using Data Desk, but you can (and should) follow along making the same plots and computations using the TI-84.

Used Cars

The following data includes the year, make, model, mileage (in thousands of miles) and asking price (in US dollars) for each of 13 used Honda Odyssey minivans. The data was collected from the Web site of the Seattle P-I on April 25, 2005.

year  make   model        mileage  price
2004  Honda  Odyssey EXL  20       26900
2004  Honda  Odyssey EX   21       23000
2002  Honda  Odyssey      33       17500
2002  Honda  Odyssey      41       18999
2001  Honda  Odyssey EX   43       17200
2001  Honda  Odyssey EX   67       18995
2000  Honda  Odyssey LX   46       13900
2000  Honda  Odyssey EX   72.4     15250
2000  Honda  Odyssey EX   82       13200
2000  Honda  Odyssey      99       11000
1999  Honda  Odyssey      71       13900
1998  Honda  Odyssey      85       8350
1995  Honda  Odyssey EX   100      5800

You can download this data set in a tab-delimited text format by right-clicking this link. Let's start by stating the W's:

Who: 13 used Honda Odyssey minivans.
Cases: Each minivan is a case.
What: Model year, make, model, mileage and asking price.
Variable: Model Year; Type: Quantitative; Units: Years
Variable: Make; Type: Categorical
Variable: Model; Type: Categorical
Variable: Mileage; Type: Quantitative; Units: Thousands of miles
Variable: Asking price; Type: Quantitative; Units: US dollars
When: April 25, 2005
Where: Not specified, but minivans were probably for sale in the greater Seattle area since they were being advertised on the Web site of a Seattle newspaper.
Why: Not specified.
How: The information was collected from the Web site of the Seattle P-I. It is not clear if these were all of the used Honda Odyssey minivans advertised for sale on that date, or if some unspecified method was used to select these 13 minivans.

Before computing the correlation or proceeding with a regression analysis we need to check the three correlation conditions:

Quantitative Variables Condition: There are three quantitative variables (year, mileage and price). Either the year or the mileage might explain the price (at least in part), so it would make sense to create a scatterplot of price vs. year or price vs. mileage. We choose to proceed with price vs. mileage, where mileage is the explanatory variable and price is the response variable.

Straight Enough Condition: We now construct a scatterplot:

scatterplot of price vs. mileage

The plot looks reasonably straight, so the condition appears to be satisfied.

Outlier Condition: There are no significant outliers, so the condition appears to be satisfied.

The scatterplot reveals a negative, linear association between mileage and price, with only a moderate amount of scatter. It is now safe to compute the correlation (r = -0.873):

correlation output from Data Desk

and proceed with regression analysis:

regression output from Data Desk

We determine that the regression equation for price on mileage is:

\widehat{\text{price}} = 26495 - 180 \times \text{mileage}
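If you would like to check the Data Desk output by hand, here is a minimal Python sketch (Python is not one of the tools used in this course, so treat this as an optional aside). It computes the correlation and the least squares line directly from their definitions, using the mileage and price columns from the table above. The variable names are my own.

```python
from math import sqrt
from statistics import mean

# Mileage (thousands of miles) and asking price (US dollars),
# copied in order from the table of 13 minivans.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

x_bar, y_bar = mean(mileage), mean(price)

# Sums of squared deviations and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in mileage)
syy = sum((y - y_bar) ** 2 for y in price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mileage, price))

r = sxy / sqrt(sxx * syy)          # correlation coefficient
slope = sxy / sxx                  # dollars per thousand miles
intercept = y_bar - slope * x_bar  # predicted price at 0 miles

print(f"r = {r:.3f}")
print(f"predicted price = {intercept:.0f} + ({slope:.0f}) * mileage")
```

The printed values should agree with the correlation and regression equation reported above, up to rounding.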

The intercept is 26495, which means we would predict an asking price of about $26,495 for a used minivan with 0 miles. This doesn't really make sense (unless we interpret it as the asking price for a minivan that is sold right after it is driven off the new-car lot). In any case, treating the intercept as a meaningful predicted asking price would involve extrapolation, since all of the minivans in our sample have at least 20,000 miles.

The slope is -180, with units of $/(1000 miles); in other words, the slope is -$0.18/mile. This tells us that for each additional mile we put on a Honda Odyssey minivan, we expect the asking price to decrease by about $0.18, on average. If your used Honda Odyssey minivan has 10,000 more miles on its odometer than your neighbor's used Honda Odyssey minivan, we would predict that you would ask about $1800 less than she would if you each decided to sell.

The regression output also tells us that R² = 0.762, which means that 76.2% of the variation in price can be explained by the least squares regression on mileage. Since we now have the regression equation, we may wish to graph it along with a scatterplot of the data:

scatterplot of price vs. mileage, shown with the regression line

(You can do this in Data Desk by clicking the hyperview menu on the scatterplot window and selecting Add regression line.)

We don't notice any minivans with unusually large residuals, or any bending of the scatterplot that was not apparent previously. However, it doesn't hurt to check a scatterplot of the residuals:

scatterplot of residuals vs. predicted values

The residual scatterplot is mostly patternless, which helps confirm that we made the correct decision with regard to the Straight Enough Condition.
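To see the numbers behind the residual plot, the following Python sketch lists each minivan's predicted price and residual (observed price minus predicted price). It uses the rounded coefficients from the regression equation, so the residuals will not sum to exactly zero. The variable names are my own and are not part of Data Desk or the TI-84.

```python
# Mileage (thousands of miles) and asking price (US dollars) from the table.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

# Rounded coefficients taken from the regression equation in the text.
intercept, slope = 26495, -180

predicted = [intercept + slope * x for x in mileage]
residuals = [y - y_hat for y, y_hat in zip(price, predicted)]

for x, y_hat, e in zip(mileage, predicted, residuals):
    print(f"mileage={x:>5}  predicted={y_hat:>8.0f}  residual={e:>8.0f}")

# With the exact (unrounded) coefficients this sum would be zero.
print(f"sum of residuals = {sum(residuals):.0f}")
```

Plotting the residuals against the predicted values gives the same kind of picture as the residual scatterplot above.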

We do notice from the scatterplot that there is a slight gap in mileage between 46,000 miles and 67,000 miles. It is unclear from the data whether this gap is due to chance, or if it might reveal that there are two groups of minivans represented here. (Perhaps these groups might be newer minivans that were leased and reaching the end of their lease, and older minivans that were purchased outright and driven for a longer period of time; however, we have no way of knowing whether or not this is true without further examination of the minivans in question.)
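The gap can also be found without a plot. This short Python sketch sorts the mileages and reports the widest gap between neighboring values; it is only an illustration, and the variable names are my own.

```python
# Mileage in thousands of miles, copied from the table.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]

ordered = sorted(mileage)

# Pair each mileage with the next larger one and record the gap between them.
gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]

# max() compares tuples by their first element, i.e. the gap width.
widest, lo, hi = max(gaps)
print(f"Widest gap: {widest} thousand miles, between {lo} and {hi}")
```

A wide gap by itself does not tell us whether there are two groups of minivans; it only tells us where to look.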

We also see from the regression output that s_e = $2,914. Computing the summary statistics for the response variable (asking price):

summary statistics for the price variable

we see that s_y = $5,723. Comparing this with s_e, we note that the standard deviation of the residuals is smaller: there is less variation in the residuals than there is in the asking prices alone.
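The comparison of the two standard deviations, and its connection to R², can be checked with the Python sketch below. I am assuming that the standard deviation of the residuals is computed with n - 2 in the denominator (two coefficients are estimated from the data); that choice is an assumption on my part, although it does appear to match the value in the regression output. The variable names are my own.

```python
from math import sqrt
from statistics import mean, stdev

# Mileage (thousands of miles) and asking price (US dollars) from the table.
mileage = [20, 21, 33, 41, 43, 67, 46, 72.4, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]

n = len(price)
x_bar, y_bar = mean(mileage), mean(price)

# Least squares slope and intercept, computed from the definitions.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mileage, price))
sxx = sum((x - x_bar) ** 2 for x in mileage)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(mileage, price)]
sse = sum(e ** 2 for e in residuals)        # variation left over after the fit
sst = sum((y - y_bar) ** 2 for y in price)  # total variation in price

s_y = stdev(price)           # sample standard deviation of the asking prices
s_e = sqrt(sse / (n - 2))    # standard deviation of the residuals (assumed n - 2)
r_squared = 1 - sse / sst    # fraction of the variation in price explained

print(f"s_y = {s_y:.0f}, s_e = {s_e:.0f}, R^2 = {r_squared:.3f}")
```

Because s_e is well below s_y, knowing the mileage narrows down the likely asking price considerably, which is the same story R² tells.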

Since we now know that the linear model is appropriate and reasonably strong, we may wish to use it to make predictions. For example, if we had a used Honda Odyssey minivan with 50,000 miles that we wanted to sell, we would predict that the asking price would be about 26495 - 180×50 = 17495, so we might decide to ask $17,495 for our minivan when placing a classified ad in the Seattle P-I.
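Predictions like this one are easy to wrap in a small function. The Python sketch below uses the rounded coefficients from the regression equation and also flags any mileage outside the range in our sample (20 to 100 thousand miles), since a prediction there would be an extrapolation. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
def predict_price(mileage_thousands, intercept=26495, slope=-180,
                  observed_range=(20, 100)):
    """Return (predicted asking price, True if the mileage is outside the sample range)."""
    lo, hi = observed_range
    is_extrapolation = not (lo <= mileage_thousands <= hi)
    return intercept + slope * mileage_thousands, is_extrapolation


print(predict_price(50))  # (17495, False)
print(predict_price(0))   # (26495, True): the intercept is an extrapolation
```

The second call shows why we were cautious about interpreting the intercept: no minivan in the sample has a mileage anywhere near 0.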

Homework

Work the following exercises in Chapter 9: 1, 3, 5, 7, 9, 11, 13, 15, 19, 23, 25 and 27.

Errata

The scatterplot of the data in exercise #4 shows the years listed as 75, 80, 85, etc., but the data set on the CD (as well as the other information in this exercise and the related exercise #2) lists the years as 1975, 1980, 1985, etc.

ActivStats

Work the activities on pages 9-1 through 9-2 in the ActivStats lesson book, as time permits. We won't compute the statistic related to "leverage," but it is worth working through the activity associated with this term to understand the concept of leverage and which points are likely to be highly influential.

Additional Resources

Sofia: Elementary Statistics
Lesson 12.6 of the Sofia Open Content Initiative's Elementary Statistics course includes a discussion of outliers. (Some of the terminology may be unfamiliar here since this course covers regression far later in the game than we do.)
Anscombe's data sets
Four data sets to help convince you that you should always construct a scatterplot before computing the correlation coefficient or the equation of the regression line.
Regression applet
A Java applet to help visualize the influence of outliers on the regression line.
GenderStats
Data from the World Bank on fertility rates and life expectancy for various countries, cited in exercise #24.