Chapter 5: Describing Distributions Numerically

I'll concentrate below on instructions for using the TI-84 and Data Desk to compute mean, median, standard deviation and IQR of a data set, and to draw boxplots.

More Property Taxes

For our first example, let's work with the house data set previously encountered in the Chapter 2 and 4 Resources:

house size assess lot taxes stories
20911 1561 304.0 0.20 2604 1
20912 1038 297.6 0.20 280 1
20918 1224 289.5 0.17 2353 1
20921 1232 292.8 0.17 756 1
20924 1995 314.6 0.17 2620 2
20927 1714 322.7 0.18 2632 1
20930 1832 336.1 0.18 2779 2
21003 1095 279.0 0.18 2321 1
21006 2011 319.5 0.18 2663 2
21015 1366 289.3 0.18 2415 1
21018 1292 301.4 0.18 2477 1
21023 1458 314.3 0.18 1386 1
21028 2031 320.9 0.18 2676 2
21105 1366 304.0 0.18 2473 1

Let's compute the summary statistics for the 2007 assessed value variable (the assess column, recorded in thousands of dollars).

Enter the assessed value data into a list, say L1, then press 2ND and QUIT (above the MODE key), as you did in Chapter 4 to create a histogram. Now press STAT, move the cursor to the right to highlight CALC and notice that 1-Var Stats is already highlighted:

with data in L1, press STAT, move cursor to CALC and press ENTER

Press ENTER and type L1 (2ND and then the 1 key). Your screen should look like this:

press L1 and then ENTER

Now press ENTER. The calculator will display many different values:

output of 1VarStats for assessed value data

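These values are:

x̄ = the mean of the data
Σx = the sum of the data values
Σx² = the sum of the squared data values
Sx = the sample standard deviation
σx = the population standard deviation
n = the number of data values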

So far we can see that the mean of the 2007 assessed property values for homes in my neighborhood is $306,121, with a standard deviation of $15,898. You could compute these statistics "by hand" but it would take a ridiculously long time: ALWAYS use the calculator or computer to compute summary statistics, especially the standard deviation.
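If you have Python handy, you can cross-check the calculator output. The following is only a minimal sketch using the standard library's statistics module; the variable names are my own, and the values are the assess column from the table above, in thousands of dollars. Note that statistics.stdev divides by n - 1, so it corresponds to Sx on the TI-84 rather than σx.

```python
import statistics

# 2007 assessed values in thousands of dollars (assess column of the table)
assess = [304.0, 297.6, 289.5, 292.8, 314.6, 322.7, 336.1,
          279.0, 319.5, 289.3, 301.4, 314.3, 320.9, 304.0]

n = len(assess)
mean = statistics.mean(assess)
sx = statistics.stdev(assess)  # sample standard deviation (divides by n - 1)

print(f"n = {n}")
print(f"mean = {mean:.3f} thousand dollars")
print(f"Sx = {sx:.3f} thousand dollars")
```

Multiplying the printed values by 1,000 should give the dollar amounts quoted above.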

But there's more! Use the down cursor to scroll down the screen as far as you can. You should see:

5-number summary from 1VarStats output

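We can now read off the minimum (minX = 279), the first quartile (Q1 = 292.8), the median (Med = 304), the third quartile (Q3 = 319.5) and the maximum (maxX = 336.1), all in thousands of dollars.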

We call these five quantities the 5-number summary for the data set. The median 2007 assessed value of a home in this neighborhood is $304,000. The IQR is given by IQR = Q3 - Q1 = 319.5 - 292.8 = 26.7 thousand dollars, or $26,700. Note that the TI-84 doesn't report the IQR directly, but it's a simple subtraction problem once we know Q1 and Q3.
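The 5-number summary and IQR can also be sketched in Python. The helper five_number_summary below is a hypothetical function of my own, not part of any library. It takes Q1 and Q3 to be the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd), which I believe matches the TI-84's convention; other software may interpolate quartiles differently and give slightly different answers.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of at least two values.

    Q1 and Q3 are the medians of the lower and upper halves of the
    sorted data; the middle value is excluded when n is odd.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]    # smallest half of the data
    upper = xs[-half:]   # largest half of the data
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# 2007 assessed values in thousands of dollars (assess column of the table)
assess = [304.0, 297.6, 289.5, 292.8, 314.6, 322.7, 336.1,
          279.0, 319.5, 289.3, 301.4, 314.3, 320.9, 304.0]

minimum, q1, med, q3, maximum = five_number_summary(assess)
iqr = q3 - q1

print(f"5-number summary: {minimum}, {q1}, {med}, {q3}, {maximum}")
print(f"IQR = {iqr:.1f} thousand dollars")
```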

Boxplots with the TI-84

To draw a boxplot of the assessed value data, follow the instructions in the Chapter 4 Resources for making a histogram, but choose the boxplot (or modified boxplot) icon instead of the histogram icon:

select the modified boxplot icon in the Stat Plots menu then use ZoomStat

Then use ZoomStat to get the boxplot:

boxplot of the assessed values

We can see a bit more clearly from the boxplot that the data is skewed positively (but notice that we can't tell if the data set is unimodal or bimodal from the boxplot, so we should look at both a histogram and a boxplot whenever possible). Note again that the axis isn't labeled and no scale is indicated, so this would not be a satisfactory graph on a HW solution, exam or project.

Summary statistics from frequency tables

Recall the example from the Chapter 4 Resources with data about the number of attempts students in my Fall 2006 online class made on a 5-point quiz. We displayed the number of attempts like this:

attempts count
0 3
1 8
2 8
3 4
4 2
5 2
6 1

As before, we can enter the number of attempts (the left column) into one list (L1) and the counts into the next list (L2). Now type 1-Var Stats L1 as above, but then type , (a comma, above the 7 key) and then L2:

type 1-VarStats L1,L2 then press ENTER

Now press ENTER to get the summary statistics for the quiz attempts by the 28 students:

summary statistics for the quiz attempt data
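The frequency-table calculation can be sketched in Python as well. This is just an illustration under my own naming: it repeats each number of attempts according to its count, so that the ordinary summary functions can be applied to one value per student.

```python
import statistics

# Frequency table for the quiz: number of attempts and how many students made them
attempts = [0, 1, 2, 3, 4, 5, 6]
counts = [3, 8, 8, 4, 2, 2, 1]

# Repeat each value by its count to get one entry per student
expanded = [a for a, c in zip(attempts, counts) for _ in range(c)]

n = len(expanded)
mean = statistics.mean(expanded)
sx = statistics.stdev(expanded)   # sample standard deviation
med = statistics.median(expanded)

print(f"n = {n}")
print(f"mean = {mean:.3f}, Sx = {sx:.3f}, median = {med}")
```

Checking that n comes out to 28 is a quick way to catch a data-entry error in either list.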

Boxplots with Data Desk

To use a computer to make a boxplot, use Data Desk. Import the houses.txt data file (from the preceding link or from the Data Sets folder in the online classroom) into Data Desk, as we did in the Chapter 4 Resources. Click on the assess variable so that the variable's icon has a Y over it:

click the assess variable to designate as Y

then click on Plot and select BoxPlot Side by Side:

click Plot and then Boxplot Side by Side

You should see something like this:

boxplot of assessed value data

You can adjust the plot options by clicking on the hyperview menu (the triangle in the upper-left corner of the boxplot window) and selecting BoxPlot Options:

click the hyperview menu and select BoxPlot Options

If you see some strange shading on your boxplot, I would recommend selecting Do not display 95% C.I.'s for comparing medians:

select options as shown and click OK

since you have no idea what this means yet; you can also select Set Defaults to make this the default display option.

As with the histogram in Chapter 4, you can make the boxplot window larger by clicking on the lower right corner of the window and dragging it across the screen. The variable name in our Data Desk boxplot is labeled and a scale is indicated on the axis, which is better than the TI-84, but the units are still missing. This would be better:

assessed value boxplot with improved labels

although I again had to hack this using Photoshop.

Summary Statistics with Data Desk

To compute summary statistics of the 2007 assessed value variable, select the assess variable as Y (as before) and click Calc, then Summaries and then Reports:

click Calc then Summaries then Reports

You should see output like this:

summary statistics for the 2007 assessed values

If you don't see all of the statistics that you want, click the hyperview menu and choose Select Summary Statistics.

in the hyperview menu, click Select Summary Statistics

Select or deselect the appropriate checkboxes and click OK:

select desired summary statistics and click OK

As we saw from the calculator, the mean assessed value is $306,121 with a standard deviation of $15,898.

Median and IQR vs. Mean and Standard Deviation

Keep in mind that you should never simply compute the summary statistics and report them: you should also draw a picture, such as a histogram, boxplot, or stem-and-leaf display. (This is fairly easy to do if you already have the data in the calculator or computer, and it's a good idea to draw the picture before you compute the summary statistics since a picture is often the easiest way to see that you have made a data entry error.)

If the data is roughly unimodal and symmetric, then the mean and standard deviation are usually the most appropriate measures of center and spread, respectively, for the data set; if, on the other hand, the data is strongly skewed or has one or more major outliers, you should report the median and IQR.

A boxplot of the 2006 property tax data for these homes reveals three outliers:

boxplot of the property tax variable

so we should report the median and IQR for the property tax variable, not the mean and standard deviation. If you do see a major outlier, you should investigate it: if it was the result of a data-entry error, you should correct it; if it was something that never should have been included in the data set in the first place (such as the age of the teacher in a data set consisting of the ages of students in a second-grade class), you can remove it; if it was reported in the wrong units (e.g. someone reporting their height in feet rather than inches) you can convert to the proper units. But you should never remove a data point just because it's an outlier.

You might, however, decide to report the summary statistics both with the outlier included and with it omitted. In the property tax data set, three of the homes are owned by senior citizens who participate in a program that freezes their property taxes (although they or their estate have to pay all of the deferred taxes when the home is sold). This explains the outliers, so we might choose to analyze the remaining 11 homes; if the remaining data is roughly unimodal and symmetric, then we could report the mean and standard deviation for the property taxes of a homeowner in this neighborhood not involved in the deferred-tax program.
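To see how much the outliers matter, here is a rough Python sketch that summarizes the taxes column with and without them. One assumption to flag: the code picks out the outliers using the common 1.5 x IQR fence rule, which is not spelled out in these notes; it is only a stand-in for knowing which homes are in the deferred-tax program. The quartiles helper is my own and takes Q1 and Q3 as the medians of the lower and upper halves of the data.

```python
import statistics

# 2006 property taxes (taxes column of the table)
taxes = [2604, 280, 2353, 756, 2620, 2632, 2779,
         2321, 2663, 2415, 2477, 1386, 2676, 2473]

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    xs = sorted(data)
    half = len(xs) // 2
    return statistics.median(xs[:half]), statistics.median(xs[-half:])

q1, q3 = quartiles(taxes)
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 IQRs outside the quartiles
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [t for t in taxes if t < low_fence or t > high_fence]
kept = [t for t in taxes if low_fence <= t <= high_fence]

print(f"flagged as outliers: {sorted(outliers)}")
for label, values in (("all homes", taxes), ("outliers omitted", kept)):
    print(f"{label}: n = {len(values)}, "
          f"mean = {statistics.mean(values):.1f}, "
          f"median = {statistics.median(values)}")
```

The mean should shift far more than the median when the flagged values are dropped, which is the reason for preferring the median and IQR when outliers are present.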

Comparing groups

Use Data Desk to create a histogram of the size data from the houses.txt data set. You should get something like this:

histogram of the size variable

which appears bimodal. We certainly shouldn't report the mean and standard deviation for a variable like this. In fact, there may be two separate groups here.

With the histogram still open, double-click on the stories variable to open up the variable that lists the number of stories in each house.

stories variable open adjacent to size histogram

Now click on Modify and then Palettes to open up the Data Desk palettes (if some things disappear instead of appear, then click this again to make them reappear).

click Modify then Palettes

Click on the knife symbol to select it:

click on the knife symbol on the palette

Next hold down the SHIFT key and click on the rightmost bar of the histogram:

rightmost bar of histogram selected with knife tool

You should see that all the houses in this upper group correspond to the 2-story houses in the data set. Perhaps it would be wise to investigate the 1-story and 2-story houses separately.

Click on the size variable to select it as Y, then hold down the SHIFT key and click on the stories variable to select it as X:

select size as Y and stories as X

Now click on Plot and Boxplot y by x:

click Plot then Boxplot y by x

You should see side-by-side boxplots, like this:

side-by-side boxplots of size variable for 1- and 2-story houses

Clearly the 2-story houses are bigger than the 1-story houses—which is not terribly surprising! You can make side-by-side boxplots on the TI-84 as well, but you'll need to manually enter the 1-story house sizes into one list and set up a boxplot of it (as described above) and then manually enter the 2-story house sizes into another list and set up another boxplot using Plot2 instead of Plot1; when you press GRAPH you should see both boxplots.
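The same group comparison can be sketched numerically in Python. This is only an illustration with names of my own choosing: it splits the size values by the stories column and prints a few summary values for each group, which is the information the side-by-side boxplots display graphically.

```python
import statistics
from collections import defaultdict

# (size, stories) pairs copied from the table, in the same row order
houses = [(1561, 1), (1038, 1), (1224, 1), (1232, 1), (1995, 2),
          (1714, 1), (1832, 2), (1095, 1), (2011, 2), (1366, 1),
          (1292, 1), (1458, 1), (2031, 2), (1366, 1)]

# Collect the sizes separately for each number of stories
sizes_by_stories = defaultdict(list)
for size, stories in houses:
    sizes_by_stories[stories].append(size)

for stories in sorted(sizes_by_stories):
    sizes = sizes_by_stories[stories]
    print(f"{stories}-story: n = {len(sizes)}, min = {min(sizes)}, "
          f"median = {statistics.median(sizes)}, max = {max(sizes)}")
```

If the smallest 2-story size is larger than the largest 1-story size, the two boxplots will not overlap at all.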

Homework

Work the following problems in Chapter 5: 11, 15, 21, 25, 27, 29, 33, 39 and 45. (As usual, you are encouraged to work additional problems.)

Errata

This is more of an omission than an error, but the full data set of the HALE values for the examples on pp. 72–75 can be found in the data set folder on the CD and on the Intro Stats textbook Web site (look for the file called Ch05_HALE.txt).

The data set for exercise #15 is also on the CD and Web site (Ch05_Population_growth.txt) even though this is not indicated by the usual T icon.

ActivStats

Work the activities on pages 5-1 through 5-4 in the ActivStats lesson book, as time permits.

Additional Resources

Describing Distributions
Episode 3 from Against All Odds features a discussion of means, medians, quartiles and boxplots.
Decisions Through Data: Boxplots
Unit 5 of Decisions Through Data talks about boxplots and Unit 6 discusses standard deviation.
Carnegie Mellon: Introduction to Statistics
This open source course has a lesson called "One Quantitative Variable: Numerical Measures" that may be of interest (see Unit 2, Module 1).
Sofia: Elementary Statistics
Lessons 2.3 and 2.4 of the Sofia Open Content Initiative's Elementary Statistics course include a discussion of summary statistics.
Boxplot tool
A Java applet for creating boxplots.
TI-83 Resource: 1-VarStats
Instructions on using 1-Var Stats with the TI-83; check out the link about entering data into lists if you are having difficulty with that part of the process.
TI-83/84 Troubleshooting
Guide to some common errors encountered when using the TI-84.