Chapter 3: Displaying and Describing Categorical Data

Categorical vs. Quantitative

This chapter deals exclusively with graphing categorical data, while the next chapter will deal with graphing quantitative data. Don't assume, however, that all of the data in the exercises for this chapter is of the appropriate type. You should always check that data is categorical before constructing a bar chart or pie chart.

Pepsi vs. Coke (revisited)

The Chapter 2 Resources contain an example about a survey administered to 41 Statistics students. For each student, his or her gender was recorded, along with whether the student preferred Coke-brand beverages, Pepsi-brand beverages, or neither. For reference, here is the data set one more time:

gender beverage
female coke
female coke
female coke
female coke
female coke
female coke
female coke
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female pepsi
female neither
female neither
female neither
female neither
female neither
female neither
female neither
female neither
male coke
male coke
male coke
male coke
male coke
male coke
male coke
male coke
male coke
male pepsi
male pepsi
male pepsi
male pepsi
male neither
male neither
male neither

Let's investigate the beverage preference of these 41 students. We can count that 16 prefer Coke and 14 prefer Pepsi, while 11 prefer "neither." We can summarize these counts using a frequency table:

beverage count
Coke 16
Pepsi 14
neither 11
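If you'd like to double-check the tally without counting by hand, here is a minimal Python sketch. It is not part of the Data Desk workflow used in this course; the `groups` list simply restates the data listing above as (gender, beverage, number of students), and the variable names are my own.

```python
from collections import Counter

# The raw data above, restated as (gender, beverage, number of students).
groups = [
    ("female", "coke", 7), ("female", "pepsi", 10), ("female", "neither", 8),
    ("male", "coke", 9), ("male", "pepsi", 4), ("male", "neither", 3),
]

# One entry per student for the beverage variable.
beverage = [bev for _, bev, n in groups for _ in range(n)]

# Counter tallies how many times each category occurs.
counts = Counter(beverage)
for category in ("coke", "pepsi", "neither"):
    print(category, counts[category])
```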

We are dealing with one categorical variable (beverage) here, so the Categorical Data Condition is satisfied and we can display this data using a bar chart or a pie chart. It's not difficult to make a bar chart by hand, especially with the aid of some graph paper:

bar chart of beverage data

Notice that we label each bar to indicate the category it represents.
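The same idea (one labeled bar per category, with length proportional to the count) can be sketched in plain text. This is only an illustration under my own assumptions, using a hypothetical helper called `text_bar_chart`; it is not how Data Desk draws its charts.

```python
counts = {"Coke": 16, "Pepsi": 14, "neither": 11}

def text_bar_chart(counts):
    """Return one labeled bar per category, drawn with one '#' per student."""
    width = max(len(name) for name in counts)  # pad labels so the bars line up
    return [f"{name.ljust(width)} | {'#' * n} {n}" for name, n in counts.items()]

for line in text_bar_chart(counts):
    print(line)
```

Because every bar starts at the same baseline, the counts can be compared at a glance, which is the whole point of a bar chart.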

Bar charts and pie charts with Data Desk

Now let's use technology to make a similar bar chart. Throughout most of the text we'll use the TI-83 or TI-84 calculator as well, but these graphing calculators can't handle categorical data, so for now we'll use computer software exclusively. In the past you may have used a spreadsheet program, such as Microsoft Excel, to make graphs such as this. Excel actually does a decent job of sorting and displaying categorical data, so we could use it in this chapter. We will see in future chapters, however, that Excel does not make appropriate displays of quantitative data (and it has many other problems as well). Thus, even though it is handy for entering and sorting data, we should use other tools for most statistical calculations. In place of Excel, we'll use Data Desk, a program that is included on the CD that came with your textbook.

Note: Some students find Data Desk a bit daunting at first. In part, that's because working with categorical variables in Data Desk is more challenging than working with quantitative variables. I encourage you to follow along with the instructions below, but if you're short on time you can do most of the exercises in Chapter 3 without Data Desk, so just skim through what follows and come back to it when you have time.

If you have worked through the first lessons in ActivStats you should know how to start Data Desk from within the ActivStats program by clicking the magnifying glass icon:

accessing Data Desk from within ActivStats

To access the program directly, make sure the Intro Stats CD is in your computer's CD drive and right-click on My Computer on your desktop or in the start menu and choose Open. Right-click on the drive containing the ActivStats CD and select Open.

in My computer, right-click on the ActivStats CD drive and select Open

Double-click on the Course folder:

double-click on the Course folder

and then double-click on the Data Desk AS.exe program to run Data Desk from the CD:

double-click on Data Desk AS.exe to run Data Desk

If you want, you can click once on this file, and hold down the mouse button while dragging this program to your desktop or a folder (or even a USB flash drive); then you can run it directly from your computer or USB drive so that you don't always need the CD to use Data Desk. I recommend this option, as it's much more convenient than running the program from the CD.

Now we need to get the data into Data Desk. We'll learn how to create data files from scratch later, but for now we'll use a ready-made data file, called beverage.txt, which you can access directly by right-clicking on the following link:

beverage.txt

and choosing "Save Link As..." or "Save Target As..." from the drop-down menu. Save the file to your computer or USB drive to use in the current example.

If you open this file in a text editor (such as Notepad), you'll see something that looks very much like the data set displayed above:

beverage.txt viewed in Notepad

Note that (in addition to the variable names, gender and beverage, in the first row) there are 41 rows, one for each case (i.e. each student), and two columns, one for each variable. This is the format that Data Desk expects to see when we work with a categorical variable.
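To make the "one row per case, one column per variable" layout concrete, here is a small Python sketch that reads data in this shape. The three sample rows are only a stand-in for the real 41-row file, and I am assuming the columns are separated by whitespace (a tab or spaces); `read_raw` is a hypothetical helper, not a Data Desk feature.

```python
import io

# A few rows in the layout described above (the real beverage.txt has 41 data rows).
# Assumption: columns are separated by whitespace such as a tab.
sample = "gender\tbeverage\nfemale\tcoke\nfemale\tpepsi\nmale\tneither\n"

def read_raw(stream):
    """Return (variable_names, cases) from a whitespace-delimited text stream."""
    rows = [line.split() for line in stream if line.strip()]
    names, cases = rows[0], rows[1:]
    # Every case must supply exactly one value per variable.
    if any(len(case) != len(names) for case in cases):
        raise ValueError("each row needs one value per variable")
    return names, cases

names, cases = read_raw(io.StringIO(sample))
print(names)
print(len(cases), "cases")
```

Note that splitting on whitespace would not work for values that contain spaces; for such files, splitting on the tab character alone would be the safer choice.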

To open the text file in Data Desk, start the Data Desk program, click on File and Import...,

click File then Import...

then navigate to the file beverage.txt that you saved on your computer's hard drive or USB drive, click on the filename and click Open.

single click on filename, then click Open

Click Use these variable names:

click Use these variable names

The data file is now open in Data Desk. You should see two variables here:

variables in Data Desk

but we only want to work with the beverage variable at the moment, so click on it to select it. A Y should appear over the variable's icon.

click on variable name to select as Y

Click Calc and Frequency Breakdowns:

click Calc and then Frequency Breakdowns

You should then see a frequency table (with relative frequencies included).

frequency table
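A relative frequency is just a count divided by the total number of cases. As a rough check on the table, here is a short Python sketch that starts from the counts we found earlier; the formatting is my own and will not match Data Desk's output exactly.

```python
counts = {"Coke": 16, "Pepsi": 14, "neither": 11}
total = sum(counts.values())  # 41 students in all

# Relative frequency (as a percentage) = 100 * count / total.
relative = {name: 100 * n / total for name, n in counts.items()}

for name, pct in relative.items():
    print(f"{name:8s}{counts[name]:4d}{pct:7.1f}%")
```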

With the beverage variable still selected as Y, you can click on Plot and then Bar charts:

click Plot and then Bar Charts

to create a bar chart from the beverage data:

bar chart of beverage data

Note that each bar is labeled with its category and that the variable ("beverage") is labeled along the horizontal axis; the scale for the counts is indicated on the vertical axis. (In some cases you may need to resize the bar chart window—by clicking on the symbol in the lower right corner of the window and dragging the mouse as you hold down the left mouse button—to see the full category names.)

You can easily make a pie chart for the same data set (make sure the beverage variable is still selected as Y) by clicking on Plot and then Pie Charts:

pie chart of the beverage data

Note that the legend to the right of the pie chart tells us which slice of the pie corresponds with which category. If you're printing a report in black and white, you can click on the hyperview menu (the little triangle-shaped symbol in the upper-left corner of the pie chart window in Data Desk) and select Use Patterns:

click hyperview triangle, then click Use Patterns

You should then get a chart like this:

pie chart for beverage data using patterns instead of colors

A particularly useful feature of Data Desk is the ability to copy and paste graphs into other applications, such as Microsoft Word. To do this, click on the graph you want to copy (so that its window is the active window in Data Desk) then click on Edit and Copy Window:

click Edit then Copy Window

The graph is now on your computer's clipboard and you can paste it into a Microsoft Word file, for example, to create a report or as part of a HW solution.

Working with summary counts

When many statistical analysis programs (including Data Desk) work with categorical variables, they expect the data to be in "raw" form: in other words, a long table of entries with one row for each case and one column for each variable, as we saw with the file beverage.txt.

In practice, however, information like this is often given in summary form as a frequency table (as in exercise #15 of Chapter 3, or the initial way we summarized our beverage data) or a two-way table (when working with two categorical variables at once, as in exercise #19 of Chapter 3, or as below where we consider beverage preference and gender simultaneously). In the textbook, both exercises #15 and #19 have a red circle with a T inside next to the exercise number, indicating that the data set for that exercise is included on the CD that comes with the book (as well as on the Intro Stats Web site); in both cases the data in these files is given in summary format rather than as "raw data." Unfortunately, Data Desk expects to see the raw data, so what do we do?

Let's begin with the data set for #15. Get the data file (in text format or Data Desk format) from the CD or Web and save it to your desktop. (Note that the data files on the CD and Web do not include exercise numbers, just the chapter number and the topic of the problem; in this case the file is called Ch03_Auditing_reform.TXT or Ch03_Auditing_reform.ise.) If you open the text file in Notepad you'll see something like this:

tab-delimited text file for Ch. 3 #15, viewed in Notepad

Note that this data file is in summary format (like a frequency table) rather than in "raw data" format with one row for each case.

Close Notepad (if it's open) and open Data Desk, then import the file (you'll want to close any other data files that you have open in Data Desk first) just as we did above with the beverage data. Click on the Response variable so that a Y appears over the variable's icon, then hold down the SHIFT key and click on the Percent variable so that an X appears over that variable's icon:

click to select a variable as Y, shift-click to select as X

Now click on Manip and Replicate Y by X:

click Manip then Replicate Y by X

a new variable, Response:Percent, will appear:

derived variable

Click on this new variable (to select it as Y) and then click Plot and Pie Charts to create a pie chart:

pie chart for Ch. 3 #15

Note that we didn't really have counts in the original data file, but rather percentages. This worked fine, though, since the percentages were all rounded off to the nearest integer and we only wanted to create a pie chart, which is a graphical display of percentages, or relative frequencies. If the percentages had been of the form 39.2% this procedure would not have worked. The best practice in such a case would be to multiply all of the percentages by the total number of cases to get the original counts, then use Replicate Y by X to create a "raw data" file with one row for each case.
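The idea behind replicating one variable by another can be sketched in a few lines of Python. This is my own illustration of the concept, not Data Desk's actual implementation; `replicate` and `percents_to_counts` are hypothetical helper names, and the example uses our beverage counts rather than the exercise #15 data.

```python
def replicate(values, counts):
    """Repeat each value counts[i] times, giving one entry per case."""
    if any(c != int(c) or c < 0 for c in counts):
        raise ValueError("counts must be non-negative whole numbers")
    return [v for v, c in zip(values, counts) for _ in range(int(c))]

def percents_to_counts(percents, total):
    """Recover counts from percentages when the total number of cases is known."""
    return [round(p * total / 100) for p in percents]

# Summary form -> raw form: 41 entries, one per student.
raw = replicate(["Coke", "Pepsi", "neither"], [16, 14, 11])
print(len(raw))

# Percentages such as 39.0% cannot be replicated directly,
# so convert them back to counts first.
recovered = percents_to_counts([39.0, 34.1, 26.8], total=41)
print(recovered)
```

The check inside `replicate` mirrors the caveat above: a value like 39.2 is not a whole number of cases, so it has to be converted to a count before replicating.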

Gender vs. beverage

Now let's investigate both of the categorical variables (gender and beverage) in our original data set simultaneously. We might ask the question, "Are the beverage preferences the same for males and females?" Or, in other words, "Is beverage preference independent of gender?"

One way to begin investigating this question is to create a two-way table, which we can do quite easily in this case by hand:


beverage female male
Coke 7 9
Pepsi 10 4
neither 8 3

Note that each category of the beverage variable gets its own row, and each category of the gender variable gets its own column. This two-way table is a special type of two-way table called a contingency table: it is the result of surveying a single group (the 41 Statistics students) and classifying them according to two variables (gender and beverage). If we had considered two or more groups (say students in a Statistics class, students in a Calculus class and students in a Differential Equations class) and classified them according to a single variable (beverage preference) then we would have a two-way table but not a contingency table. (This difference is a subtle one, to be sure, but will become important in Chapter 26.)
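Counting by hand works for 41 cases, but the same cross-tabulation can be sketched in Python by tallying each (gender, beverage) pair. The `groups` list restates the raw data from the beginning of this section; the names used here are my own.

```python
from collections import Counter

# The raw data, restated as (gender, beverage, number of students).
groups = [
    ("female", "coke", 7), ("female", "pepsi", 10), ("female", "neither", 8),
    ("male", "coke", 9), ("male", "pepsi", 4), ("male", "neither", 3),
]
cases = [(gender, bev) for gender, bev, n in groups for _ in range(n)]

cell = Counter(cases)  # tallies each (gender, beverage) combination
rows = ["coke", "pepsi", "neither"]  # one row per beverage category
cols = ["female", "male"]            # one column per gender category
table = [[cell[(g, b)] for g in cols] for b in rows]

print(f"{'':10s}" + "".join(f"{g:>8s}" for g in cols))
for b, row in zip(rows, table):
    print(f"{b:10s}" + "".join(f"{n:8d}" for n in row))
```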

What does the data tell us in contingency table form that we couldn't see before? It appears that males are more likely to prefer Coke and females more likely to prefer Pepsi or "neither." But is this difference significant? In other words, are the differences between males and females big enough to convince us that there really is a difference in beverage preferences between males and females? Or could male and female beverage preferences be about the same, but we just happened to get a class with fewer male Pepsi drinkers than we would otherwise expect? This is a key question that we will spend most of this course developing techniques to answer systematically. Yet even now we can do a pretty good job of answering this question by following the Three Rules of Data Analysis: Make a picture, make a picture, make a picture!

Before we get to making an appropriate graphical display, however, let's spend a bit more time looking at this two-way table. Often it is helpful to include subtotals for each row and column, as follows:


beverage female male total
Coke 7 9 16
Pepsi 10 4 14
neither 8 3 11
total 25 16 41

Note that the subtotals in the rightmost column are the same as in our original frequency table for the beverage variable; this is called a marginal distribution of the beverage variable (since it occurs in the margin of the table). The marginal distribution of the gender variable can be found in the bottom row of the table. Note that the subtotals for the beverage categories sum to 41 (the total number of cases), as do the subtotals for the gender variable. The subtotals allow us to compute relevant percentages more easily. For example:

What percentage of males drink Pepsi? 4/16 = 25%

What percentage of Pepsi drinkers are male? 4/14 ≈ 28.6%
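The two questions above use the same cell (4 male Pepsi drinkers) but different denominators. Here is a short Python sketch that computes the subtotals and both percentages from the two-way table; the dictionary layout is just one convenient way to store the table.

```python
table = {
    "Coke":    {"female": 7,  "male": 9},
    "Pepsi":   {"female": 10, "male": 4},
    "neither": {"female": 8,  "male": 3},
}

# Marginal distributions: row subtotals (beverage) and column subtotals (gender).
row_totals = {bev: sum(cells.values()) for bev, cells in table.items()}
col_totals = {g: sum(cells[g] for cells in table.values()) for g in ("female", "male")}
grand_total = sum(row_totals.values())

# Same numerator, different denominators.
pct_males_pepsi = 100 * table["Pepsi"]["male"] / col_totals["male"]  # 4/16
pct_pepsi_male = 100 * table["Pepsi"]["male"] / row_totals["Pepsi"]  # 4/14

print(f"{pct_males_pepsi:.1f}% of males prefer Pepsi")
print(f"{pct_pepsi_male:.1f}% of Pepsi drinkers are male")
```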

We could also isolate the Pepsi drinkers:


beverage female male total
Pepsi 10 4 14

or the males:


beverage male
Coke 9
Pepsi 4
neither 3
total 16

In each of these cases we call the isolated row or column a conditional distribution. The first of these (with the Pepsi drinkers isolated) is the distribution of the gender variable under the condition that we only look at the Pepsi drinkers; the second is the distribution of beverage preference under the condition that the students are male.
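A conditional distribution is easiest to read as proportions: isolate one row or column, then divide each count by that row's or column's total. The sketch below does this for the two cases just described; the function names are hypothetical and chosen only for this example.

```python
table = {
    "Coke":    {"female": 7,  "male": 9},
    "Pepsi":   {"female": 10, "male": 4},
    "neither": {"female": 8,  "male": 3},
}

def conditional_on_row(table, bev):
    """Distribution of gender among students in one beverage category."""
    total = sum(table[bev].values())
    return {g: n / total for g, n in table[bev].items()}

def conditional_on_column(table, gender):
    """Distribution of beverage preference among students of one gender."""
    total = sum(cells[gender] for cells in table.values())
    return {bev: cells[gender] / total for bev, cells in table.items()}

gender_given_pepsi = conditional_on_row(table, "Pepsi")
beverage_given_male = conditional_on_column(table, "male")
print(gender_given_pepsi)
print(beverage_given_male)
```

Each conditional distribution sums to 1 (that is, 100%), because we are describing everyone who meets the condition.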

Two categorical variables in Data Desk

To create a two-way table from our gender and beverage data in Data Desk, open the beverage.txt file in Data Desk as before, then click on beverage to select it as Y, then hold down the SHIFT key and click on gender to select it as X:

click on beverage to select as Y, then shift-click on gender to select as X

Now click on Calc and Contingency Tables.

click Calc and then Contingency Tables

You should see a table with the same summary counts as in our original two-way table:

contingency table of gender and beverage

Click on the hyperview menu of the active window (the small arrow in the upper-left corner), then select Table Options:

click the hyperview icon, then click Table Options

to access options to display row percentages, column percentages and/or table percentages in addition to (or instead of) counts.

Table Options dialog box

If we select "Percent of column total" (and deselect "Count") we get:

contingency table for beverage and gender (column percentages)

and we can see more easily by examining the percentages that there does seem to be a difference between the distribution of beverage preferences for males and females.
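For reference, here is a Python sketch of the "percent of column total" calculation, with the overall percentages added as a third column for comparison. This is my own illustration and its layout will differ from Data Desk's table. If beverage preference were independent of gender, we would expect the female and male columns to sit close to the overall column.

```python
table = {
    "Coke":    {"female": 7,  "male": 9},
    "Pepsi":   {"female": 10, "male": 4},
    "neither": {"female": 8,  "male": 3},
}
genders = ("female", "male")
col_totals = {g: sum(cells[g] for cells in table.values()) for g in genders}
grand_total = sum(col_totals.values())

# Percent of column total: each gender's column sums to 100%.
col_pct = {
    g: {bev: 100 * cells[g] / col_totals[g] for bev, cells in table.items()}
    for g in genders
}
# Overall (marginal) percentages of the beverage variable, for comparison.
overall = {bev: 100 * sum(cells.values()) / grand_total for bev, cells in table.items()}

print(f"{'':8s}{'female':>8s}{'male':>8s}{'overall':>9s}")
for bev in table:
    print(f"{bev:8s}{col_pct['female'][bev]:8.1f}{col_pct['male'][bev]:8.1f}{overall[bev]:9.1f}")
```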

Now let's create a visual display that will allow us to compare males and females. With beverage again selected as Y, and gender again selected as X, click on Manip and Split into Variables by Group....

click Manip then Split into Variables by Group...

Data Desk will open up a new window called gender with two data sets called female and male:

beverage data split into two groups by gender
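Conceptually, splitting by group just sorts the Y values into separate lists according to the matching X value. Here is a minimal Python sketch of that idea; `split_by_group` is a hypothetical helper of my own, not Data Desk's implementation.

```python
def split_by_group(y, x):
    """Split the values of y into one list per distinct value of x."""
    groups = {}
    for value, label in zip(y, x):
        groups.setdefault(label, []).append(value)
    return groups

# The raw data, restated as (gender, beverage, number of students).
data = [
    ("female", "coke", 7), ("female", "pepsi", 10), ("female", "neither", 8),
    ("male", "coke", 9), ("male", "pepsi", 4), ("male", "neither", 3),
]
gender = [g for g, _, n in data for _ in range(n)]
beverage = [b for _, b, n in data for _ in range(n)]

by_gender = split_by_group(beverage, gender)
for label, values in by_gender.items():
    print(label, len(values))
```

Each resulting list is an ordinary one-variable data set, which is why we can go on to chart the groups separately.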

Now click on female to select it as Y, then hold down the CTRL key and click on male to also select it as Y:

click female then control-click male so both are selected as Y

Now click on Plot and Pie Charts to see side-by-side pie charts comparing the beverage preferences of males and females:

side-by-side pie charts comparing beverage preferences for males and females

We can clearly see that the pie charts appear to be quite different, which leads us to conclude that there is evidence that beverage preference and gender are not independent.

Note a couple of things about this last statement. We said that there was evidence that the two variables are not independent; we did not state conclusively that the two variables were not independent. Another sample of 41 different students might lead us to a different conclusion.

We also did not say that the variables were dependent, but rather that they (appear to be) not independent. There's a subtle but important difference here. Saying "dependent" implies that one thing depends on the other, but we don't know that this is the case. Just because there is an association between two variables doesn't mean that there is a cause-and-effect relationship. It's possible that there is a lurking variable.

In this case, it's highly unlikely that beverage preference influences a person's gender! It could be the case that a person's gender influences their beverage preference, but it's also possible that Coke advertises its products more heavily to males and Pepsi to females. The key here is that we can't tell if one thing causes or influences another, merely that there's an association between the two.

Summary counts (again)

Now let's look at the contingency table in exercise #19. This is a contingency table because we have one group (students who applied to magnet schools) classified according to two variables (Ethnicity and Admission Decision).

Save the file Ch03_Magnet_schools_revisite.TXT or Ch03_Magnet_schools_revisite.ise from the CD or Intro Stats Web site to your desktop. If you examine this file using a text editor, you'll note that it looks like this:

exercise #19 data set viewed as a text file

Note that there is one column for each variable and one row for each possible combination of the values of these two variables, with summary counts in the third column. In order to analyze this data using Data Desk, we need to turn these summary counts into a "raw" data file, as we did before, although in this case it is slightly more complicated.

Import the file into Data Desk (after closing any previously open data files) then click on the Ethnicity variable so that a Y appears over its icon, then hold down the CTRL key and click on the Admission Decision variable so that another Y appears over that variable's icon. Finally, hold down the SHIFT key and click on the Counts variable so that an X appears:

control-click for Y and shift-click for X

Now click on Manip and Replicate Y by X: two new variables, Ethnicity:Counts and Admission Decision:Counts will appear.

Ethnicity:Counts and Admission Decision:Counts
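Replicating two variables by the same count column keeps the rows paired, so each replicated case still carries both of its values. The sketch below illustrates the idea in Python. Since the contents of the exercise #19 file are not reproduced in these notes, it uses our gender and beverage counts in the same "one row per combination" layout; `replicate_rows` is a hypothetical helper name.

```python
# Summary form: one row per combination of the two variables, plus a count.
summary = [
    ("female", "Coke", 7), ("female", "Pepsi", 10), ("female", "neither", 8),
    ("male", "Coke", 9), ("male", "Pepsi", 4), ("male", "neither", 3),
]

def replicate_rows(summary):
    """Expand (value_1, ..., value_k, count) rows into one tuple per case."""
    raw = []
    for *values, count in summary:
        raw.extend([tuple(values)] * count)
    return raw

raw = replicate_rows(summary)

# Two parallel "raw" variables, analogous to the two new variables above.
gender = [g for g, _ in raw]
beverage = [b for _, b in raw]
print(len(raw), "cases")
```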

Click on Admission Decision:Counts so that a Y appears and then hold down SHIFT and click on Ethnicity:Counts so that an X appears:

click Admission Decision:Counts and shift-click Ethnicity:Counts

Now click Manip and Split into Variables by Group... to split the data into three groups, Asian, Black/Hispanic and White:

Admission Decision split into three groups by Ethnicity

We can now create a pie chart for each of these groups in order to compare them. Hold down the CTRL key and click each of the groups so that all three are selected as Y:

control-click each group to select as Y

and then click on Plot and Pie Charts to see side-by-side pie charts comparing the admission decisions for each ethnic group:

side-by-side pie charts comparing admission decisions across ethnicities

If the variables Ethnicity and Admission Decision were independent, we would expect each of these pie charts to look roughly the same; they don't, so we conclude that there is evidence that Ethnicity and Admission Decision are not independent.

ActivStats

The ActivStats CD offers more guidance in using Data Desk with categorical variables. View the lessons on pages 3-1 and 3-2, as time permits.

Exercises

Work exercises 7, 19, 21, 23 and 31 in Chapter 3. (You are of course encouraged to work additional problems.)

Errata

At the bottom of page 29, two of the bars in the bar chart are labeled incorrectly: the "Small part of diet" bar should be labeled 7.7% (201/2621 ≈ 7.7% ≠ 8.3%) and the "Moderate part" bar should be labeled 7.0% (209/2978 ≈ 7.0% ≠ 6.9%).

Likewise, at the top of page 30, the overall percentage of men who contracted prostate cancer among men with small or moderate amounts of fish in their diet should be 7.3%, not 7.4%.

Exercise #1 should read: "Find a bar chart of categorical data..." (with "chart" in place of "graph").

Part c in exercise #31 should read: "Compare these distributions with a segmented bar chart." (with "chart" in place of "graph").

Part d in exercise #36 should read "Explain." (Not "Explain?")
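The arithmetic behind the first two corrections is easy to verify. The fractions 201/2621 and 209/2978 come straight from the errata above; pooling the two groups to get the overall 7.3% figure is my assumption about how that number arises, so treat the last line as a plausibility check only.

```python
small = 100 * 201 / 2621     # "Small part of diet" bar
moderate = 100 * 209 / 2978  # "Moderate part" bar

# Assumption: the overall percentage pools the two groups above.
combined = 100 * (201 + 209) / (2621 + 2978)

print(f"small: {small:.1f}%  moderate: {moderate:.1f}%  combined: {combined:.1f}%")
```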

Additional Resources

"The Question of Causation"
Episode 11 from Against All Odds explores some of the ideas of Chapter 3, including using segmented bar charts to do the same sort of analysis we did with side-by-side pie charts above.
Introduction to Statistics
Carnegie Mellon's open source course has a lesson called "One Categorical Variable" that may also be of interest (see Unit 2, Module 1).
Bar chart tool
A Java applet for creating bar charts.
Pie chart tool
A Java applet for creating pie charts.
Create a Graph
Useful site for creating various types of graphs online.
Simpson's paradox
Another example of Simpson's paradox.
"Educational attainment in the United States"
US Census data referenced in Exercise #8 of Chapter 3.
"Polls show paranormal beliefs on the rise, evolution belief on the decline" (PDF, 21K)
Skeptic, Vol. 9 No. 1, page 10
Article about the Gallup poll referenced in Exercise #9 of Chapter 3.
"ATF report renews calls for gun control" (PDF, 18K)
Gary Fields
USA Today, June 22, 2000
Article referenced in Exercise #10 of Chapter 3.
"Complications from therapeutic modalities: results of a national survey of athletic trainers"
S. Nadler, et al.
Archives of Physical Medicine and Rehabilitation, Volume 84, Issue 6, Pages 849-853
Abstract of article mentioned in Exercise #16 of Chapter 3.
"Trends in Twin Birth Outcomes and Prenatal Care Utilization in the United States, 1981-1997"
JAMA, 2000;284:335-341
Abstract of article mentioned in Exercise #30 of Chapter 3.