Chapter 2: Data
The Six W's
Most of the problems in this chapter involve identifying what the authors call The W's: who, what, where, when, why and how. These are the same fundamental questions a journalist might ask when covering a story for a newspaper article. More specifically, we need to be able to answer the following questions:
- Who was studied? (As opposed to who did the studying.)
- Cases are the individual elements of the "Who"
- What was measured or recorded about the "Who"? (Where appropriate, include the units!)
- Variables are the individual elements of the "What" (Include the type!)
- Where was the data collected?
- When was the data collected?
- Why was the data collected?
- How was the data collected?
The "who" and "what" are even more important than the others—if you don't know "who" or "what" you probably don't have any context in which to perform a statistical analysis. But the other W's are important, too—we won't always be able to answer all six of these questions, but knowing that we don't know the answer to a question often tells us something very valuable and leads us to ask other important questions.
Houses
To use as an example for this chapter, I collected some data about the single-family residences on the street where I live in Edmonds. I collected this data on October 3, 2006 from the Web site of the Snohomish County Assessor. For each house I recorded the house number, the size (in square feet), the 2007 assessed value (in thousands of dollars), the lot size (in acres), the 2006 taxes (in dollars) and the number of stories. Here is the complete data set:
| house | size | assess | lot | taxes | stories |
| 20911 | 1561 | 304 | 0.2 | 2604 | 1 |
| 20912 | 1038 | 297.6 | 0.2 | 280 | 1 |
| 20918 | 1224 | 289.5 | 0.17 | 2353 | 1 |
| 20921 | 1232 | 292.8 | 0.17 | 756 | 1 |
| 20924 | 1995 | 314.6 | 0.17 | 2620 | 2 |
| 20927 | 1714 | 322.7 | 0.18 | 2632 | 1 |
| 20930 | 1832 | 336.1 | 0.18 | 2779 | 2 |
| 21003 | 1095 | 279 | 0.18 | 2321 | 1 |
| 21006 | 2011 | 319.5 | 0.18 | 2663 | 2 |
| 21015 | 1366 | 289.3 | 0.18 | 2415 | 1 |
| 21018 | 1292 | 301.4 | 0.18 | 2477 | 1 |
| 21023 | 1458 | 314.3 | 0.18 | 1386 | 1 |
| 21028 | 2031 | 320.9 | 0.18 | 2676 | 2 |
| 21105 | 1366 | 304 | 0.18 | 2473 | 1 |
Let's identify the W's:
- Who: 14 houses
- Cases: each house is a case
- What: house number, size, 2007 assessed value, lot size, 2006 taxes, number of stories
- Variable: house number; Type: categorical (identifier?)
- Variable: size; Type: quantitative; Units: square feet
- Variable: 2007 assessed value; Type: quantitative; Units: thousands of dollars
- Variable: lot size; Type: quantitative; Units: acres
- Variable: 2006 taxes; Type: quantitative; Units: dollars
- Variable: stories; Type: categorical (ordinal)
- When: data was collected on October 3, 2006
- Where: Edmonds, WA
- Why: to use as an example for this class
- How: data was found on the Web site of the Snohomish County Assessor
Notice that there are 14 rows in the data table given above (not including the header row with the variable names) and that each row in the table corresponds to a case (i.e. the "Who" corresponds to the rows). There are six columns and each column contains information about one variable (i.e. the "What" corresponds to the columns).
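To make the rows-are-cases, columns-are-variables idea concrete, here is a minimal Python sketch. It is only an illustration, not part of the textbook or of the course software; the names VARIABLES, ROWS and cases are my own choices, and the numbers are simply transcribed from the table above.

```python
# Variable names (the "What"): one per column of the table.
VARIABLES = ["house", "size", "assess", "lot", "taxes", "stories"]

# One tuple per house (the "Who"), transcribed from the table above.
ROWS = [
    (20911, 1561, 304.0, 0.20, 2604, 1),
    (20912, 1038, 297.6, 0.20, 280, 1),
    (20918, 1224, 289.5, 0.17, 2353, 1),
    (20921, 1232, 292.8, 0.17, 756, 1),
    (20924, 1995, 314.6, 0.17, 2620, 2),
    (20927, 1714, 322.7, 0.18, 2632, 1),
    (20930, 1832, 336.1, 0.18, 2779, 2),
    (21003, 1095, 279.0, 0.18, 2321, 1),
    (21006, 2011, 319.5, 0.18, 2663, 2),
    (21015, 1366, 289.3, 0.18, 2415, 1),
    (21018, 1292, 301.4, 0.18, 2477, 1),
    (21023, 1458, 314.3, 0.18, 1386, 1),
    (21028, 2031, 320.9, 0.18, 2676, 2),
    (21105, 1366, 304.0, 0.18, 2473, 1),
]

# Each case becomes a dictionary keyed by variable name.
cases = [dict(zip(VARIABLES, row)) for row in ROWS]

n_cases = len(cases)          # number of rows  -> size of the "Who"
n_variables = len(VARIABLES)  # number of columns -> size of the "What"

print(f"{n_cases} cases, {n_variables} variables")
print(cases[0])
```

Counting the rows answers "how many cases?" and counting the columns answers "how many variables?", whatever software you happen to use.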
Notice also that the "Who" isn't a group of people in this case; it's a group of houses. The only person mentioned in the information given with the data set is the instructor for this course, but although I gathered the data, I'm not who is being studied here; the houses are.
The house numbers, although they consist of numbers, are categorical, not quantitative. (We might consider them to be an identifier since each house on this street has a unique number, but only if we were interested in houses on this street and nowhere else in the city of Edmonds; if we looked at houses on the adjacent street we might find a house with the same number as one of these 14.)
The next four variables are quantitative; notice that we specify the units in each case. Without units a quantitative variable is meaningless. (If you don't believe me, just ask NASA.)
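Here is a small, hedged illustration of why the units matter, using the first house in the table (house 20911). The variable names are my own, and because the taxes and the assessed value are for different years, the ratio below is not an actual tax rate; the point is only what happens when units are ignored.

```python
# Values for house 20911, taken from the table above.
assess_thousands = 304   # assessed value, in thousands of dollars
taxes_dollars = 2604     # taxes, in dollars

# Careless: divide the raw numbers, ignoring that the units differ.
naive_ratio = taxes_dollars / assess_thousands

# Careful: convert the assessed value to dollars first.
assess_dollars = assess_thousands * 1000
ratio = taxes_dollars / assess_dollars

print(f"ignoring units: {naive_ratio:.2f}")
print(f"with units:     {ratio:.4f}")
```

The careless version suggests taxes of more than eight times the value of the house; the careful version gives a figure of a bit under 1% of the assessed value. Same numbers, very different stories.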
The final variable, number of stories, could be considered quantitative, but since it only takes on two values in this data set, we essentially have two groups here (one-story houses and two-story houses) so it's really functioning as a categorical variable (albeit an ordinal one) in this context. If we had collected data about buildings in downtown Seattle, which might be as short as a single story or as tall as the 76-story Columbia Center, we would probably consider the number of stories to be a quantitative variable.
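One quick way to see that number of stories is acting as a categorical variable here is to list its distinct values and count the houses in each group. The sketch below is only an illustration; the list is the stories column copied from the table in row order, and the names are my own.

```python
from collections import Counter

# Number of stories for each of the 14 houses, in table order.
stories = [1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1]

distinct_values = sorted(set(stories))  # the values the variable takes
group_sizes = Counter(stories)          # how many houses share each value

print(distinct_values)
for value in distinct_values:
    print(f"{value}-story houses: {group_sizes[value]}")
```

With only two distinct values, the variable splits the 14 houses into two groups (10 one-story houses and 4 two-story houses), which is exactly what a categorical variable does.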
Although the "Who" and "What" are the most vital W's, the others are important as well. If we didn't know "Where" these houses were located the information wouldn't be very useful. It might also be helpful to know exactly where in Edmonds these houses are located; given the assessed values, they're certainly not near Puget Sound—if they had views the values would be at least double what they are here!
The "When" is important, too: if we want to use this information to learn about houses and taxes in Edmonds, it wouldn't do us much good to have tax data from 1960, when many of these houses were originally built. The "Why" may not be terribly interesting in this case, but if the data was collected to argue against a property-tax increase, or to argue that such an increase would not negatively impact homeowners in Edmonds, we might wonder if these houses were representative of all houses subject to the proposed tax. The "How" allows someone to check the numbers by visiting the Web site and accessing the data at the original source.
We'll revisit this data set in later chapters when we examine quantitative variables.
Coke vs. Pepsi
Let's try one more example. On September 18, 2006, I administered a survey to a Statistics class. The survey asked several questions, among them the gender of the student (male or female) and whether the student preferred Coke-brand beverages, Pepsi-brand beverages, or neither. Here is the data set:
| gender | beverage |
| female | coke |
| female | coke |
| female | coke |
| female | coke |
| female | coke |
| female | coke |
| female | coke |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | pepsi |
| female | neither |
| female | neither |
| female | neither |
| female | neither |
| female | neither |
| female | neither |
| female | neither |
| female | neither |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | coke |
| male | pepsi |
| male | pepsi |
| male | pepsi |
| male | pepsi |
| male | neither |
| male | neither |
| male | neither |
Let's identify the W's:
- Who: 41 Statistics students
- Cases: each student is a case
- What: gender and beverage preference
- Variable: gender; Type: categorical
- Variable: beverage preference; Type: categorical
- When: September 18, 2006
- Where: Edmonds Community College
- Why: not specified
- How: in-class survey
Notice that there are 41 cases (hence 41 rows in the data table, not including the header row) and 2 variables (hence 2 columns in the table). Note that the first variable can take on two possible values ("male" or "female") while the second variable can take on one of three possible values: "Coke," "Pepsi" or "neither." When identifying the "What" be sure not to confuse the variables with the values that the variables can take on. We'll revisit this data set in the next chapter.
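The difference between variables and their values can be sketched in a few lines of Python. This is an illustration under my own naming choices (rows, variables, values); the rows are rebuilt from the survey table above, in the same order, by repeating each gender and beverage combination as many times as it appears there.

```python
# Rebuild the 41 rows of the survey table, in the same order.
rows = (
    [("female", "coke")] * 7
    + [("female", "pepsi")] * 10
    + [("female", "neither")] * 8
    + [("male", "coke")] * 9
    + [("male", "pepsi")] * 4
    + [("male", "neither")] * 3
)

# The "What": two variables, one per column.
variables = ["gender", "beverage"]

# The distinct values each variable takes on (not the variables themselves).
values = {
    name: sorted({row[i] for row in rows})
    for i, name in enumerate(variables)
}

print(len(rows), "cases;", len(variables), "variables")
for name in variables:
    print(name, "->", values[name])
```

There are five distinct values in all, but still only two variables: "coke" is a value of the beverage variable, not a variable in its own right.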
Exercises
Work exercises 3, 9, 21 and 25 in Chapter 2. (You are of course encouraged to work many more problems in addition to these.)
Errata
On page 10, Mantovani is misspelled—I know, you probably don't care about this one, even if you know who Mantovani was!
For later chapters I'll include more relevant typographical errors in these Resources. If you don't follow something in the text, check here to see if your confusion might be the result of a misprint. There aren't many typos in our textbook (at least compared to other math texts I've used) but if you think you find one not listed here, please let me know!
ActivStats
Work through pages 2-1 through 2-3 in ActivStats, as time permits. These activities shouldn't take very long, but they will offer you a chance to collect some data by playing a computer game and introduce you to working with Data Desk, the statistical analysis program included on the CD. (I'll include instructions for Data Desk beginning with Chapter 3, when we'll actually start using it.)
Additional Resources
- "Tanker Structure Behavior During Collision and Grounding" (PDF, 1363K)
- John C. Daidola.
Marine Technology. Vol. 32 No. 1. January 1995. pp. 20–32.
The article referenced in Exercise #3 of Chapter 2; you may find it interesting to look over the article after you work this problem to see if the information provided in the original article is more helpful in identifying the W's than the summary provided in the textbook.
- "Ages of Oscar-winning Best Actors and Actresses" (PDF, 162K)
- Richard Brown and Gretchen Davis.
Mathematics Teacher. Vol. 83 No. 2. February 1990. pp. 96–102
The article referenced in Exercise #4 of Chapter 2; the comments for the previous article apply here as well.
- "Rapid changes in flowering time in British plants"
- Fitter, A.H. and Fitter, R.S.R.
Science. 296 (5573). May 31, 2002. pp. 1689–1691
Abstract (with a link to a PDF of the full article) of the article referenced in Exercise #10 of Chapter 2.
- "Plants found blooming earlier in the spring"
- USA Today. May 30, 2002.
Associated Press report about the previously cited Science article.
- "Cardiorespiratory fitness and smoking-related and total cancer mortality in men"
- Do Lee, Chong; Blair, Steven N.
Medicine & Science in Sports & Exercise. 34(5): 735–739, May 2002.
Abstract of the article mentioned in Exercise #11 of Chapter 2.
- "Refrigerators" (PDF, 715K)
- Consumer Reports. Vol. 67 No. 8. August 2002. p. 25.
Article referenced in Exercise #21 of Chapter 2.
- "Surprises From Self-Experimentation: Sleep, Mood, and Weight" (PDF, 233K)
- Seth Roberts.
Chance. Vol. 4 No. 2. 2002. pp. 7–17.
Article referenced in Exercise #23 of Chapter 2.