Lab 2 Central Measures

2.1 Introduction

It is often desirable to characterize an entire vector of numbers with a single value, and the number that is “in the middle” of the vector would seem the most logical one to use. Students in elementary school are taught how to find the average of a group of numbers, and they learn that the average is the best representation of that entire group. In statistics, though, several different numbers are used to represent an entire vector, and these are collectively known as the Central Measures: numbers that are in the “middle” of the vector.

A vector is nothing more than a list of related items. For example, a vector of street names may look like {“elm”, “main”, “first”, “raven”} and a vector of test scores may look like {83, 91, 76, 88}.
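In R, vectors like these are typically built with the c() function. The following is a minimal sketch; the names streets and scores are illustrative assumptions and are not used elsewhere in this lab.

```r
# Build the two example vectors from the text with c()
# (the variable names are illustrative)
streets <- c("elm", "main", "first", "raven")
scores <- c(83, 91, 76, 88)

# Every item in an R vector has the same type
print(class(streets))  # character data
print(class(scores))   # numeric data
```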

2.2 N

One of the simplest measures is nothing more than the number of items in a vector. For example, for the vector 5, 7, 13, 22 the N (number of items) is 4. Technically, N does not identify the middle of a vector, but it is an important measure and is included here for completeness.
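As a quick check of the example above, the four-item vector can be typed in directly and counted with length(). This is only a sketch; the names x and n are illustrative.

```r
# The example vector from the text (the name x is illustrative)
x <- c(5, 7, 13, 22)

# N is the number of items in the vector
n <- length(x)
print(n)
```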

2.2.1 Demonstration: N

To find the number of items in a vector ( N ), use the length procedure as found in the R script below.

  • Line 1: Any line in R that starts with a hash is a comment and is ignored when the script is executed. (Fun fact: another name for that symbol is “octothorpe”.)
  • Line 2: The length of the eruptions vector in the faithful data frame is printed, which is the same as the number of items in the vector, or N. Notice that a vector is identified by the data frame name, the dollar sign operator, and then the vector name.

    # Finding N
    length(faithful$eruptions)

2.2.2 Guided Practice: N

Using the attitude data frame, find N for critical.

    # Find N for 'critical' in the 'attitude' data frame

    # Solution:
    length(attitude$critical)

2.2.3 Activity: N

Using the cafe data, find the number of items in the age vector and include that number in the deliverable document for this lab.

    # Pre-exercise code: load the cafe data
    cafe <- read.csv('https://labs.basv316.com/cafe.csv')

    # Find the number of items in the age vector.

2.3 Mean

The mean, which is taught in elementary school as the average, is calculated by adding all of the data items together and then dividing that sum by the number of items ( N ). For example, given the vector 6, 8, 9, the total is 23 and that divided by 3 ( N ) is about 7.67; so the mean of 6, 8, 9 is about 7.67.
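The arithmetic above can be sketched in R by computing the sum and N separately and comparing the result with the built-in mean function. The variable names are illustrative.

```r
# The example vector from the text
x <- c(6, 8, 9)

total <- sum(x)      # add all of the items together
n <- length(x)       # number of items, N
manual_mean <- total / n

print(round(manual_mean, 2))  # mean rounded to two decimal places
print(mean(x))                # the built-in function gives the same mean
```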

2.3.1 Demonstration: Mean

To find the mean of a vector, use the mean procedure as found in the R script below.

  • Line 1: Any line in R that starts with a hash is a comment and is ignored when the script is executed.
  • Lines 2-3: Calculate the means for eruptions and waiting in the faithful data frame.
  • Lines 6-9: Calculate the means for various vectors in the iris data frame.

This demonstration contains six different examples of finding the mean.

    # Find means for faithful
    mean(faithful$eruptions)
    mean(faithful$waiting)

    # Find means for iris
    mean(iris$Sepal.Width)
    mean(iris$Sepal.Length)
    mean(iris$Petal.Width)
    mean(iris$Petal.Length)

2.3.2 Guided Practice: Mean

Using the attitude data frame, find the mean for critical.

    # Find the mean for 'critical' in the 'attitude' data frame

    # Solution:
    mean(attitude$critical)

2.3.3 Activity: Mean

Using the cafe data frame, find the mean for age. Include that answer in the deliverable document for this lab.

    # Pre-exercise code: load the cafe data
    cafe <- read.csv('https://labs.basv316.com/cafe.csv')

    # Using the cafe data frame, find the mean for age.

2.4 Trimmed Mean

If a vector has values that are unusually large or small, then the mean is often skewed such that it no longer represents the “average” value. As an example, the weight, in grams, of chicks being fed different diets over 21 days ranges from 35 to 373. This is a huge spread with several unusually high weights (some chicks seemed to fare much better on their diets than others). In the end, the mean weight, 121.82g, is not representative of the entire group because those few heavy chicks skew the mean upward. One way to compensate for unusually high or low values in a vector is to use a trimmed mean (sometimes called a truncated mean).

A trimmed mean is calculated by removing a specific fraction of the values from both the top and bottom of the sorted vector and then finding the mean of the remaining values. In the case of the chick weights, the following table shows the mean weight for several trim amounts, where the trim is the fraction of values removed from each end.

Table 2.1: Trimmed Mean
Trim (fraction from each end)    Mean weight
0.00                             121.82g
0.05                             116.48g
0.10                             113.18g
0.15                             110.83g

The untrimmed mean is 121.82g, but as values are trimmed the unusually high weights are removed. Thus, the trimmed mean better represents the entire group of chick weights. It is not possible, or desirable, to trim a different amount from the top and bottom; a trimmed mean always trims the same amount from both ends of the vector. In actual practice, a trimmed mean is not commonly used, since it is difficult to know how much to trim and the resulting mean may be just as skewed as if no values were trimmed. Thus, when unusual values are suspected, the best “middle” term to report is the median.
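The trimming steps described above can be sketched by hand and compared with the trim argument of R's mean function. The ten-item vector x is hypothetical data chosen only for illustration, and the manual steps assume that the number of values dropped from each end is the vector length times the trim fraction, rounded down. The last lines recompute the trimmed means for the weight vector of the built-in ChickWeight data frame so they can be compared with Table 2.1.

```r
# Hypothetical vector with one unusually high value
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 100)
trim <- 0.10

# Manual trimmed mean: sort, drop k values from each end, average the rest
sorted <- sort(x)
k <- floor(length(sorted) * trim)
kept <- sorted[(k + 1):(length(sorted) - k)]
manual_trimmed <- mean(kept)

# Built-in trimmed mean for comparison
builtin_trimmed <- mean(x, trim = trim)

print(mean(x))           # untrimmed mean, pulled upward by the value 100
print(manual_trimmed)
print(builtin_trimmed)

# Trimmed means of the chick weights for the trim amounts in Table 2.1
trims <- c(0, 0.05, 0.10, 0.15)
chick_means <- sapply(trims, function(t) mean(ChickWeight$weight, trim = t))
print(round(chick_means, 2))
```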

2.4.1 Demonstration: Trimmed Mean

The trimmed mean can be easily found in R by adding a “trim” parameter to the mean command, as demonstrated in the following script.

  • Line 1: Any line in R that starts with a hash is a comment and is ignored when the script is executed.
  • Lines 2-4: Calculate the trimmed mean by including “trim=…” in the mean function. Line 2 is the untrimmed mean for the accel vector in the attenu data frame, Line 3 trims 15% from each end of the vector, and Line 4 trims 25% from each end.

    # Find trimmed means for attenu
    mean(attenu$accel)
    mean(attenu$accel, trim=0.15)
    mean(attenu$accel, trim=0.25)

2.4.2 Guided Practice: Trimmed Mean

Using the attitude data frame, find the mean for critical with 15% trimmed.

    # Find the mean for 'critical' with 15% trimmed in the 'attitude' data frame

    # Solution:
    mean(attitude$critical, trim=0.15)

2.4.3 Activity: Trimmed Mean

Using the cafe data frame, find the mean for miles with 15% trimmed. Include that answer in the deliverable document for this lab.

    # Pre-exercise code: load the cafe data
    cafe <- read.csv('https://labs.basv316.com/cafe.csv')

    # Using the cafe data frame, find the mean for miles with 15% trimmed.

2.5 Median

The median is found by listing all of the data items in numeric order and then mechanically finding the middle item. For example, using the vector 6, 8, 9, the middle item (or median) is 8. If the vector has an even number of items, then the median is calculated as the mean of the two middle items. For example, in the vector 6, 8, 9, 13 the median is 8.5, which is the mean of 8 and 9, the two middle terms.
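Both cases above can be sketched in R. The vectors are entered out of order on purpose to show that the median function does the sorting itself; the variable names and the manual steps for the even-length case are illustrative.

```r
# Example vectors from the text, deliberately unsorted
odd_vec <- c(9, 6, 8)
even_vec <- c(13, 6, 9, 8)

print(median(odd_vec))   # middle item of 6, 8, 9
print(median(even_vec))  # mean of the two middle items of 6, 8, 9, 13

# Manual version for the even-length vector
sorted <- sort(even_vec)
mid <- length(sorted) / 2
manual_median <- mean(sorted[c(mid, mid + 1)])
print(manual_median)
```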

The median is very useful in cases where the vector has unusually small or large values. As an example of using a median rather than a mean, consider the vector 5, 6, 7, 8, 30. The mean is (5+6+7+8+30)/5 = 56/5 = 11.2. However, 11.2 is clearly much higher than most of the other numbers in that vector since one unusually high value, 30, is significantly skewing the mean upward. A much better representation of the center of this vector is 7, which is the median. To revisit the chick weights used above, the median of that vector is 103, which is much more representative of the “middle” weight than either the mean or the trimmed mean.

As another example where the median is the best central measure, suppose a newspaper reporter wanted to find the “average” wage for a group of factory workers. The ten workers in that factory all have an annual salary of $25,000; however, the supervisor has a salary of $125,000. In the newspaper article, the supervisor is quoted as saying that his workers have an average salary of about $34,091. That is correct if the mean of all those salaries (including the supervisor’s) is reported, since $375,000 divided by 11 is about $34,091, but that number is clearly higher than any sort of reasonable “average” salary for workers in the factory due to the one unusually large salary. In this case, the median of $25,000 would be much more representative of the “average” salary. The median is typically reported for salaries, home values, and other vectors where one or two unusually small or large values distort the reported “middle” value.
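The two examples above can be checked with a short sketch that compares the mean and median of each vector. The variable names are illustrative; the numbers come from the text.

```r
# Five-item vector with one unusually high value
skewed <- c(5, 6, 7, 8, 30)
print(mean(skewed))    # pulled upward by the value 30
print(median(skewed))  # the middle item

# Ten workers at $25,000 plus one supervisor at $125,000
salaries <- c(rep(25000, 10), 125000)
print(mean(salaries))    # the mean includes the supervisor's salary
print(median(salaries))  # the middle salary
```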

If the vector contains no unusually small or large values, then the mean and median tend to be close to each other; but if those values exist then these two measures become separated, often by a large amount. Consider the weight vector from the ChickWeight data frame mentioned in the Trimmed Mean section above. That vector has a mean of 121.82 and a median of 103. This difference, 18.82, is about 15% of the mean and is substantial. The size of this difference indicates that there are unusually small or large values in the vector that may be skewing the mean.
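The gap between the two measures can be computed directly. This is a minimal sketch that assumes the built-in ChickWeight data frame that ships with R; the names w, gap, and rel_gap are illustrative.

```r
# Chick weights from the built-in ChickWeight data frame
w <- ChickWeight$weight

gap <- mean(w) - median(w)   # how far the mean sits above the median
rel_gap <- gap / mean(w)     # the gap as a fraction of the mean

print(round(gap, 2))
print(round(rel_gap * 100, 1))  # the gap as a percentage of the mean
```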

2.5.1 Demonstration: Median

To find the median of a vector, use the median procedure as found in the R script below.

  • Lines 2-3: Calculate the medians for eruptions and waiting in the faithful data frame.
  • Lines 6-9: Calculate the medians for various vectors in the iris data frame.

This demonstration contains six different examples of finding the median.

    # Find medians for faithful
    median(faithful$eruptions)
    median(faithful$waiting)

    # Find medians for iris
    median(iris$Sepal.Width)
    median(iris$Sepal.Length)
    median(iris$Petal.Width)
    median(iris$Petal.Length)

2.5.2 Guided Practice: Median

Using the attitude data frame, find the median for critical.

    # Find the median for 'critical' in the 'attitude' data frame

    # Solution:
    median(attitude$critical)

2.5.3 Activity: Median

Using the cafe data frame, find the median for bill. Include that answer in the deliverable document for this lab.

    # Pre-exercise code: load the cafe data
    cafe <- read.csv('https://labs.basv316.com/cafe.csv')

    # Using the cafe data frame, find the median for bill.

2.6 Mode

The mode is used to describe the center of nominal or ordinal data and is nothing more than the item that was most commonly found in the vector. For example, if a question asked respondents to select their zip code from a list of five local codes and “12345” was selected more often than any other then that would be the mode for that item. Calculating the mode is no more difficult than counting the number of times the various values are found and choosing the one that is most frequent. As an example, the mtcars data frame lists the following for the number of cylinders in each car:

Cylinders N
4 11
6 7
8 14

Since 14 cars have 8 cylinders, and that is the most common number of cylinders, 8 is the mode for this data item.

It does not make much sense to calculate the mean or median for categorical data since those have no numeric value; however, reports frequently list the mean for Likert-style questions (ordered categorical data) by equating each level of response to a number and then calculating the mean of those numbers. For example, imagine that a student housing survey asked respondents to select among “Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree” for a statement like “I like the food in the cafeteria.” That is clearly categorical data and while “Agree” is somehow better than “Disagree” it would be wrong to try to quantify that difference as “two points better.” Sometimes, though, researchers will assign a point value to those responses like “Strongly Disagree” is one point, “Disagree” is two points, and so forth. Then they will calculate the mean for the responses on a survey item and report something like “The question about the food in the cafeteria had a mean of 3.24.” It would be impossible to know how to interpret that sort of number. Are students 0.24 units above “Neutral” on liking the cafeteria food?
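The practice described above can be sketched with a small set of hypothetical survey responses (invented here purely for illustration, not taken from any real survey). The sketch codes the five levels as 1 through 5, computes the mean of those codes, and then finds the most common response, which is easier to interpret.

```r
# The five response levels, in order
levels_ordered <- c("Strongly Disagree", "Disagree", "Neutral",
                    "Agree", "Strongly Agree")

# Hypothetical responses, invented for illustration only
responses <- factor(
  c("Agree", "Neutral", "Agree", "Disagree",
    "Strongly Agree", "Agree", "Neutral"),
  levels = levels_ordered
)

# Coding each level as 1 through 5 and averaging gives a number
# that is hard to interpret
coded_mean <- mean(as.integer(responses))
print(round(coded_mean, 2))

# Counting the responses and reporting the most common one is clearer
counts <- table(responses)
mode_response <- names(counts)[which.max(counts)]
print(counts)
print(mode_response)
```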

There is no procedure for finding the mode in this tutorial, but the tutorial on Frequency Tables shows an easy way to determine the mode of a variable. In fact, the cylinders table above was created from mtcars$cyl with the table command in R, and that table was used to determine the mode for cylinders.
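As a rough sketch of that approach, the cylinder counts can be produced with table and the most frequent value picked out with which.max. The names cyl_counts and mode_cyl are illustrative, and note that which.max returns only the first value when there is a tie for the highest count.

```r
# Count how many cars have each number of cylinders
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# The mode is the value with the highest count
# (which.max returns the first one if there is a tie)
mode_cyl <- names(cyl_counts)[which.max(cyl_counts)]
print(mode_cyl)
```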


2.7 Deliverable

Complete the activities in this lab and consolidate the responses into a single document. Name the document with your name and “Lab 2,” for example “George Self Lab 2”, and submit that document for grading.