Lab 1 Introduction to R

Statistical analysis is at the core of nearly all research projects, and researchers can choose from a wide variety of statistical tools, like SPSS and SAS. Unfortunately, many of these tools are expensive or difficult to master, so this lab manual introduces R, a powerful, open-source statistical analysis program that is available free of charge. Before diving into a statistics package, however, one fundamental topic must be covered: data.


1.1 Types of Data

There are two main types of data, and it is important to understand the difference between them because the type of data determines which analytical tests are appropriate.

  • Continuous data are integer or decimal numbers and are typically used for counts or measures, like a person’s weight, a tree’s height, or a car’s speed. Continuous data are measured with scales that have equal divisions, so the difference between any two values can be calculated. Because characteristics like means and standard deviations can be calculated for continuous data, they are typically analyzed using parametric tests.

  • Categorical data group observations into a limited number of categories; for example, type of pet (cat, dog, bird, etc.) or place of residence (Arizona, California, etc.). One common type of categorical data is generated with an “agree-disagree” type of scale, like “I enjoy reading: Strongly Agree : Agree : Neutral : Disagree : Strongly Disagree.” Because characteristics like means and standard deviations are not meaningful for categorical data, they are typically analyzed using nonparametric tests.
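
As a rough illustration of the two data types, the sketch below stores some continuous values in a numeric vector and some categorical values in a factor. The values and the variable names (weights, pets) are invented for this example, and the R syntax used here is introduced in the sections that follow.

```r
# Continuous data: body weights in kilograms (invented values)
weights <- c(61.5, 72.0, 80.3, 68.4)
mean(weights)   # a mean is meaningful for continuous data
sd(weights)     # so is a standard deviation

# Categorical data: type of pet (invented values)
pets <- factor(c("cat", "dog", "dog", "bird", "cat", "dog"))
table(pets)     # categories are counted rather than averaged
```

The numeric vector supports calculations like a mean, while the factor only supports counting how many observations fall into each category.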


1.2 The R Command Line

All R commands are entered from a “Command Line” environment. Many students find this a bit challenging at first, but once they learn some foundational concepts the command line becomes easy and fast to use. The following demonstration explains each line of the R script shown beneath it.

1.2.1 Demonstration: The R Command Line

  • Line 1: This is a comment, which is used to record notes in a script. In R, a comment starts with a hash mark (#), and everything after that symbol on the same line is ignored. Comments are used frequently in the scripts presented in this manual to explain what each script is doing. Good programmers comment liberally so team members can easily figure out what they did.

  • Line 2: Calculate the value of 3+5.

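  • Line 3: Calculate the value of “5 + 8 * 2”. R follows the standard order of operations, so the multiplication is carried out before the addition and the result is 21.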

  • Lines 6-7: These lines create two variables, MaxScore and MinScore, and assign a value to each. Note two important things about these lines. First, the “assignment” operator is a less-than sign followed by a hyphen, making a left-pointing arrow like <-. That tells R to store the number on the right side of the arrow in the variable named on the left side. Second, capitalization matters in R, so a variable named MaxScore is different from a variable named maxscore. These lines only store values in variables; nothing gets printed to the screen.

A variable is nothing more than a named place in memory for storing data temporarily. Think of it as a “box” that holds something until it is needed later.

  • Line 8: The variable Range is filled with the result of subtracting MinScore from MaxScore.

  • Line 9: Entering a variable name, like Range, on a line by itself causes the value stored in that variable to be displayed.

    # Some basic calculations
    3 + 5
    5 + 8 * 2

    # Using variables
    MaxScore <- 57
    MinScore <- 22
    Range <- MaxScore - MinScore
    Range

The DataCamp blocks found throughout this lab manual are designed for trying out R commands. Click the yellow “Run” button in the lower left corner to execute the entire script and see the results. There are two panels available in the box. Clicking script.R at the top of the box displays the R script, and clicking R Console displays the results of executing the script. The commands in the script can be modified and rerun in order to “play around” with R. R commands can also be entered directly in the R Console and executed by pressing the ENTER key.
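
Two points from the demonstration above can be checked with a few extra lines: arithmetic follows the usual order of operations, and variable names are case sensitive. This is only a small sketch; the lowercase variable maxscore and its value are made up for illustration.

```r
# Multiplication is done before addition unless parentheses say otherwise
5 + 8 * 2      # 21
(5 + 8) * 2    # 26

# MaxScore and maxscore are two different variables
MaxScore <- 57
maxscore <- 60   # hypothetical second variable, not in the original script
MaxScore
maxscore
MaxScore == maxscore   # FALSE
```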

1.2.2 Guided Practice: The R Command Line

Now is the time to try some of the command line skills demonstrated above. In the following R codebox, calculate these values.

  • Set the variable M equal to 15 + ( 37 * 2 )

  • Print the value of M

  • Set the variable N equal to 24 + ( 6 / 2 )

  • Print the value of N

  • Set the variable P equal to 15 / ( 3 + 2 )

  • Print the value of P

    # Set M to 15 + ( 37 * 2 )

    # Print the value of M

    # Set N to 24 + ( 6 / 2 )

    # Print the value of N

    # Set P to 15 / ( 3 + 2 )

    # Print the value of P
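
One possible way to complete the steps above is sketched here. The expected values in the comments come from working the arithmetic by hand.

```r
# Parentheses are evaluated first: 37 * 2 = 74, then 15 + 74
M <- 15 + (37 * 2)
M   # 89

# 6 / 2 = 3, then 24 + 3
N <- 24 + (6 / 2)
N   # 27

# 3 + 2 = 5, then 15 / 5
P <- 15 / (3 + 2)
P   # 3
```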

1.2.3 Activity: The R Command Line

Calculate the value of each of the following expressions:

  • 5 + 8 * 3
  • 10 * 2/4
  • 5 * 3 + 3 - 20

Include the answers to these three problems in the deliverable document for this lab.

    # Find the value of 5 + 8 * 3

    # Find the value of 10 * 2/4

    # Find the value of 5 * 3 + 3 - 20

1.3 Data Frames

A data frame is a table of data, with one row for each observation and one column for each variable, such as the data collected during a research project. An example data frame that is easy to understand would be a spreadsheet that contains the times recorded for a race. R comes configured with 103 built-in data frames used for training, and the R script below is an introduction to one of the data frames used in several of the labs in this manual: stackloss.
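
To make the race example concrete, here is a minimal sketch of a small data frame built by hand with data.frame. The runner names, the times, and the names race, runner, and time_min are all invented for illustration.

```r
# A tiny data frame: one row per runner, one column per variable
race <- data.frame(
  runner   = c("Ana", "Ben", "Cy"),   # invented names
  time_min = c(24.5, 27.1, 22.8)      # invented finishing times in minutes
)

race                  # print the whole data frame
str(race)             # show its structure
min(race$time_min)    # fastest time, using the $ notation for one column
```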

1.3.1 Demonstration: Data Frames

  • Line 1: Any line in R that starts with a hash mark (#) is a comment and will be ignored by R.

  • Line 2: Entering the name of the data frame, stackloss, on a line by itself causes R to print the contents of the entire data frame to the screen. Since stackloss is rather small it is fine to print it, but some data frames have hundreds of rows, and printing one of those may cause the screen to “scroll” for some time before the end of the data frame is reached.

  • Line 3: This prints the structure of the stackloss data frame. The result shows that this is a data.frame with 21 observations (the number of rows in the dataset) of 4 variables (things like Air.Flow). The str function also displays the type of data in each variable; here, all 4 variables are of the numeric (“num”) type. The str function is frequently used to better understand a data frame. The structure for all of the data frames used in this book is contained in the appendix.

  • Line 4: This line puts the maximum Air.Flow value for the stackloss data frame into a variable named MaxAirFlow. Note that the specific vector in the data frame is indicated by both the data frame and vector names separated by a dollar sign, like stackloss$Air.Flow on this line.

  • Line 5: Entering the name of a variable on a line by itself displays the value of that variable, so this line displays the value stored in MaxAirFlow, which is the maximum Air.Flow value.

If some of the lines in the result are too long to fit on one row in the R Console they will wrap around.

    # Exploring the Stack Loss Data Frame
    stackloss
    str(stackloss)
    MaxAirFlow <- max(stackloss$Air.Flow)
    MaxAirFlow
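
Because printing a whole data frame can fill the screen, it is often handy to look at only part of it. The sketch below uses a few base R functions that are not part of the script above (head, nrow, ncol, and names) to inspect stackloss.

```r
head(stackloss)      # first six rows only
nrow(stackloss)      # number of observations (rows)
ncol(stackloss)      # number of variables (columns)
names(stackloss)     # variable names

# A single column, selected with $, is a vector of numbers
stackloss$Air.Flow
max(stackloss$Air.Flow)
```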

1.3.2 Guided Practice: Data Frames

In the following R codebox, explore the airquality data frame.

  • Look at the structure of the airquality data frame.

  • Set MaxWind to the maximum value of Wind in the airquality data frame.

  • Display the value of MaxWind

    # View the structure of the airquality data frame

    # Set MaxWind to the maximum value of Wind

    # Display the value of MaxWind
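
One possible way to complete the steps above is sketched here. The last two lines go beyond the exercise: some columns in airquality, such as Ozone, contain missing values (NA), and max returns NA for such a column unless na.rm = TRUE is supplied.

```r
# Structure of the built-in airquality data frame
str(airquality)

# Maximum wind speed, stored and then displayed
MaxWind <- max(airquality$Wind)
MaxWind

# Ozone contains missing values, so max() needs na.rm = TRUE
max(airquality$Ozone)                 # NA
max(airquality$Ozone, na.rm = TRUE)   # largest non-missing value
```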

1.3.3 Activity: Data Frames

Using the cafe data, find the maximum values for bill and age and include those values in the deliverable document for this lab.

    # Load the cafe data (pre-exercise code from the code box)
    cafe <- read.csv('https://labs.basv316.com/cafe.csv')

    # Find the maximum value of bill.

    # Find the maximum value of age.

1.4 Deliverable

Complete the activities in this lab and consolidate the responses into a single document. Name the document with your name and “Lab 1,” like “George Self Lab 1,” and submit that document for grading.