Lab 8 Visualizing Data

8.1 Introduction

It is often more useful to visualize the contents of a data frame than to read them as a table. Visualizations can make subtle relationships evident that would be missed by scanning rows of numbers. Earlier tutorials described visualization techniques for descriptive and frequency data; this tutorial applies the same approach to quantitative data.


8.2 Histogram

A histogram is a graph that shows the distribution of data that are continuous in nature, for example, age or height. A histogram resembles a bar chart but there is an important difference: a histogram is used for continuous data while a bar chart is used for categorical data. To emphasize that difference, histograms are normally drawn with no space between the bars (the data are continuous along the entire x-axis) while bar charts are normally drawn with a small space between bars (the data are categorical along the x-axis). As an example of a histogram, the next figure shows the daily high temperature in New York City from May to September 1973, taken from the airquality data frame.

Notice that there is not a separate bar for each temperature; rather, R has grouped a range of five temperatures into a single bar. Thus, there is one bar that combines the temperatures 70-74 rather than separate bars for each of those temperatures. This histogram helps researchers judge whether temperature is normally distributed, that is, whether it displays a typical “bell” curve. The graph shows more temperatures around the 80° level than at either extreme, so the histogram is roughly consistent with normally distributed data.
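
The grouping that R chooses can be inspected directly. The following is a minimal sketch, assuming the default hist settings were used for the figure (the figure's actual settings are not shown in this tutorial). It computes the histogram without drawing it and lists each bar's range and count. Note that hist uses right-closed intervals by default, so exactly which temperatures land in each bar depends on the right argument.

```r
# Compute the histogram for the airquality temperatures without drawing it.
h <- hist(airquality$Temp, plot = FALSE)

# Break points that separate the bars (one more break than there are bars).
print(h$breaks)

# Pair each bar's range with the number of observations it contains.
bins <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```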

As another example, the following figure shows a histogram for the body temperature of a beaver recorded every 10 minutes over the course of several hours, as found in the beaver1 data frame. This histogram shows approximately normally distributed data, though there is an outlier at a temperature of 37.6°.

Histograms can also reveal skewed data, which is important for researchers to know during a project’s exploratory phase. Consider, for example, the following figure, which shows the shape variable for the petroleum rock samples in the rock data frame. While this histogram still has a single peak with counts falling off on either side, there is a longer “tail” to the right so the data have a positive skew.

Finally, consider the histogram in the following figure, which is taken from the peri vector in the rock data frame. This shows a bi-modal distribution where there are two clear peaks in the data. It is critical for researchers to know that the data are bi-modal before they begin analyzing them since having two modes can violate the assumptions behind certain statistical tests.

8.2.1 Demonstration: Histogram

The following script creates an example histogram.

  • Line 2: This is the beginning of the histogram function (it ends on Line 8). For this histogram the Speed variable from the morley data frame is specified as the data source for the histogram.
  • Lines 3-5: This creates the main title of the histogram along with the labels for the x-axis and y-axis.
  • Line 6: To create a histogram, R analyzes the values contained in a variable and creates “bins” for those values. That means that many of the continuous values will be grouped into a single bin for analysis. When the “breaks” parameter is a single number, it tells R roughly how many bins to create. In this case, eight are requested. R treats that number as a “suggestion” and chooses “pretty” break points close to it, so the histogram may have more or fewer bins than requested. Often, changing the number of breaks by just one or two will not change the histogram produced, so researchers should experiment with the “breaks” number to get the best possible representation of the data.
  • Line 7: This specifies that 10 colors will be used from the “cm.colors” palette to shade the various bars in the histogram. Researchers need to experiment a bit with the color palette and number of colors to get the best result; however, “cm.colors” with about as many colors as there are bars seems to work well for this histogram. (Note: More information about color can be found in the About Colors section in the appendix.)

    # Histogram
    hist(morley$Speed,
      main = "Morley's Experiment",
      xlab = "Speed",
      ylab = "Frequency",
      breaks = 8,
      col = cm.colors(10)
    )

The DataCamp interface generates graphics in a Plots tab but because of the size of the interface those plots are “squished” and impossible to read. Click the double-headed arrow button on the Plots tab to open the graph in a larger window for evaluation and copying to a document. If the graphic does not open in a larger window then temporarily pause the browser’s pop-up blocker.
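
To see how R treats the “breaks” number as a suggestion, the following sketch requests several different bin counts for morley$Speed and reports how many bins R actually creates. The particular values in requested are arbitrary choices for illustration.

```r
# Ask for several different numbers of bins and record how many R actually uses.
requested <- c(4, 8, 12, 20)
actual <- sapply(requested, function(n) {
  h <- hist(morley$Speed, breaks = n, plot = FALSE)
  length(h$counts)
})

# Compare the requested and actual bin counts side by side.
print(data.frame(requested = requested, actual = actual))
```

If two requested values produce the same actual count, the two histograms will look identical, which is why small changes to “breaks” often have no visible effect.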

8.2.2 Guided Practice: Histogram

Using the USJudgeRatings data frame, create a histogram of the INTG vector. The histogram should meet these specifications:

  • Title: US Judge Ratings
  • X-axis label: Integrity Rating
  • Y-axis label: Frequency
  • Breaks: 11
  • Color: eight colors from the cm.colors palette
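
One way to meet these specifications is sketched below. It follows the pattern of the demonstration script; the result is stored in h (a name chosen here for illustration) only so that it can be inspected afterward.

```r
# Histogram of judges' integrity ratings, following the listed specifications.
h <- hist(USJudgeRatings$INTG,
  main = "US Judge Ratings",
  xlab = "Integrity Rating",
  ylab = "Frequency",
  breaks = 11,
  col = cm.colors(8)
)
```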

8.2.3 Activity: Histogram

Using the cafe data frame, create a histogram of age. The histogram should meet these specifications:

  • Title: Histogram
  • X-axis label: Age
  • Y-axis label: Frequency
  • Breaks: 10
  • Colors: 12 colors from the topo.colors palette

Copy/paste the histogram in the deliverable document for this lab.

    cafe <- read.csv('https://labs.basv316.com/cafe.csv')
    # Using the cafe data frame, create a histogram of age that matches the listed specifications.

8.3 Density Plot

A density plot provides much the same information as a histogram but it is smoothed so it is easier to read. Here is the same New York City temperature data from an earlier histogram but drawn as a density plot.

Given the previous plot it is natural to wonder, “What is density?” Density is a calculated value scaled so that the total area under the curve equals one; the height of the curve at each point along the x-axis shows how heavily the data are concentrated near that value. For many purposes it is adequate to just consider a density plot to be a smoothed histogram. As one more example, the following figure is a density plot of the bi-modal histogram presented earlier. The density plot makes the bi-modal nature of the data very evident.
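
The claim that the area under a density curve is one can be checked numerically. The following is a minimal sketch that estimates the area under the density curve of the airquality temperatures with the trapezoidal rule. The result should be very close to one, though not exact, because the curve is evaluated on a finite grid of points.

```r
# Estimate the density curve for the New York City temperatures.
d <- density(airquality$Temp)

# Trapezoidal rule: width of each interval times the average of the
# curve heights at its two ends, summed over all intervals.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)
```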

8.3.1 Demonstration: Density Plot

The following script creates a density plot for the Rape vector in the USArrests data frame.

  • Line 2: Create a density plot for USArrests$Rape.
  • Line 3: The main title for the graph.
  • Line 4: The title for the x-axis.
  • Line 5: The title for the y-axis.
  • Line 6: The width of the line to draw. An lwd of 2 is a relatively thick line that is easy to see on a graph.
  • Line 7: The color for this graph is “magenta4”.

    # Density plot of US Arrests for rape.
    plot(density(USArrests$Rape),
      main = "US Rape Arrests",
      xlab = "Arrests",
      ylab = "Density",
      lwd = 2,
      col = "magenta4"
    )

8.3.2 Guided Practice: Density Plot

Using the USJudgeRatings data frame, create a density plot of the INTG vector. The plot should meet these specifications:

  • Title: US Judge Ratings
  • X-axis label: Integrity Rating
  • Y-axis label: Density
  • Lwd: 2
  • Color: blue4
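
One way to meet these specifications is sketched below. The density estimate is stored in d (a name chosen here for illustration) before plotting so that it can be inspected separately; passing density(USJudgeRatings$INTG) directly to plot, as in the demonstration, works equally well.

```r
# Estimate the density of the integrity ratings, then plot it.
d <- density(USJudgeRatings$INTG)
plot(d,
  main = "US Judge Ratings",
  xlab = "Integrity Rating",
  ylab = "Density",
  lwd = 2,
  col = "blue4"
)
```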

8.3.3 Activity: Density Plot

Using the cafe data frame, create a density plot of age. The plot should meet these specifications:

  • Title: Density Plot
  • X-axis label: Age
  • Y-axis label: Density
  • Lwd: 2
  • Color: cadetblue4

Copy/paste the density plot in the deliverable document for this lab.

    cafe <- read.csv('https://labs.basv316.com/cafe.csv')
    # Using the cafe data frame, create a density plot of age that matches the listed specifications.

8.4 Line Graph

Line graphs display the values of a variable as a connected line, which makes trends easier to detect. These graphs are especially useful with what is called “time series” data; that is, data gathered at regular intervals over a period of time. As an example, consider the following figure from the airmiles dataset, which charts the number of passenger miles flown on US commercial airlines each year from 1937 to 1960.

The following figure shows the monthly number of accidental deaths in the United States from 1973 through 1978, taken from the USAccDeaths dataset. This line graph clearly shows a seasonal pattern with more accidental deaths in the summer months than in the winter. Detecting this type of seasonal variation is one of the strengths of a line graph.

As one final example, the following figure shows the quarterly approval rating for US Presidents from 1945 through 1974, taken from the presidents dataset. This line graph shows a very high approval rating in 1945 (Truman at the end of WWII) with dips in the early 1950s (Truman and the Korean Conflict) and about 1974 (Watergate and Nixon’s resignation). Notice that there are gaps in the line (for example, in the late 1940s and the early 1970s) caused by missing data.
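
The gaps in that line graph correspond to NA values in the data. The following sketch lists the quarters with missing ratings. The time function returns each observation's date as a decimal year, so a value ending in .5 is the third quarter of that year.

```r
# Flag the quarters with no recorded approval rating.
missing <- is.na(presidents)

# How many quarters are missing?
print(sum(missing))

# When do the missing quarters occur (as decimal years)?
print(time(presidents)[missing])
```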

8.4.1 Demonstration: Line Graph

Line graphs like those shown above are very useful for detecting a change in some variable, especially over time. The following script creates a line graph of the monthly number of lung disease deaths in the United Kingdom between 1974 and 1979 as found in the ldeaths dataset. It shows a very interesting cycle where there are consistently more deaths in the winter months than the summer months.

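  • Line 2: The plot function is called for the ldeaths dataset. For an ordinary numeric vector, plot draws points by default. Because ldeaths is stored as a time series object, plot would draw a line even without help, but Line 6 in this script specifies a line graph explicitly.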
  • Lines 3-5: This creates the main title of the line graph along with the labels for the x-axis and y-axis.
  • Line 6: This specifies the type of line graph wanted. The possible values are:
    • p for points
    • l for lines (this is a lower-case “L”)
    • b for both
    • c for the lines part alone of “b”
    • o for both ‘overplotted’ (that is a lower-case “O”, not zero)
    • h for ‘histogram-like’ (or ‘high-density’) vertical lines
    • s for stair steps (that is a lower-case “S”)
    • S for other steps (that is a capital “S”)
    • n for no plotting
      Researchers could quickly try several different settings to determine the best way to present the data and students working through this tutorial can change the value on Line 6 to see the different types of plots.
  • Line 7: The lwd parameter sets the width of the line. The default value is 1 so this line specifies a double-width line in order to make it easier to see.
  • Line 8: This sets the color of the line to red in order to set it off from the axis and text.

    # Line Graph
    plot(ldeaths,
      main = "Lung Disease Deaths",
      xlab = "Month",
      ylab = "Number",
      type = "l",
      lwd = 2,
      col = "red"
    )

The above graph shows an interesting cyclic variation in the number of deaths over the course of a year. The years listed on the x-axis indicate January of that year. The number of deaths seems to spike over the winter months and hit a low in about August or September. This type of cycle is very easy to see on this plot but would be difficult to detect in a table full of numbers.
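
Rather than editing the type value one setting at a time, several settings can be compared at once. The following is a minimal sketch that draws the ldeaths data with six of the type values listed above in a 2-by-3 grid. The grid layout and the selection of types are choices made here for illustration.

```r
# Plot types to compare (a subset of the values accepted by plot).
types <- c("p", "l", "b", "o", "h", "s")

# Arrange the plots in a grid of 2 rows and 3 columns,
# remembering the old layout so it can be restored afterward.
old_par <- par(mfrow = c(2, 3))

for (ty in types) {
  plot(ldeaths,
    type = ty,
    main = paste0("type = \"", ty, "\""),
    xlab = "Year",
    ylab = "Deaths"
  )
}

# Restore the original single-plot layout.
par(old_par)
```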

8.4.2 Guided Practice: Line Graph

Using the UKgas dataset, create a line graph that meets these specifications:

  • Plot: UKgas
  • Title: UK Quarterly Gas Consumption
  • X-axis label: Year
  • Y-axis label: Millions of Therms
  • Type: l (this is a lower-case “L”)
  • Lwd: 2
  • Color: lightskyblue
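
One way to meet these specifications is sketched below, following the pattern of the demonstration script.

```r
# Line graph of quarterly UK gas consumption.
plot(UKgas,
  main = "UK Quarterly Gas Consumption",
  xlab = "Year",
  ylab = "Millions of Therms",
  type = "l",
  lwd = 2,
  col = "lightskyblue"
)
```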

8.4.3 Activity: Line Graph

Create a line graph of the lynx data frame. The plot should meet these specifications:

  • Plot: lynx
  • Title: Line Graph
  • X-axis label: Year
  • Y-axis label: Number
  • Type: l (this is a lower-case “L”)
  • Lwd: 2
  • Color: firebrick4

Copy/paste the line graph in the deliverable document for this lab.

    cafe <- read.csv('https://labs.basv316.com/cafe.csv')
    # Create a line graph of the lynx data frame that matches the listed specifications.

8.5 Plot

Plots (often called “scatter plots”) are used to show how two different variables are related. Scatter plots are often used in connection with correlation, where they visually indicate the strength and direction of the relationship between two variables. For example, the following figure is a plot of stopping distance vs. speed from the cars data frame.

The previous figure shows that as a car’s speed increases the stopping distance also increases, which is exactly what would be expected. (Note: these data were gathered on cars in the 1920s.) Often, a line of best fit is included with a plot to better visualize the relationship between the two variables, as illustrated in the following figure.

As a second example, the following figure is a plot of the eruption time for the Old Faithful geyser, taken from the faithful data frame, as a function of the waiting time between eruptions.

The previous figure shows that as the time between eruptions increases the time that the eruption lasts also increases. Notice that this scatter plot also suggests that the data are bi-modal since there are two clusters of points and a researcher would want to explore that matter before doing much else with the data.

8.5.1 Plot Point Types

R has 25 different types of symbols that could be used for the points on a plot as shown on the following chart.

Symbols numbered 1-20 are a single color but symbols 21-25 have two colors and both the background and line color can be specified. Using black for the line color (the default) and a background color that has high contrast with the plot area makes the points easier to see on a graph. The point symbol used is specified in the pch= attribute as demonstrated in the following plots.
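
The two-color symbols can be previewed with a short script. The following is a minimal sketch that draws symbols 21 through 25 with a black outline and a filled background; the “gold” background, the enlarged point size (cex), and the variable name pch_values are arbitrary choices for illustration.

```r
# The five point symbols that accept both a line color and a background color.
pch_values <- 21:25

# Draw each symbol once, side by side, with a black outline and gold fill.
plot(pch_values, rep(1, length(pch_values)),
  pch = pch_values,
  col = "black",
  bg = "gold",
  cex = 3,
  main = "Two-Color Point Symbols",
  xlab = "pch value",
  ylab = "",
  yaxt = "n"
)
```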

8.5.2 Demonstration: Plot

The following script creates a simple plot using the swiss data frame.

  • Line 2: This starts the plot function. Like many R functions, plot accepts a formula in the form of y ~ x where the dependent variable (the y-axis) is first in the formula and the independent variable (the x-axis) is second. For this plot, Education is the independent variable and will be on the x-axis while Fertility is the dependent variable on the y-axis. The researcher was answering the question “Does education level affect the number of children people have?”
  • Lines 3-5: This creates the main title and the labels for the x-axis and y-axis.
  • Line 6: The pch attribute is the type of point used on the graph. R has a number of pre-defined types of points, as illustrated above, and for this graph type number 21 is selected.
  • Line 7: The background color of the points is specified using the bg attribute. In this case it is a shade of chartreuse, a yellow-green hue.
  • Line 8: The line color of the circle is specified using the col attribute.

    # Simple Plot
    plot(swiss$Fertility ~ swiss$Education,
      main = "Swiss Indicators",
      xlab = "Education",
      ylab = "Fertility",
      pch = 21,
      bg = "chartreuse3",
      col = "black"
    )

The following script creates a plot with a line of best fit.

  • Lines 2-9: This demonstration plots two vectors from the attitude data frame. Rating is the independent variable on the x-axis while complaints is the dependent variable on the y-axis. The researcher was answering the question “Do employees with higher overall ratings handle complaints better?” The other parameters of this plot function were described for the previous plot and are not discussed further here.
  • Line 10: This starts the abline function, which ends on Line 13. The abline function draws a straight line on an existing plot; it is named for the a (intercept) and b (slope) in the equation y = a + bx. In this case, the line to be drawn is calculated with the lm function, which calculates the slope and y-intercept of the line of best fit for the two specified vectors. The lm function was introduced in the tutorial about Correlation and Regression.
  • Lines 11-12: These set the width and color of the line of best fit as drawn on the plot.

    # Line of Best Fit
    plot(attitude$complaints ~ attitude$rating,
      main = "Attitude Data",
      xlab = "Overall Rating",
      ylab = "Complaints",
      pch = 21,
      bg = "khaki2",
      col = "black"
    )
    abline(lm(attitude$complaints ~ attitude$rating),
      lwd = 2,
      col = "darkseagreen4"
    )
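
It can help to look at the two numbers that lm hands to abline. The following is a minimal sketch that fits the same model, prints its intercept and slope, and then draws the line by passing those two values to abline explicitly. The variable names fit, intercept, and slope are chosen here for illustration, and the data argument is used as a shorter alternative to writing attitude$ in the formula.

```r
# Fit the line of best fit for complaints as a function of rating.
fit <- lm(complaints ~ rating, data = attitude)

# The model has two coefficients: the y-intercept and the slope.
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["rating"]]
print(c(intercept = intercept, slope = slope))

# Draw the scatter plot, then add the line from its intercept and slope.
plot(complaints ~ rating, data = attitude)
abline(a = intercept, b = slope, lwd = 2, col = "darkseagreen4")
```

Passing the fitted model directly, as in abline(lm(...)), produces the same line; the explicit form only makes the two values visible.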

8.5.3 Guided Practice: Plot

Using the rock data frame, create a plot of area (y-axis) as a function of peri (x-axis). The graph should meet these specifications:

Plot:

  • Title: Rock Area by Perimeter
  • X-axis label: Perimeter
  • Y-axis label: Area
  • Pch: 21
  • Bg: goldenrod3
  • Color: black

Abline:

  • Lwd: 2
  • Color: violetred3
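
One way to meet these specifications is sketched below. The fitted model is stored in fit (a name chosen here for illustration) so that it can be inspected; passing lm(rock$area ~ rock$peri) directly to abline works equally well.

```r
# Scatter plot of rock area as a function of perimeter.
plot(rock$area ~ rock$peri,
  main = "Rock Area by Perimeter",
  xlab = "Perimeter",
  ylab = "Area",
  pch = 21,
  bg = "goldenrod3",
  col = "black"
)

# Line of best fit for the same two vectors.
fit <- lm(rock$area ~ rock$peri)
abline(fit,
  lwd = 2,
  col = "violetred3"
)
```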

8.5.4 Activity: Line of Best Fit

Using the cafe data frame, create a simple plot with bill (y-axis) as a function of length (x-axis), and a line of best fit to see if there is a relationship between the length of the meal and the bill. The plot should meet these specifications:

Plot:

  • Title: Line of Best Fit
  • X-axis label: Length of Meal
  • Y-axis label: Bill
  • Pch: 23
  • Bg: indianred4
  • Color: black

Abline:

  • Lwd: 2
  • Color: mediumblue

Copy/paste the plot in the deliverable document for this lab.

    cafe <- read.csv('https://labs.basv316.com/cafe.csv')
    # Using the cafe dataset, create a simple plot of bill (y) and length (x) plus a line of best fit that matches the listed specifications.

8.6 Q-Q Plot

Very often, researchers in the exploratory phase of a project need to know whether a variable is normally distributed. Creating a histogram or density plot is very helpful in that regard, but a Q-Q (“Quantile-Quantile”) plot is the typical method used to judge whether data are normally distributed. The following figure is a Q-Q plot of the New York City temperatures in the airquality data frame.

A perfectly normal distribution would generate a straight-line Q-Q plot, and the blue line in the previous figure represents that ideal. Interpreting a Q-Q plot is more art than science, but as long as most values are near the ideal line the data are considered normally distributed. The following figure is a Q-Q plot for the Old Faithful eruption times, as plotted earlier in this tutorial.

The above plot shows a typical bi-modal pattern. On the left side of the plot is a reasonably flat area from -3 to -0.5 on the x-axis and then the plot skips upward and creates a second reasonably flat area between 0.5 and 3. Imagine two parallel lines running through the lower and upper parts of the plot and then notice that the green “ideal” line does not get very close to the slope of either of those two lines. This Q-Q plot shows that the data are not normally distributed.

The following figure shows a curved Q-Q plot that is typical for data that are skewed. In this case, the plotted sunspot data have a heavy positive skew, which is indicated by the long “tail” on the left side of the plot. That skew should be verified with a histogram or density plot.
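
The suggested verification can be sketched as follows. This tutorial does not say which sunspot dataset was plotted, so the built-in sunspot.year series is used here as an assumption. The script draws the Q-Q plot and a histogram side by side and compares the mean to the median; a mean well above the median is one simple sign of positive skew.

```r
# Assumption: sunspot.year stands in for the sunspot data in the figure.
x <- as.numeric(sunspot.year)

# Draw the Q-Q plot and the histogram next to each other.
old_par <- par(mfrow = c(1, 2))
qqnorm(x, main = "QQ Plot: Sunspots")
qqline(x, lwd = 2, col = "blue")
hist(x, main = "Histogram: Sunspots", xlab = "Yearly Sunspot Number")
par(old_par)

# With a positive skew the long right tail pulls the mean above the median.
print(c(mean = mean(x), median = median(x)))
```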

8.6.1 Demonstration: Q-Q Plot

The following script creates a Q-Q plot for normally distributed data.

  • Lines 2-4: This executes the qqnorm function and passes that function the weight variable from the chickwts data frame. This function draws the Q-Q plot. The only parameter needed is main, which adds the title to the plot.
  • Lines 6-9: This executes the qqline function and passes that function the weight variable from the chickwts data frame. This function draws the straight line that indicates a perfect Q-Q plot. The only parameters passed to the function in this script set the line width to 2 and the color to blue, in the same way as for the scatter plots above.

    # Q-Q Plot
    qqnorm(chickwts$weight,
      main = "QQ Plot: Chick Weights"
    )
    # Q-Q Line
    qqline(chickwts$weight,
      lwd = 2,
      col = "blue"
    )

8.6.2 Guided Practice: Q-Q Plot

Using the swiss data frame, create a Q-Q plot of the Fertility vector. The plot should have a main title of QQ Plot: Swiss Fertility and a red Q-Q line with a width of 2.

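
One way to meet these specifications is sketched below. The return value of qqnorm is stored in qq (a name chosen here for illustration) only so that the plotted points can be inspected afterward.

```r
# Q-Q plot of the Fertility vector.
qq <- qqnorm(swiss$Fertility,
  main = "QQ Plot: Swiss Fertility"
)

# Reference line for a perfectly normal distribution.
qqline(swiss$Fertility,
  lwd = 2,
  col = "red"
)
```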

8.6.3 Activity: Q-Q Plot

Using the cafe data frame, create a Q-Q plot of age. The plot should have a title of Q-Q Plot and a sienna4 Q-Q line with a width of 2.

Copy/paste the plot in the deliverable document for this lab.

    cafe <- read.csv('https://labs.basv316.com/cafe.csv')
    # Using the cafe dataset, create a Q-Q plot of age that matches the listed specification.

8.7 Deliverable

Complete the activities in this lab and consolidate the responses into a single document. Name the document with your name and “Lab 8,” like “George Self Lab 8,” and submit that document for grading.