B About The Data Frames
The lab exercises in this book use a number of data sets, most of them data frames, and this appendix lists basic information about each one.
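As a quick orientation, the sketch below checks what kind of R object a few of the entries in this appendix are. It assumes these names resolve to the data sets of the same name in R's built-in `datasets` package, which is attached by default; `cafe` is simulated data and is left out. The names `examples` and `kinds` are illustrative only.

```r
# A few of the data set names described in this appendix.
# Assumption: these are the objects from R's built-in 'datasets' package.
examples <- c("airquality", "airmiles", "rivers", "state.division", "Titanic")

# Look each object up by name and record the first element of its class.
kinds <- sapply(examples, function(nm) class(get(nm))[1])
print(kinds)

# For a data frame, str() lists every column together with its type.
str(airquality)
```

Entries whose class is not `data.frame` (time series, vectors, factors, and tables) are described in prose rather than with a variable table.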
B.1 airmiles
Passenger Miles on Commercial US Airlines, 1937-1960. The revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960. This is a time series of length 24; annually, 1937-1960 (from the FAA Statistical Handbook of Aviation).
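Time series objects such as this one carry their own time index. A minimal sketch of reading it back, assuming the built-in `airmiles` object; `first_year`, `last_year`, and `per_year` are illustrative names.

```r
# start() and end() return c(year, period); keep only the year.
first_year <- start(airmiles)[1]
last_year  <- end(airmiles)[1]

# frequency() is the number of observations per year (1 for annual data).
per_year <- frequency(airmiles)

cat("Years:", first_year, "to", last_year, "\n")
cat("Observations per year:", per_year, "\n")
cat("Length:", length(airmiles), "\n")
```

The same three functions apply to the other time series in this appendix.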
B.2 airquality
New York Air Quality Measurements. Daily air quality measurements in New York, May to September 1973. This is a data frame with 153 observations on 6 variables:
Name | Type | Description |
---|---|---|
Ozone | int | Ozone (ppb) |
Solar.R | int | Solar radiation (langleys) |
Wind | numeric | Wind speed (mph) |
Temp | numeric | Temperature (degrees F) |
Month | numeric | Month (1-12) |
Day | numeric | Day of month (1-31) |
B.3 attenu
The Joyner-Boore Attenuation Data. These data give peak accelerations measured at various observation stations for 23 earthquakes in California. The data have been used by various workers to estimate the attenuating effect of distance on ground acceleration. This is a data frame with 182 observations on 5 variables:
Name | Type | Description |
---|---|---|
event | numeric | Event number |
mag | numeric | Moment magnitude |
station | fac | Station number (117 levels) |
dist | numeric | Station-hypocenter distance (km) |
accel | numeric | Peak acceleration (g) |
B.4 attitude
The Chatterjee-Price Attitude Data. From a survey of the clerical employees of a large financial organization, the data are aggregated from the questionnaires of the approximately 35 employees for each of 30 (randomly selected) departments. The numbers give the percentage of favourable responses to seven questions in each department. This is a data frame with 30 observations on 7 variables:
Name | Type | Description |
---|---|---|
rating | numeric | Overall rating |
complaints | numeric | Handling of employee complaints |
privileges | numeric | Does not allow special privileges |
learning | numeric | Opportunity to learn |
raises | numeric | Raises based on performance |
critical | numeric | Too critical |
advance | numeric | Advancement |
B.5 beaver
Body Temperature Series of Two Beavers. Reynolds (1994) describes a small part of a study of the long-term temperature dynamics of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but only data from one period of less than a day for each of two animals are used here. There are two data frames, each holding body temperature measurements at 10-minute intervals: beaver1 has 114 rows and 4 columns, and beaver2 has 100 rows and 4 columns. Both have the following variables:
Name | Type | Description |
---|---|---|
day | numeric | Day of observation (days since 1990) |
time | numeric | Time of observation (0330 for 3:30 am) |
temp | numeric | Measured body temperature (Celsius) |
activ | numeric | Activity outside the retreat (0, 1) |
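The `time` column stores clock time as a number, so 0330 is held as 330. A minimal sketch of splitting it into hours and minutes with integer division, assuming the built-in `beaver1` and `beaver2` objects; `hour`, `minute`, and `sizes` are illustrative names.

```r
# Integer division and remainder by 100 separate hours from minutes.
hour   <- beaver1$time %/% 100
minute <- beaver1$time %% 100

# Show the first few readings with the decoded clock time alongside.
print(head(data.frame(time = beaver1$time, hour, minute, temp = beaver1$temp)))

# Both data frames share the same columns; compare their dimensions.
sizes <- sapply(list(beaver1 = beaver1, beaver2 = beaver2), dim)
print(sizes)
```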
B.6 cafe
This is simulated data. Customers of the Main Street Cafe completed surveys over a one week period. This is a data frame with observations on 13 variables:
Name | Type | Description |
---|---|---|
sex | fac | Sex (3 levels: male, female, other) |
age | int | Age |
day | fac | Day (7 levels: Monday, Tuesday, etc.) |
meal | fac | Meal (4 levels: breakfast, lunch, dinner, other) |
length | int | Length of meal (minutes) |
miles | int | Miles driven to cafe |
pref | fac | Seating preference (2 levels: booth, table) |
ptysize | int | Number of people in party |
food | int | Rating for food (ord levels: 1-5 ‘stars’) |
svc | int | Rating for service (ord levels: 1-5 ‘stars’) |
recmd | fac | Would recommend to a friend (2 levels: yes, no) |
bill | numeric | Bill (dollars and cents) |
tip | int | Amount of tip (whole dollars) |
B.7 cars
Speed and Stopping Distances of Cars. The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s. This is a data frame with 50 observations on 2 variables:
Name | Type | Description |
---|---|---|
speed | numeric | Speed (mph) |
dist | numeric | Stopping distance (ft) |
B.8 ChickWeight
Weight Versus Age of Chicks on Different Diets. The ChickWeight data frame was generated from an experiment on the effect of diet on early growth of chicks. This is a data frame with 578 observations on 4 variables:
Name | Type | Description |
---|---|---|
weight | numeric | Weight (grams) |
Time | numeric | Time (days since birth) |
Chick | ord | A unique identifier for each chick |
Diet | fac | A factor of 1-4 indicating the diet |
B.9 chickwts
Chicken Weights by Feed Type. An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. This is a data frame with 71 observations on 2 variables:
Name | Type | Description |
---|---|---|
weight | numeric | Weight (units unknown) |
feed | fac | Feed type |
B.10 CO2
Carbon Dioxide Uptake in Grass Plants. This data set is from an experiment on the cold tolerance of the grass species Echinochloa crusgalli. This is a data frame with 84 observations on 5 variables:
Name | Type | Description |
---|---|---|
Plant | ordered | Unique identifier of each plant (12 levels) |
Type | fac | Origin of the plant (Quebec, Mississippi) |
Treatment | fac | Type of treatment (nonchilled, chilled) |
conc | numeric | Ambient CO2 concentration (mL/L) |
uptake | numeric | CO2 uptake rate (umol/sq meter/sec) |
B.11 esoph
Smoking, Alcohol and (O)esophageal Cancer. Data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. This is a data frame with 88 observations on 5 variables:
Name | Type | Description |
---|---|---|
agegp | ordered | Age group (6 levels) |
alcgp | ordered | Alcohol consumption (4 levels) |
tobgp | ordered | Tobacco consumption (4 levels) |
ncases | numeric | Number of cases |
ncontrols | numeric | Number of controls |
B.12 faithful
Old Faithful Geyser Data. Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. This is a data frame with 272 observations on 2 variables:
Name | Type | Description |
---|---|---|
eruptions | numeric | Eruption time (minutes) |
waiting | numeric | Waiting time to next eruption (minutes) |
B.13 infert
Infertility Study. This is a matched case-control study. This is a data frame with 248 observations on 8 variables:
Name | Type | Description |
---|---|---|
education | factor | Education (0=0-5yrs, 1=6-11yrs, 2=12+yrs) |
age | numeric | Age in years |
parity | numeric | Parity |
induced | numeric | Number of prior induced abortions (0=0, 1=1, 2=2+) |
case | numeric | Status (0=control, 1=case) |
spontaneous | numeric | Number of prior spontaneous abortions (0=0, 1=1, 2=2+) |
stratum | integer | Matched Set Number 1-83 |
pooled.stratum | numeric | Stratum Number 1-63 |
B.14 iris
Edgar Anderson’s Iris Data. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica. This is a data frame with 150 observations on 5 variables:
Name | Type | Description |
---|---|---|
Sepal.Length | numeric | Sepal length (cm) |
Sepal.Width | numeric | Sepal width (cm) |
Petal.Length | numeric | Petal length (cm) |
Petal.Width | numeric | Petal width (cm) |
Species | factor | Species (3 levels) |
B.15 LakeHuron
Level of Lake Huron 1875-1972. Annual measurements of the level, in feet, of Lake Huron 1875-1972. This is a time series of length 98.
B.16 ldeaths
Monthly Deaths from Lung Diseases in the UK. Monthly deaths recorded from bronchitis, emphysema and asthma in the UK. This is a time series of length 72, 1974-1979.
B.17 longley
Longley’s Economic Regression Data. This data set is a macroeconomic data set which provides a well-known example for a highly collinear regression. This is a data frame with 7 variables observed annually from 1947 to 1962 (n = 16):
Name | Type | Description |
---|---|---|
GNP.deflator | numeric | GNP implicit price deflator (1954=100) |
GNP | numeric | Gross National Product |
Unemployed | numeric | Number unemployed |
Armed.Forces | numeric | Number in armed forces |
Population | numeric | Population > 14 years of age |
Year | numeric | Year |
Employed | numeric | Number employed |
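The collinearity mentioned in the description can be seen by looking at pairwise correlations among the predictors. This is only an illustrative check on three columns, assuming the built-in `longley` object; `predictors` and `cor_matrix` are hypothetical names, and a fuller diagnosis would look at all predictors together.

```r
# Pick three of the predictors that tend to rise together over time.
predictors <- longley[, c("GNP", "Population", "Year")]

# Pairwise Pearson correlations, rounded for display.
cor_matrix <- round(cor(predictors), 3)
print(cor_matrix)
```

Off-diagonal values close to 1 indicate that these predictors carry nearly the same information, which is what makes regression coefficients for this data set unstable.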
B.18 lynx
Annual numbers of lynx trappings for 1821–1934 in Canada. This is a time series of length 114, 1821-1934.
B.19 morley
Michelson Speed of Light Data. Classical data from Michelson's 1879 measurements of the speed of light (not from his experiment with Morley). The data consist of five experiments, each consisting of 20 consecutive “runs”. The response is the speed of light measurement, suitably coded (km/sec, with 299,000 subtracted). This is a data frame with 100 observations on 3 variables:
Name | Type | Description |
---|---|---|
Expt | int | The experiment number, from 1 to 5 |
Run | int | The run number within each experiment |
Speed | int | Speed-of-light measurement |
B.20 mtcars
Motor Trend Car Road Tests. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). This is a data frame with 32 observations on 11 variables:
Name | Type | Description |
---|---|---|
mpg | numeric | Miles/(US) gallon |
cyl | numeric | Number of cylinders |
disp | numeric | Displacement (cu. in.) |
hp | numeric | Gross horsepower |
drat | numeric | Rear axle ratio |
wt | numeric | Weight (1000 lbs) |
qsec | numeric | Quarter-mile time |
vs | numeric | V-shaped (0) or straight (1) engine |
am | numeric | Automatic (0) or manual (1) transmission |
gear | numeric | Number of forward gears |
carb | numeric | Number of carburetors |
B.21 npk
Classical N, P, K Factorial Experiment. A classical N, P, K (nitrogen, phosphate, potassium) factorial experiment on the growth of peas conducted on 6 blocks. Each half of a fractional factorial design confounding the NPK interaction was used on 3 of the plots. This is a data frame with 24 observations on 5 variables:
Name | Type | Description |
---|---|---|
block | factor | Which block (6 levels) |
N | factor | Indicator for nitrogen (No, Yes) |
P | factor | Indicator for phosphate (No, Yes) |
K | factor | Indicator for potassium (No, Yes) |
yield | numeric | Yield of peas (pounds/plot) |
B.22 PlantGrowth
Results from an Experiment on Plant Growth. Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions. This is a data frame with 30 observations on 2 variables:
Name | Type | Description |
---|---|---|
weight | numeric | Dried weight of plants (units unknown) |
group | factor | Treatment group (3 levels) |
B.23 presidents
Quarterly Approval Ratings of US Presidents. The (approximate) quarterly approval rating for the President of the United States from the first quarter of 1945 to the last quarter of 1974. This is a time series of length 120.
B.24 rivers
Lengths of Major North American Rivers. This data set gives the lengths (in miles) of 141 “major” rivers in North America, as compiled by the US Geological Survey. This is a vector with 141 observations.
B.25 rock
Measurements on Petroleum Rock Samples. This data set contains measurements on 48 rock samples from a petroleum reservoir. This is a data frame with 48 observations on 4 variables:
Name | Type | Description |
---|---|---|
area | int | Area of pores (pixels in 256 X 256) |
peri | numeric | Perimeter in pixels |
shape | numeric | Perimeter/square root of area |
perm | numeric | Permeability in milli-Darcies |
B.26 sleep
Student’s Sleep Data. Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. This is a data frame with 20 observations on 3 variables:
Name | Type | Description |
---|---|---|
extra | numeric | Increase in hours of sleep |
group | fac | Drug given (2 levels) |
ID | fac | Patient ID (10 levels) |
B.27 stackloss
Brownlee’s Stack Loss Plant Data. Operational data of a plant for the oxidation of ammonia to nitric acid. This is a data frame with 21 observations on 4 variables:
Name | Type | Description |
---|---|---|
Air.Flow | numeric | Flow of cooling air |
Water.Temp | numeric | Cooling water inlet temperature |
Acid.Conc. | numeric | Concentration of acid [per 1000, minus 500] |
stack.loss | numeric | Stack loss |
B.28 state.division
US State Facts and Figures. Data sets related to the 50 states of the United States of America. This is a factor giving the state division (New England, Middle Atlantic, South Atlantic, East North Central, East South Central, West North Central, West South Central, Mountain, Pacific).
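A factor like this one is usually combined with a companion vector in the same state order. The sketch below assumes the built-in `state.division` and `state.name` objects from R's `datasets` package; `division_counts` and `pacific_states` are illustrative names.

```r
# One entry per state, in the same order as state.name.
length(state.division)

# Count how many states fall in each division.
division_counts <- table(state.division)
print(division_counts)

# Use the factor to pick out the names of the states in one division.
pacific_states <- state.name[state.division == "Pacific"]
print(pacific_states)
```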
B.29 sunspots
Monthly Sunspot Numbers, 1749-1983. Monthly mean sunspot numbers from 1749 to 1983. Collected at Swiss Federal Observatory, Zurich until 1960, then Tokyo Astronomical Observatory. This is a time series of length 2820.
B.30 swiss
Swiss Fertility and Socioeconomic Indicators (1888) Data. Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. This is a data frame with 47 observations on 6 variables, each of which is in percent (0-100):
Name | Type | Description |
---|---|---|
Fertility | numeric | Common fertility measure |
Agriculture | numeric | % males involved in agriculture |
Examination | int | % receiving high mark on army exam |
Education | int | % education beyond primary school |
Catholic | numeric | % Catholic (as opposed to Protestant) |
Infant.Mortality | numeric | Live births who live less than 1 year |
B.31 Titanic
Survival of passengers on the Titanic. This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner “Titanic,” summarized according to economic status (class), sex, age and survival. This is a 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:
Num | Name | Levels |
---|---|---|
1 | Class | 1st, 2nd, 3rd, Crew |
2 | Sex | Male, Female |
3 | Age | Child, Adult |
4 | Survived | No, Yes |
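Because this is a cross-tabulated array and not a data frame, it is often convenient to flatten it or to collapse it over some of its dimensions. A minimal sketch, assuming the built-in `Titanic` object; `titanic_df`, `total`, and `survival_by_class` are illustrative names.

```r
# One dimension per variable, in the order listed in the table above.
print(dim(Titanic))

# Flatten to a data frame: one row per combination of levels,
# with the cell count in a column named Freq.
titanic_df <- as.data.frame(Titanic)
print(head(titanic_df))

# The cell counts add up to the number of people cross-tabulated.
total <- sum(titanic_df$Freq)
print(total)

# Collapse over Sex and Age, keeping Class (dimension 1) and Survived (dimension 4).
survival_by_class <- margin.table(Titanic, c(1, 4))
print(survival_by_class)
```

The same approach works for the other cross-tabulated array in this appendix.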
B.32 trees
Girth, Height and Volume for Black Cherry Trees. This data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground. This is a data frame with 31 observations on 3 variables:
Name | Type | Description |
---|---|---|
Girth | numeric | Tree diameter (inches) |
Height | numeric | Tree height (feet) |
Volume | numeric | Volume of timber (cubic ft) |
B.33 UCBAdmissions
Student Admissions at UC Berkeley. Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex. This is a 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables. The variables and their levels are as follows:
Num | Name | Levels |
---|---|---|
1 | Admit | Admitted, Rejected |
2 | Gender | Male, Female |
3 | Dept | A, B, C, D, E, F |
B.34 UKgas
UK Quarterly Gas Consumption. Quarterly UK gas consumption from 1960Q1 to 1986Q4, in millions of therms. This is a time series of length 108.
B.35 USAccDeaths
Accidental Deaths in the US 1973-1978. The monthly totals of accidental deaths in the USA. This is a time series of length 72.
B.36 USArrests
Violent Crime Rates by US State. This data set contains statistics about the arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. This is a data frame with 50 observations on 4 variables:
Name | Type | Description |
---|---|---|
Murder | numeric | Murder arrests per 100,000 |
Assault | integer | Assault arrests per 100,000 |
UrbanPop | integer | Percent urban population |
Rape | numeric | Rape arrests per 100,000 |
B.37 USJudgeRatings
Lawyers’ Ratings of State Judges in the US Superior Court. This is a data frame with 43 observations on 12 variables:
Name | Type | Description |
---|---|---|
CONT | numeric | Number of contacts of lawyer with judge |
INTG | numeric | Judicial integrity |
DMNR | numeric | Demeanor |
DILG | numeric | Diligence |
CFMG | numeric | Case flow managing |
DECI | numeric | Prompt decisions |
PREP | numeric | Preparation for trial |
FAMI | numeric | Familiarity with law |
ORAL | numeric | Sound oral rulings |
WRIT | numeric | Sound written rulings |
PHYS | numeric | Physical ability |
RTEN | numeric | Worthy of retention |
B.38 warpbreaks
The Number of Breaks in Yarn During Weaving. This data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. This is a data frame with 54 observations on 3 variables:
Name | Type | Description |
---|---|---|
breaks | numeric | Number of breaks |
wool | fac | Type of wool (levels A, B) |
tension | fac | Tension on wool (levels L, M, H) |