These data sets may be used for any non-commercial purpose. Here are
some definitions.
- mcars4.data: A constructed data set based
roughly on the variables in an SPSS data set.
- mcars4b.data.txt: mcars4.data without the
header that R uses.
- mcars4.xlsx: mcars4.data in an Excel spreadsheet.
- mathtest.txt: Before the beginning of the
Fall term, students in a first-year Calculus class took a diagnostic test
with two parts: Pre-calculus and Calculus. Their High School Calculus
marks and their marks in University Calculus were also available. In
order, the variables in the data file are: Identification code, Mark in
High School Calculus, Score on the Pre-calculus portion of the diagnostic
test, Score on the Calculus portion of the diagnostic test, and mark in
University Calculus. Thanks to Dr. Cleo Boyd for permission to use these
original data.
- math.data.txt: This is a fuller version of the data given in
mathtest.txt above. Thanks again to Cleo Boyd. The variables are
- Identification code
- Course: 1=Catch-up 2=Mainstream 3=Elite 4=NoResponse
- Score on pre-calculus part of diagnostic test
- Score on calculus part of diagnostic test
- High School GPA
- High School Calculus mark
- High School English mark
- University Calculus mark
- First language
- Sex
- National background according to rater one
- National background according to rater two
- Chinese
- Japanese
- Korean
- Vietnamese
- Other Asian
- Eastern European
- Hispanic
- English-speaking
- French
- Italian
- Greek
- Germanic
- Other European
- Middle-Eastern
- Pakistani
- East Indian
- Sub-Saharan
- OTHER or DK
- Sample: 1=Exploratory, 2=Replication (Students were randomly divided into Exploratory and Replication samples.)
Here are the Exploratory and Replication data in two separate files. The exploratory data set has a collapsed coding of nationality.
- math1.data.txt: A cut down, R friendly version of exploremath.data.txt. The last 3 columns of exploremath.data.txt have been omitted. These are real data, with 776 missing values coded as NA. These data have rough edges that have deliberately not been fixed.
- mathcat.data.txt: An even more cut
down and sanitized version of exploremath.data.txt, with no missing values
and categorical outcomes. The variables are
- hsgpa
- hsengl
- hscalc
- course (Catch-up, Elite, Mainstrm)
- passed (No, Yes)
- outcome (Passed, Failed, Disappeared)
- mathcat-replic.data.txt: Like
mathcat, but for the replication sample.
- marks.data.txt: Quiz average, Average
on computer assignments, Midterm and Final from a class years ago. These
are original data.
- openSAT.data.txt: SAT Verbal, SAT
Math and first year GPA. This is a reconstructed data set based on a
Minitab data set.
- Babydouble.data.txt: Simulated
W1, W2, Y for an easy double measurement regression
example.
- bmihealth.data.txt: A
constructed data set illustrating the advantages of the double
measurement design. Health-related information was obtained from
participants independently, in two sets of measurements. Need to give
more information!
- openpigs.data.txt: Reconstructed data based on a data set in Fuller's Measurement error models.
- Col 1: Farm
- Col 2: Number breeding sows on farm June 1: Questionnaire One
- Col 3: Number sows giving birth June 1 to Aug. 31:
Questionnaire One
- Col 4: Number breeding sows on farm June 1: Questionnaire Two
- Col 5: Number sows giving birth June 1 to Aug. 31:
Questionnaire Two
- Circle.data.txt: Data were simulated
from a model that fails the Parameter Count Rule, but the parameters are
all identifiable where β1 = β2 = 0. The variables Y1 and Y2 were
generated with true β1^2 + β2^2 = 1, while Y3 and Y4 were generated with
true β1 = β2 = 0.
- timmy1.data.txt: A major Canadian
coffee shop chain is trying to break into the U.S. market. They assess
the following variables twice on a random sample of coffee-drinking
adults. The two measurements of each variable are conducted at different
times by different interviewers asking somewhat different questions, in
such a way that the errors of measurement may be assumed independent. The
variables are
- Brand Awareness: Familiarity with the coffee shop chain
- Advertising Awareness: Recall for advertising of the coffee shop chain
- Interest in the product category: Mostly this was how much they
say they like doughnuts.
- Purchase Intention: Expressed willingness to go to an outlet of
the coffee shop chain and place an order.
- Purchase behaviour: Reported dollars spent at the chain during the
2 months following the interview.
All variables were measured on a scale from 0 to 100 except purchase behaviour, which is in dollars.
w1 = Brand Awareness 1
w2 = Brand Awareness 2
w3 = Ad Awareness 1
w4 = Ad Awareness 2
w5 = Interest 1
w6 = Interest 2
v1 = Purchase Intention 1
v2 = Purchase Intention 2
v3 = Purchase Behaviour 1
v4 = Purchase Behaviour 2
- fish.txt: Original data submitted to the
Journal of Statistics Education by Juha Puranen, Department of
Statistics, University of Helsinki, Finland.
- fish.data.txt: The fish data without the
description.
- LittleStatclassdata.txt: Quiz
average, Computer assignment average, Midterm score and Final Exam score from
a statistics class, long ago.
- LittleStatClassData2.txt:
LittleStatclassdata.txt with two additional categorical variables: Sex
and Race. These are just my guesses, of course.
- LittleStatClassData2b.txt:
LittleStatClassData2.txt without the header that R uses.
- statclass.data.txt: The full
version of the statclass data, with individual quiz scores and computer
assignment scores.
- pigweight.data.txt: Pigs are
routinely given large doses of antibiotics even when they show no signs
of illness, to protect their health under unsanitary conditions. Pigs
were randomly assigned to one of three antibiotics. Dressed weight
(weight of the pig after slaughter and removal of head, intestines and
skin) was the response variable. Mother's and father's live adult weight
were used as covariates. This is a simulated data set.
- ScabDisease.data.txt: This is a
reconstructed data set based on an example in Cochran and Cox's (1958)
classic text Experimental Designs. The original data appear on
page 97 of Cochran and Cox's book.
Scab disease is a fungal infection that affects potatoes. The fungus
does not grow well in acidic soil, so investigators designed a study to
see whether adding sulphur to the soil would reduce the scab disease. In
a completely randomized design, plots of land were randomly assigned to
either a control condition or to several levels of sulphur that was
spread on the land in the Spring or Fall. The amounts of sulphur were
either 300 pounds per acre, 600 pounds per acre or 1200 pounds per
acre. The potatoes were harvested at the end of the growing season. One
hundred potatoes were randomly selected from each plot of land. The
potatoes were washed, and then a lab assistant estimated the percent of
each potato's surface that was infected with scab disease. The response
variable is, for each plot of land, the mean percent of the potato's
surface covered with scab disease. The explanatory variable is pounds of
sulphur, in hundreds of pounds; the control is zero.
- ScabDisease.xlsx: The reconstructed scab disease data in a MS Excel spreadsheet.
- ChickWeight.data.txt: This is an R
dataset called chickwts. In the Chick Weights study, newly
hatched chickens were randomly assigned to one of six different feed
supplements, and their weight in grams after 6 weeks was recorded.
- ChickWeight.xlsx: The Chick
Weight data as an Excel spreadsheet.
- ChickenWeight.data.txt: This is getting confusing. This is the R data set ChickWeight, as distinct from chickwts. Borrowing some words from the R documentation,
These data are from an experiment on the effect of diet on early growth of chicks. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets. The variables are
- weight: Body weight in grams
- Time: Days since birth
- Chick: Identification code
- Diet: 1, 2, 3 or 4
- sales.data.txt: Telephone sales
representatives use computer software to help them locate potential
customers, answer questions, take credit card information and place
orders. Twelve sales representatives were randomly assigned to each of
three new software packages the company was thinking of purchasing. The
data for each sales representative include the software package (1, 2 or
3), sales last quarter with the old software, and sales this quarter with
one of the new software packages. Sales are in number of units sold. This
is a constructed data set that illustrates interactions.
- sales.data.xlsx: The telephone sales
data as an Excel spreadsheet.
- bunnies.data.txt:
An experiment in dentistry was designed to test the effectiveness of a
drug (HEBP) that is supposed to help dental implants become more firmly
attached to the jaw bone. This is an initial test on animals. False teeth
were implanted into the leg bones of rabbits, and the rabbits were randomly
assigned to receive either the drug or a saline solution
(placebo). Technicians administering the drug were blind to experimental
condition.
Rabbits were also randomly assigned to be "sacrificed" after either 3,
6, 9 or 12 days. At that time, the implants were pulled out of the bone by
a machine that measures force in newtons and stiffness in newtons/mm. For
both of these measurements, higher values indicate more healing. A measure
of "pre-load stiffness" in newtons/mm is also available for each
animal. This may be another indicator of how firmly the false tooth was
implanted into the bone, but it might even be a covariate. Nobody can seem
to remember what "preload" means.
The variables are
- Identification code
- Time (3,6,9,12 days of healing)
- Drug (1=HEBP, 0=saline solution)
- Stiffness in newtons/mm
- Force in newtons
- Preload stiffness in newtons/mm
These are original data. Thanks to Dr. E. Fudd of the University of
Transylvania, who has given permission to use these data for non-commercial
purposes.
- mixedup.data.txt: Data from a
balanced 3x4 design -- no content. They are useful for illustrating
nested and mixed models.
- noise.data.txt: Participants
listened to brief political discussions under 5 levels of background
noise. "Discrimination score" is a measure of how well they could tell what
was being said. There are 5 lines of data per case. The variables are
- Subject identification code
- Interest in topic (politics)
- Sex (0=Male, 1=Female)
- Age category
- Noise level
- Time (Order of noise level presentation)
- Discrimination score
The noise data is a constructed data set.
- Pain.xls: Arthritis patients rated
their pain under 3 dosage levels of two different drugs. Every patient
tried each combination of drug and dosage level. This is a constructed
data set.
- Student's Sleep Data: The reference is Student (1908), "The probable
error of a mean," Biometrika 6, 1-25. These data are surely in the
public domain by now. The data file contains two variables for ten
patients suffering from insomnia. Each variable is actually a difference,
representing how much extra sleep a patient got when taking a sleeping
pill, compared to a baseline measurement. Drug 1 is Dextro-hyoscyamine
hydrobromide, while Drug 2 is Laevo-hyoscyamine hydrobromide.
- The mantids data
Mantids are insects, kind of like crickets or grasshoppers. When
frightened, they emit loud noises that function as alarm calls. I believe
they make the sounds by rubbing their hind legs together. The frequency
(number of calls per minute) may indicate how alarmed the mantids are.
In this study, caged mantids (either Female or Male) were randomly
assigned to be exposed to one of four predators (canaries), and the number
of alarm calls per minute was recorded. Each mantid was tested at three
distances from the predator: 8 cm, 13 cm and 18 cm. The three distances
were presented in different randomly chosen orders. There are three lines
of data per insect. The variables are
- Case identification number: Repeated on each line.
- Sex: 0=Male, 1=Female. Repeated on each line.
- Order of presentation: Repeated on each line.
- Predator (bird): Numbered 1-4. Repeated on each line.
- Distance: Distance from the predator in centimeters (1=8cm,
2=13cm, 3=18cm).
- Number of alarm calls per minute
The Mantids data are real, copyright 2008 Stephanie Hill. Dr. Hill gave
permission to use them for non-commercial purposes, but did not
specifically authorize their protection under the Creative Commons license.
- distract.data.txt: In a
study of the psychology of attention, subjects attempted to solve word
problems while listening to distracting background noise. The distracting
material was either music, or spoken words related to the problem they
were trying to solve. The distracting material was presented at three
different levels of loudness. Each subject attempted 10 problems at each
combination of loudness and type of distraction, for a total of 60
problems. Order of presentation was randomized. Data for each subject are
number correct in each of the six treatment combinations. This simulated
data set is protected by the Creative Commons license.
- MFdistract.data.txt: A version of distract.data.txt with gender.
- CO2.data.txt: The CO2 uptake of
six plants from Quebec and six plants from Mississippi was measured at
several levels of ambient CO2 concentration. Half the plants of each type
were chilled overnight before the experiment was conducted. This is an R
data set, and may be freely copied, used and distributed.
- Raptors06-07.data.txt: These are original data from the Toronto Raptors' 2006-2007 season. For
each regular season and playoff game, the following variables were recorded:
- Date
- Home or Away game
- Opponent
- Won or lost
- Days since last game
- Points scored by the Raptors
- Points scored by opponents
- Opponents' won-lost record the preceding year.
- Lynch.data.txt: These data come from Hovland and Sears' (1940) classic study of lynchings and the price of cotton in the Southern U.S. The data were typed by hand from a journal article published in 1940. I believe they are in the public domain.
- tubes.data.txt: Fungus was grown in test tubes for 14 days. Twice a day, the investigators' graduate students measured the length of the fungus in the tube, the fuzzy "leading edge" of the growth, and the number of sclerotia, which are like little seed pods; actually they're spores, not seeds. At the end of the experiment, they also weighed the sclerotia and calculated the slope of the least-squares line relating day to length. There are 14 lines of data per case. On each line, the variables are
- Line number
- Mycelial Compatibility Group (MCG): Genetic type of fungus
- Replication (1-4)
- Day
- Morning length
- Morning number of sclerotia
- Morning leading edge
- Evening length
- Evening number of sclerotia
- Evening leading edge
- Least-squares slope of morning measurements
- Least-squares slope of evening measurements
- Total weight of sclerotia at the end of the experiment
The last 3 measurements are missing except on day 14. Thanks to Linda Kohn, who gives permission for these data to be used for non-commercial purposes.
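The least-squares slope mentioned in the tubes description can be computed directly from the (day, length) pairs for a single tube. Here is a minimal Python sketch of that calculation. The function name and the example lengths are made up for illustration; they are not taken from tubes.data.txt, and this is not necessarily how the investigators did the calculation.

```python
def ls_slope(days, lengths):
    """Slope of the least-squares line relating day (x) to length (y)."""
    n = len(days)
    mean_day = sum(days) / n
    mean_len = sum(lengths) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((d - mean_day) * (v - mean_len) for d, v in zip(days, lengths))
    sxx = sum((d - mean_day) ** 2 for d in days)
    return sxy / sxx

# Hypothetical tube: 14 days of morning lengths growing exactly linearly
days = list(range(1, 15))
morning = [2.0 + 3.0 * d for d in days]
slope_morning = ls_slope(days, morning)
print(slope_morning)
```

Applying the same function to the evening lengths gives the second slope, and averaging the two gives a single linear growth rate per tube.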
- LittleTubes2.data.txt: This is a subset of the tubes data described above. It has just
- Line number 1-24
- Tube identification number
- Mycelial Compatibility Group (MCG): Genetic type of fungus
- Average of morning and evening length on day 10
- Average of morning and evening number of sclerotia on day 10
- Total weight of sclerotia at the end of the experiment
- Linear growth rate: Average of morning and evening least-squares slopes
Thanks again to Linda Kohn. These data (like the full set) have a nice
natural outlier, which the scientists attributed to contamination with
material of the wrong MCG.
- TV1.data.txt:
This file contains data from a 1982 survey conducted in Stevens County in
the United States. Well, actually Stevens County is fictitious, and the
data were simulated using a program written by Ted Chang of the University
of Virginia (see The American Statistician, 46 (1992), 232-237 for more
information), but the details are realistic -- or anyway, they were
realistic in 1982. The imaginary "Stevens County" is divided into 75
districts including rural, small-town and urban areas. For each of 500
households interviewed, the data file contains district number, household
number within district, assessed value of home in US dollars (an indirect
measure of income, which was not asked), and answers to 9 questions related
to the respondents' interest in getting cable TV. The variables are:
- District: 1-25 are rural, 26-50 small town, 51-75 city.
- Household (numbered within district)
- Assessed value of home in US dollars
- Number of persons 12 and older in household
- Number of persons 11 and younger in household
- Number of TV sets in Household
- Price willing to pay for cable TV
- Total TV hours watched last week (add hours for all persons in
household)
- Hours Public Affairs watched last week
- Hours Sports watched last week
- Hours Children's programming watched last week
- Hours Movies watched last week
- TV2.data.txt: This is a second
independent random sample of size 500, like TV1.data.txt.
- BeatTheBlues.data.txt: This R data set
contains data from a longitudinal clinical trial of an interactive, multimedia program known as "Beat the Blues" designed to deliver cognitive behavioural therapy to depressed patients via a computer terminal. Patients with depression recruited in primary care were randomised to either the Beating the Blues program, or to "Treatment as Usual" (TAU). The variables are
- id: Patient identification code
- drug: Did the patient take anti-depressant drugs (No or Yes).
- length: The length of the current episode of depression, a factor with values <6m (less than six months) and >6m (more than six months).
- treatment: Treatment group, a factor with levels TAU (treatment as usual) and BtheB (Beat the Blues)
- bdi_pre: Beck Depression Inventory score before treatment.
- bdi_2m: Beck Depression Inventory score after two months
- bdi_4m: Beck Depression Inventory score after four months
- bdi_6m: Beck Depression Inventory score after six months
- bdi_8m: Beck Depression Inventory score after eight months
- ToothGrowth.data.txt: The
response is the length of odontoblasts (teeth) for 10 guinea pigs at each
of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two
delivery methods (orange juice or ascorbic acid). This is the R dataset
ToothGrowth.
- rock1.data.txt: Lateral support
and breaking force for a sample of rock cores. This is a constructed data
set based closely on a consulting project, an interesting combination of
Geology and Engineering. I believe that the cores were randomly assigned
to different values of lateral support.
- openSENIC.data.txt: The Study on the Efficacy of Nosocomial Infection Control (SENIC) data are from a study of infections acquired in hospital. That is, patients are admitted to hospital for something, and while in hospital they get infections (such as pneumonia and urinary tract infections) that are unrelated to why they were admitted, and require treatment. This is a partially reconstructed data set based on one in Kutner et al.'s Applied Linear Statistical Models. In this aggregated data set, the cases are 100 U.S. hospitals. The variables are
- Hospital identification number
- Geographic region of U.S.
- Medical school affiliation (Yes or No)
- Number of patients
- Number of beds in the hospital
- Number of nurses
- Mean length of stay in days
- Mean age of patient in years
- xratio: Ratio of number of X-rays performed to number of patients
without signs or symptoms of pneumonia, times 100
- culratio: Ratio of number of cultures performed to number of patients without
signs or symptoms of hospital-acquired infection, times 100
- Percentage of patients who acquired an infection while in hospital.
Sometimes mini-epidemics spread through a hospital. The variables xratio and culratio represent special efforts to monitor the health of patients who show no signs of having gotten sick in hospital, yet. They are a kind of early warning system, intended to detect outbreaks of disease in the hospital so they can be dealt with before they get established. My guess is that xratio is primarily for pneumonia, and culratio is primarily for urinary tract infections.
- openSENIC0.data.txt: The open SENIC data without missing values.
- openSENIC2.data.txt: The open SENIC data with periods and some other strange codes instead of NAs for the missing values. Also for region, 1 = Northeast, 2 = North Central, 3 = South, 4 = West.
- HandEar.data.txt: The hand-ear dichotic listening study.
Left-handed and right-handed subjects push a key when they hear their
names over background noise. They are wearing stereo headphones. The signal
comes in the left ear, the right ear, or both. There are 50 trials in each
condition, presented in a different random order for each subject. The
response variable is median reaction time in milliseconds. Each subject
contributes 3 medians. These are simulated data.
- Bball1.data.txt: Right-handed
basketball players take right and left-handed hook shots from the left
baseline, the right baseline and the middle. Hit or miss is recorded for
each shot. These are simulated data.
- Bball10.data.txt: Like
Bball1.data.txt, except that the players take 10 shots. Number of hits from
each position with each hand is recorded.
- TVshows.data.txt: In a study
of consumers' opinions of 5 popular TV programmes, 240 consumers who watch
all the shows at least once a month completed a computerized interview. On
one of the screens, they indicated how much they enjoyed each programme by
mouse-clicking on a 10cm line. One end of the line was labelled "Like very
much," and the other end was labelled "Dislike very much." So each
respondent contributed 5 ratings, on a continuous scale from zero to
ten. The study was commissioned by the producers of one of the shows, which
will be called "Programme E." Ratings of Programmes A through D
were expressed as percentages of the rating for Programme E, and these
were described as "Liking indexed to programme E."
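The indexing described for the TV shows data is simple arithmetic: each rating of Programmes A through D is divided by the same respondent's rating of Programme E and multiplied by 100. A small Python sketch follows. The function name and the ratings for this one respondent are hypothetical, chosen only to show the calculation.

```python
def index_to_e(ratings):
    """Express ratings of A-D as percentages of the rating for E."""
    e = ratings["E"]
    if e == 0:
        # The index is undefined when Programme E is rated zero
        raise ValueError("rating for Programme E is zero")
    return {prog: 100.0 * r / e for prog, r in ratings.items() if prog != "E"}

# Hypothetical respondent, ratings on the zero-to-ten scale
respondent = {"A": 6.0, "B": 3.0, "C": 7.5, "D": 9.0, "E": 7.5}
indexed = index_to_e(respondent)
print(indexed)
```

A value over 100 means the respondent liked that programme more than Programme E. Note that the index is a ratio, so it is not defined for a respondent who rates Programme E at zero.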
- Univariate data sets (for estimation exercises). These are all
simulated data.
- awards.data.txt: Awards
received by students at a particular high school are thought to occur
according to a Poisson process. That is, the numbers of awards received by
students in one year are independent Poisson random variables, with mean
λ that may depend on characteristics of the student. The variables
are Student identification code, Number of awards, Program (1=General,
2=Academic, 3=Vocational), and Score on a test of general academic
knowledge.
- bodyfat.data.txt:
Percentage of body fat, age, weight, height, and ten body circumference
measurements (e.g., abdomen) are recorded for 252 men. Body fat, a
measure of health, is estimated through an underwater weighing
technique. Fitting body fat to the other measurements using multiple
regression provides a convenient way of estimating body fat for men
using only a scale and a measuring tape.
These data were generously supplied by Dr. A. Garth Fisher, Human Performance
Research Center, Brigham Young University, Provo, Utah 84602, who gave
permission to freely distribute the data and use them for non-commercial
purposes. See the article "Fitting Percentage of Body Fat to Simple Body
Measurements" in the Journal of Statistics Education (Johnson 1996).
VARIABLE DESCRIPTIONS:
Columns
  3 -   5   Case Number
 10 -  13   Percent body fat using Brozek's equation, 457/Density - 414.2
 18 -  21   Percent body fat using Siri's equation, 495/Density - 450
 24 -  29   Density (gm/cm^3)
 36 -  37   Age (yrs)
 40 -  45   Weight (lbs)
 49 -  53   Height (inches)
 58 -  61   Adiposity index = Weight/Height^2 (kg/m^2)
 65 -  69   Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
 74 -  77   Neck circumference (cm)
 81 -  85   Chest circumference (cm)
 89 -  93   Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"
 97 - 101   Hip circumference (cm)
106 - 109   Thigh circumference (cm)
114 - 117   Knee circumference (cm)
122 - 125   Ankle circumference (cm)
130 - 133   Extended biceps circumference (cm)
138 - 141   Forearm circumference (cm)
146 - 149   Wrist circumference (cm) "distal to the styloid processes"
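The derived variables in the body fat file follow from the formulas given in the column descriptions. The Python sketch below writes them out, assuming the file is fixed-width with the 1-based column positions listed above. The helper names, the example density, weight and height, and the example line are all hypothetical; only the equations and the column positions come from the description.

```python
LB_TO_KG = 0.45359237   # kilograms per pound
IN_TO_M = 0.0254        # metres per inch

def field(line, first, last):
    """Extract 1-based, inclusive columns first..last from a fixed-width line."""
    return line[first - 1:last].strip()

def brozek(density):
    """Percent body fat from density (gm/cm^3): 457/Density - 414.2."""
    return 457.0 / density - 414.2

def siri(density):
    """Percent body fat from density (gm/cm^3): 495/Density - 450."""
    return 495.0 / density - 450.0

def adiposity(weight_lb, height_in):
    """Weight/Height^2 in kg/m^2, converting from pounds and inches."""
    return (weight_lb * LB_TO_KG) / (height_in * IN_TO_M) ** 2

def fat_free_weight(weight_lb, density):
    """(1 - fraction of body fat) * Weight in lbs, using Brozek's formula."""
    return (1.0 - brozek(density) / 100.0) * weight_lb

# Hypothetical line with only the density field (columns 24-29) filled in
line = " " * 23 + "1.0700"
density = float(field(line, 24, 29))
print(brozek(density), siri(density))
print(adiposity(180.0, 70.0), fat_free_weight(180.0, density))
```

The unit conversion matters for the adiposity index: weight and height are stored in pounds and inches, but the index is in kg/m^2.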
- Diversity: Employees at Canadian corporations filled out
questionnaires about their jobs. Questionnaires employed 5-point scales,
where 5 indicates the highest level of the trait or opinion being assessed
(like job satisfaction) and 1 indicates the lowest level. The wording of
the questions was varied so that sometimes a 1 indicated higher
satisfaction (for example, strong disagreement with "I hate my job."), but
the numbers were switched around so that in the data file, larger numbers
always indicate more. Data consist of answers to
- Ten questions about commitment to the organization, with higher
numbers indicating more commitment.
- Five questions about relations with colleagues at work, with
higher numbers indicating better relations.
- Twelve questions about relations with management, in particular
the respondent's immediate boss. Higher numbers indicate better
relations.
- Six questions about fair opportunities for advancement, with
higher numbers indicating more fairness.
- Four questions about job satisfaction, with higher numbers
indicating more satisfaction.
- Three questions about senior management's commitment to
diversity, with higher numbers indicating more commitment.
These seem to be on a six-point scale instead of five.
- Gender: 0=Male, 1=Female
- Visible Minority status: 0=No, 1=Yes
- Education level, numbered 1-7. The exact meanings of the numbers
are unknown, but surely higher numbers must indicate more education,
mostly.
- Marital status: 1=never married, 2=married, 3=divorced or
separated, 4=widowed. This is a guess, but I'm fairly confident.
- Age in years
- Born outside Canada: 0=No, 1=Yes
There are two data sets of size n=500, randomly sampled from around 16,000
questionnaires. The idea is to arrive at conclusions and predictions based
on the exploratory sample, and then test them out on the replication
sample.
- Auto repair: A car rental company randomly assigned automobiles to one of three maintenance programs. Outcome was whether the car needed repairs, specifically repairs not required because of an accident or because of superficial damage like dings and chips to the windshield. Data were recorded in each of 12 successive months (the company only keeps the cars for one year, then sells them). There are 438 cars in the sample. For each month, variables are
- Car identification number
- Maintenance program (1 2 3)
- Month
- Cumulative number of customers who have rented the car
- Cumulative number of kilometers driven
- Repair (0=No, 1=Yes)
The data are available as an Excel spreadsheet and in plain text.
- bweight.data.txt: These data
come from a sample of mothers who recently had a baby. This is an R data
set that comes with the MASS package, so I assume there are no copyright
restrictions.
- low: Indicator of birth weight less than 2.5 kg. This
is clinically meaningful because babies in that category tend to
have health problems.
- age: Mother's age in years.
- lwt: Mother's weight in pounds at last menstrual period.
- race: Mother's race (1 = white, 2 = black, 3 = other).
- smoke: Smoking status during pregnancy.
- ptl: Number of previous premature labours.
- ht: History of hypertension.
- ui: Presence of uterine irritability.
- ftv: Number of physician visits during the first
trimester.
- bwt: Baby's birth weight in grams.
- Diet.xlsx: These data are from a study of
people trying to lose weight. Variables are
- Person: Identification code
- gender: 0=F, 1=M
- Age: In years
- Height: In cm.
- pre.weight: Weight in kg. before starting the diet.
- Diet: 1, 2, or 3, randomly assigned
- weight6weeks: Weight in kg. after 6 weeks on the diet
The diet data are from a University of Sheffield website:
https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv
I assume they were intended to be shared.
- HS-Program-Choice.data.txt: Incoming high school students
choose their programs of study. Variables are
- Gender: 0=Male, 1=Female
- Socioeconomic status: 1, 2, 3
- Math score
- Reading score
- Science score
- Social studies score
- Writing score
- Program choice: 1=general, 2=academic, 3=vocational
The program choice data are from UCLA:
https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression
- Rossi.data.txt: The recidivism data are from a paper by Rossi et al. (1980). I got them from the RcmdrPlugin.survival package. Convicts released from prison in Maryland in the 1970s were randomly assigned to either receive financial aid or not. They were followed for one year to see if they were re-arrested, and if so, how soon. The 62 variables are
- week: Week of first arrest after release or censoring; all censored observations are censored at 52 weeks.
- arrest: 1 if arrested, 0 if not arrested.
- fin: Financial aid: 0=no, 1=yes.
- age: Age: in years at time of release.
- race: Black or other.
- wexp: Full-time work experience before incarceration: 0=no, 1=yes.
- mar: Marital status at time of release: married or not married.
- paro: Released on parole? 0=no, 1=yes.
- prio: Number of convictions prior to current incarceration.
- educ: Level of education: 2 = 6th grade or less; 3 = 7th to 9th grade; 4 = 10th to 11th grade; 5 = 12th grade; 6 = some college.
- emp1: Employment status in the first week after release: 0=no, 1=yes.
- emp2: As above.
- . . .
- emp52: As above.
- RECID.dat.txt: Rossi's recidivism
data without the header, and with periods for missing values. This format
is more suitable for analysis with SAS.
- Rossi-ss.data.txt: Rossi's
recidivism data in a start-stop format suitable for survival analysis with
employment status as a time-dependent covariate. There is one line of data
per week, for a total of 19,809 data lines.
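To show what the start-stop format looks like, here is a rough Python sketch that expands one wide-format record (week, arrest, emp1, emp2, ...) into one line per week at risk. It is an illustration only. The function name, the output field names and the two example records are hypothetical, and the actual file may be laid out differently, for example by using lagged employment.

```python
def to_start_stop(person_id, week, arrest, emp):
    """Expand one wide record into one row per week at risk.

    emp is a list of weekly employment indicators; emp[0] is week 1.
    The arrest indicator is 1 only on the row for the final week, and
    only if the person was actually arrested (not censored).
    """
    rows = []
    for t in range(1, week + 1):
        rows.append({
            "id": person_id,
            "start": t - 1,
            "stop": t,
            "arrest": 1 if (arrest == 1 and t == week) else 0,
            "employed": emp[t - 1],
        })
    return rows

# Two hypothetical people: one arrested in week 3, one censored at week 52
wide = [
    {"id": 1, "week": 3, "arrest": 1, "emp": [0, 1, 0]},
    {"id": 2, "week": 52, "arrest": 0, "emp": [1] * 52},
]
long_rows = []
for rec in wide:
    long_rows.extend(to_start_stop(rec["id"], rec["week"], rec["arrest"], rec["emp"]))
print(len(long_rows))
```

Under this layout each person contributes as many lines as weeks observed, so the number of lines is the sum of the week variable over all cases. That is consistent with the one-line-per-week description of the file.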
- liver.data.txt: In the liver
disease data, patients were randomly assigned to one of two drugs, or to a
placebo. The data file includes age and sex (1=F). Blood platelet count was
recorded for each patient in each time period. This data set is in a
start-stop format suitable for survival analysis with platelet count as a
time-dependent covariate. These are simulated data.
- xy.data.txt: Simulated data for
simple regression through the origin. X is Poisson(3), epsilon is
U(-15,15), and true beta = 2.
- TinyWLS.data.txt: Another
simulated data set like xy.data.txt. This time, true beta = 0.25, x is
rounded U(1,5) and epsilon is normal, with variance proportional to x^2.
The weighted least squares estimate of beta is mean(y/x).
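The claim that the weighted least squares estimate is mean(y/x) can be checked numerically. With error variance proportional to x^2, the weights are w = 1/x^2, and for regression through the origin the estimate sum(w*x*y)/sum(w*x^2) reduces to sum(y/x)/n. The Python sketch below simulates data along the lines described and compares the two expressions. The sample size, the seed and the constant 0.1 in the error standard deviation are assumptions made for the illustration, not properties of TinyWLS.data.txt.

```python
import random

def wls_through_origin(x, y, w):
    """Weighted least squares slope for y = beta*x + error, no intercept."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den

random.seed(2024)          # arbitrary seed, for reproducibility
n, beta = 200, 0.25        # n is an assumption; beta = 0.25 as described

# x is a rounded U(1,5); the error SD is proportional to x
x = [float(round(random.uniform(1, 5))) for _ in range(n)]
y = [beta * xi + random.gauss(0, 0.1 * xi) for xi in x]

w = [1.0 / xi ** 2 for xi in x]                      # weights 1/x^2
beta_wls = wls_through_origin(x, y, w)
beta_ratio = sum(yi / xi for xi, yi in zip(x, y)) / n  # mean(y/x)
print(beta_wls, beta_ratio)
```

Setting all the weights to 1 in the same function gives the ordinary least squares estimate sum(x*y)/sum(x^2), which is what one would use for xy.data.txt if the error variance is taken to be constant.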
- titanic.data.txt: The Titanic
was a passenger ship that hit an iceberg and sank on its very first voyage
in 1912. It was the largest passenger ship in the world at the time, and
supposedly unsinkable. More than 1,500 of the roughly 2,200 passengers and
crew died. This file is from an R data set. More details are given at the
beginning of the file.
- selfesteem3.data.txt: This is the selfesteem2 data set from the R datarium package (with cases re-ordered to allow a multivariate data read). "Data are the self esteem score of 12 individuals enrolled in 2 successive short-term trials (4 weeks) - control (placebo) and special diet trials.
The self esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of the trials.
The same 12 participants are enrolled in the two different trials with enough time between trials." This means that every subject was in all 6 conditions.
- airquality.data.txt: This is the R data set airquality. It has daily readings of the following variables from May 1, 1973 to September 30, 1973, in New York City.
- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
- SolarRad: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.
- teengamb.data.txt: Teen
gambling data set from Faraway's Linear Models with R. The variables are
- sex: 0=male, 1=female
- status: Socioeconomic status score based on parents'
occupation
- income: in pounds per week
- verbal: verbal score in words out of 12 correctly defined
- gamble: expenditure on gambling in pounds per year
- bodymind.data.txt: Educational test
scores and physical measurements for a sample of high school students. This
is a modified subset of data reported in the journal Human Biology.
The reference is
Clark, P. J., Vandenberg, S. G., and Proctor, C. H. (1961), On the
relationship of scores on certain psychological tests with a number of
anthropometric characters and birth order in twins, Human Biology,
33, 163-180.
The data are used without permission, but I believe they have been
modified enough so that the original copyright no longer applies, and they
can be protected under a Creative Commons license. The variables are
- sex: F or M
- progmat: Progressive matrices (puzzle) score
- reason: Reasoning score
- verbal: Verbal (reading and vocabulary) score
- headlng: Head Length in mm
- headbrd: Head Breadth in mm
- headcir: Head Circumference in mm
- bizyg: Bizygomatic breadth in mm, basically the width of the
face across the cheekbones.
- weight: In pounds
- height: In cm.
- Berkeley.data.txt
- training.data.txt: Office
workers at a large insurance company are randomly assigned to one of 3
computer use training programmes, and their number of calls to IT
support during the following month is recorded. Additional information
on each worker includes years of experience and score on a computer
literacy test (out of 100). It is reasonable to model calls to IT
support as a Poisson process, and the question is whether training
programme affects the rate of the process.
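One simple way to frame the question, ignoring the experience and literacy covariates, is a likelihood ratio test of equal Poisson rates across the three programmes. Under a Poisson model the estimated rate for a group is its mean count, and the statistic is 2 * sum over groups of (group total) * log(group rate / pooled rate). The Python sketch below is only an illustration under those assumptions: the function name is hypothetical and the counts are made up, not taken from training.data.txt.

```python
import math

def poisson_lr_statistic(groups):
    """Likelihood ratio statistic for equal Poisson rates across groups.

    groups: list of lists of counts, one inner list per group.
    """
    total = sum(sum(g) for g in groups)
    n = sum(len(g) for g in groups)
    pooled_rate = total / n               # estimated rate if all groups share one rate
    stat = 0.0
    for g in groups:
        t = sum(g)
        if t > 0:                         # a group with zero total contributes 0
            stat += t * math.log((t / len(g)) / pooled_rate)
    return 2 * stat

# Made-up counts of IT support calls for three programmes (not from the file).
calls = [[3, 1, 4, 2], [2, 0, 1, 1], [5, 3, 4, 6]]
g2 = poisson_lr_statistic(calls)
print(round(g2, 3))
```

With three groups, the statistic would be compared with a chi-squared distribution on 2 degrees of freedom. A Poisson regression that includes the covariates would be the fuller analysis.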
- deathpen3.data.txt: Prisoners who were convicted of murder in Florida were classified as either Black or White, their victims were either Black or White, and they either got the death penalty, or they did not. I have a reference for this -- American Sociological Review or something ...
- choice.data.txt: In the Program Choice data, graduating grade eight students were choosing their High School program. The potential choices were Academic, General and Vocational. Predictor variables are gender, socioeconomic status, and scores on standardized tests in reading, writing, math and science.
- ltc.data.txt: LTC stands for Long Term Care. Operators of long-term care homes are very interested in whether their elderly
residents are going to survive, because they need to plan. In one study, the variables for
a sample of residents were
- One year survival (1=Yes, 0=No)
- Age in years
- Gender (1=F, 0=M)
- Indicator for dementia (1=Yes, 0=No)
To get a Yes for dementia, it has to be pretty serious, so that the person
cannot safely go outside without supervision.
- lognorm1.data.txt: Right
censored data from a log-normal distribution. These are simulated data,
with true μ = 0 and σ^2 = 1.
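For right-censored log-normal data, an uncensored time contributes the log-normal density to the likelihood and a censored time contributes the probability of surviving past the censoring time. The Python sketch below writes out that log-likelihood and evaluates it on freshly simulated data with μ = 0 and σ = 1. The fixed censoring time of 3, the sample size, the seed and the function names are assumptions; the censoring mechanism used for the actual file is not stated here.

```python
import math
import random

def norm_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def norm_sf(z):
    """Upper tail probability P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def loglik(mu, sigma, times, observed):
    """Log-likelihood for right-censored log-normal data.

    observed[i] is 1 if times[i] is an exact time and 0 if the true
    time is only known to exceed times[i].
    """
    total = 0.0
    for t, d in zip(times, observed):
        z = (math.log(t) - mu) / sigma
        if d:
            # log-normal density: phi(z) / (sigma * t)
            total += norm_logpdf(z) - math.log(sigma * t)
        else:
            # survival probability P(T > t)
            total += math.log(norm_sf(z))
    return total

rng = random.Random(1)      # arbitrary seed (assumption)
c = 3.0                     # fixed censoring time (assumption)
raw = [rng.lognormvariate(0, 1) for _ in range(500)]
times = [min(t, c) for t in raw]
observed = [1 if t <= c else 0 for t in raw]
print(loglik(0, 1, times, observed), loglik(1, 1, times, observed))
```

Maximizing loglik over mu and sigma with a numerical optimizer would give the maximum likelihood estimates.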
- ColonCancer.data.txt:
This is a subset of the colon data set from the survival
package. A sample of advanced colon cancer patients had surgery that
removed all detectable cancer. Patients were randomly assigned to one of
three drug treatments:
- Obs: Just observed, without any drug.
- Lev: Levamisole, a low-toxicity compound previously used to treat worm infestations in animals
- Lev+5FU: A combination of levamisole and 5-FU, a "moderately" toxic chemotherapy agent.
The variables are
- rx: Drug treatment group
- sex: 0=Female, 1=Male
- age: Age in years
- nodes: Number of lymph nodes affected
- status: 0=Right censored, 1=Uncensored
- time: Time until censoring or recurrence of the cancer
- area51.data.txt:
If you spend time on the right social media sites, you will have
heard of Area 51, a restricted region in the Nevada desert. It is
widely believed that if you go hiking in Area 51, you have a good
chance of being kidnapped by space aliens. There are indications that
whether you are wearing a hat matters. To test this idea and in the
interests of transparency, volunteers (there are plenty) went walking
in the desert under controlled conditions.
Each day for up to 30 days, the volunteer was dropped off by
helicopter at a random location in Area 51. The volunteer was either
wearing a hat or not, determined by a coin toss at the beginning of
each day. Area 51 is out of ordinary cell phone range, but the
U. S. military has a cell phone tower there, enabling the
experimenters to stay in contact with the volunteers, and to determine
their exact location.
Kidnapping has a distinct signature. The volunteer's cell phone
signal is suddenly interrupted. A helicopter is dispatched immediately
to their last location and a search is initiated, but the volunteer is
never found.
If a volunteer is not kidnapped on a particular day, then after
exactly eight hours, the helicopter picks up the volunteer for
transport back to the base. If the volunteer is not kidnapped within 30
days, the observation is censored. Censoring can occur earlier if the
volunteer withdraws from the study, or suffers a medical emergency.
The file area51.data.txt
contains a subset of the data from 200 volunteers, in a start-stop
format. Variables include age, sex, an indicator for wearing a hat
(which varies over time), event times, and a binary variable called
taken, which equals one if a kidnapping occurred, and zero
if there was no kidnapping. Event times are in minutes, with 8*60=480
minutes in an eight hour day. The clock stops at the end of each
eight hour day, and re-starts when the volunteer is dropped off in
the desert the next morning. Thanks to General Buck Turgidson for
permission to use these data.
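Because the clock only runs during the 480 minutes of each day's walk, an event 100 minutes into day 3 is at time 2*480 + 100 = 1060 on the study clock. The Python sketch below builds start-stop records of this kind for one volunteer, with the hat indicator changing from day to day. It assumes one record per day and a (start, stop, hat, taken) layout; the function name is hypothetical and the actual layout of area51.data.txt may differ.

```python
MINUTES_PER_DAY = 8 * 60  # 480 minutes of exposure per day

def start_stop_rows(hat_by_day, end_minute=None, taken=0):
    """Build (start, stop, hat, taken) rows, one per day on study.

    hat_by_day: 0/1 hat indicator for each day the volunteer walked.
    end_minute: minutes into the final day at which follow-up ended;
                None means the final day was completed.
    taken:      1 if follow-up ended in a kidnapping, 0 if censored.
    """
    rows = []
    last = len(hat_by_day) - 1
    for day, hat in enumerate(hat_by_day):
        start = day * MINUTES_PER_DAY      # clock restarts where it stopped
        stop = start + MINUTES_PER_DAY
        event = 0
        if day == last:
            if end_minute is not None:
                stop = start + end_minute
            event = taken                  # only the final interval can end in an event
        rows.append((start, stop, hat, event))
    return rows

# Hypothetical volunteer: hat on days 1 and 3, taken 100 minutes into day 3.
rows = start_stop_rows([1, 0, 1], end_minute=100, taken=1)
for r in rows:
    print(r)
```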
- expseries.data.txt:
Three experimental treatments and a single response variable, but the
data are collected in order one case at a time, so there might be a
time series structure. These are simulated data.