Legal data


These data sets may be used for any non-commercial purpose. Here are some definitions.

Original data: This is the real thing. Owners of the data have given permission to use the data for non-commercial purposes.
Simulated data: Data produced by random number generation, usually with R.
Constructed data: Data are made up, using a combination of random number generation and manual editing.
Re-constructed data: I start with a set of statistics derived directly or indirectly from a published source, and then simulate data that yield roughly (but not exactly) the same values of the statistics. I freely round the simulated data, change the sample size, and even add variables that the investigators probably would have measured, given sufficient resources. Finally, I modify the data in any other way I can think of to make the example more instructive.
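
As a rough illustration of the re-construction idea, here is a minimal Python sketch. It is not the procedure actually used for any file on this page: the target mean, standard deviation and sample size are invented, and a normal distribution is assumed only for convenience.

```python
import random
import statistics

random.seed(2)  # arbitrary seed, for reproducibility only

# Hypothetical "published" summary statistics, invented for this sketch.
target_mean, target_sd, n = 50.0, 10.0, 40

# Simulate, then round freely. The sample statistics will match the
# targets roughly, not exactly.
data = [round(random.gauss(target_mean, target_sd)) for _ in range(n)]

print(statistics.mean(data), statistics.stdev(data))
```

Further manual editing (changing the sample size, adding variables) would follow the same pattern of starting from summary values rather than from the original observations.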

Data files

mcars4.data: A constructed data set based roughly on the variables in an SPSS data set.
mcars4b.data.txt: mcars4.data without the header that R uses.
mcars4.xlsx: mcars4.data in an Excel spreadsheet.
mathtest.txt: Before the beginning of the Fall term, students in a first-year Calculus class took a diagnostic test with two parts: Pre-calculus and Calculus. Their High School Calculus marks and their marks in University Calculus were also available. In order, the variables in the data file are: Identification code, Mark in High School Calculus, Score on the Pre-calculus portion of the diagnostic test, Score on the Calculus portion of the diagnostic test, and mark in University Calculus. Thanks to Dr. Cleo Boyd for permission to use these original data.
math.data.txt: This is a fuller version of the data given in mathtest.txt above. Thanks again to Cleo Boyd. The variables are
1. Identification code
2. Course: 1=Catch-up 2=Mainstream 3=Elite 4=NoResponse
3. Score on pre-calculus part of diagnostic test
4. Score on calculus part of diagnostic test
5. High School GPA
6. High School Calculus mark
7. High School English mark
8. University Calculus mark
9. First language
10. Sex
11. National background according to rater one
12. National background according to rater two
  1. Chinese
  2. Japanese
  3. Korean
  4. Vietnamese
  5. Other Asian
  6. Eastern European
  7. Hispanic
  8. English-speaking
  9. French
  10. Italian
  11. Greek
  12. Germanic
  13. Other European
  14. Middle-Eastern
  15. Pakistani
  16. East Indian
  17. Sub-Saharan
  18. OTHER or DK
13. Sample: 1=Exploratory, 2=Replication (Students were randomly divided into Exploratory and Replication samples.)
Here are the Exploratory and Replication data in two separate files. The exploratory data set has a collapsed coding of nationality.
- mathexplore.data.txt
  1. Asian
  2. Eastern European
  3. European not Eastern
  4. Middle-Eastern and Pakistani
  5. East Indian
  6. Other and DK
- mathreplic.data.txt
math1.data.txt: A cut down, R friendly version of exploremath.data.txt. The last 3 columns of exploremath.data.txt have been omitted. These are real data, with 776 missing values coded as NA. These data have rough edges that have deliberately not been fixed.
mathcat.data.txt: An even more cut down and sanitized version of exploremath.data.txt, with no missing values and categorical outcomes. The variables are
- hsgpa
- hsengl
- hscalc
- course (Catch-up, Elite, Mainstrm)
- passed (No, Yes)
- outcome (Passed, Failed, Disappeared)
mathcat-replic.data.txt: Like mathcat, but for the replication sample.
marks.data.txt: Quiz average, Average on computer assignments, Midterm and Final from a class years ago. These are original data.
openSAT.data.txt: SAT Verbal, SAT Math and first year GPA. This is a reconstructed data set based on a Minitab data set.
Babydouble.data.txt: Simulated W₁, W₂, Y for an easy double measurement regression example.
bmihealth.data.txt: A constructed data set illustrating the advantages of the double measurement design. Health-related information was obtained from participants independently, in two sets of measurements. Need to give more information!
openpigs.data.txt: Reconstructed data based on a data set in Fuller's Measurement error models.
- Col 1: Farm
- Col 2: Number breeding sows on farm June 1: Questionnaire One
- Col 3: Number sows giving birth June 1 to Aug. 31: Questionnaire One
- Col 4: Number breeding sows on farm June 1: Questionnaire Two
- Col 5: Number sows giving birth June 1 to Aug. 31: Questionnaire Two
Circle.data.txt: Data were simulated from a model that fails the Parameter Count Rule, but the parameters are all identifiable where β₁ = β₂ = 0. The variables Y₁ and Y₂ were generated with true β₁² + β₂² = 1, while Y₃ and Y₄ were generated with true β₁ = β₂ = 0.
timmy1.data.txt: A major Canadian coffee shop chain is trying to break into the U.S. market. They assess the following variables twice on a random sample of coffee-drinking adults. The two measurements of each variable are conducted at different times by different interviewers asking somewhat different questions, in such a way that the errors of measurement may be assumed independent. The variables are
- Brand Awareness: Familiarity with the coffee shop chain
- Advertising Awareness: Recall for advertising of the coffee shop chain
- Interest in the product category: Mostly this was how much they say they like doughnuts.
- Purchase Intention: Expressed willingness to go to an outlet of the coffee shop chain and make an order.
- Purchase behaviour: Reported dollars spent at the chain during the 2 months following the interview.
All variables were measured on a scale from 0 to 100 except purchase behaviour, which is in dollars.
```
           w1 = Brand Awareness 1
           w2 = Brand Awareness 2
           w3 = Ad Awareness 1
           w4 = Ad Awareness 2
           w5 = Interest 1
           w6 = Interest 2
           v1 = Purchase Intention 1
           v2 = Purchase Intention 2
           v3 = Purchase Behaviour 1
           v4 = Purchase Behaviour 2;
```
fish.txt: Original data submitted to the Journal of Statistics Education by Juha Puranen, Department of Statistics, University of Helsinki, Finland.
fish.data.txt: The fish data without the description.
LittleStatclassdata.txt: Quiz average, Computer assignment average, Midterm score and Final Exam score from a statistics class, long ago.
LittleStatClassData2.txt: LittleStatclassdata.txt with two additional categorical variables: Sex and Race. These are just my guesses, of course.
LittleStatClassData2b.txt: LittleStatClassData2.txt without the header that R uses.
statclass.data.txt: The full version of the statclass data, with individual quiz scores and computer assignment scores.
pigweight.data.txt: Pigs are routinely given large doses of antibiotics even when they show no signs of illness, to protect their health under unsanitary conditions. Pigs were randomly assigned to one of three antibiotics. Dressed weight (weight of the pig after slaughter and removal of head, intestines and skin) was the response variable. Mother's and father's live adult weight were used as covariates. This is a simulated data set.
ScabDisease.data.txt: This is a reconstructed data set based on an example in Cochran and Cox's (1958) classic text Experimental design. The original data appear on page 97 of Cochran and Cox's book.
Scab disease is a fungal infection that affects potatoes. The fungus does not grow well in acidic soil, so investigators designed a study to see whether adding sulphur to the soil would reduce the scab disease. In a completely randomized design, plots of land were randomly assigned to either a control condition or to several levels of sulphur that was spread on the land in the Spring or Fall. The amounts of sulphur were either 300 pounds per acre, 600 pounds per acre or 1200 pounds per acre. The potatoes were harvested at the end of the growing season. One hundred potatoes were randomly selected from each plot of land. The potatoes were washed, and then a lab assistant estimated the percent of each potato's surface that was infected with scab disease. The response variable is, for each plot of land, the mean percent of the potato's surface covered with scab disease. The explanatory variable is pounds of sulphur, in hundreds of pounds; the control is zero.
ScabDisease.xlsx: The reconstructed scab disease data in a MS Excel spreadsheet.
ChickWeight.data.txt: This is an R dataset called chickwts. In the Chick Weights study, newly hatched chickens were randomly assigned to one of six different feed supplements, and their weight in grams after 6 weeks was recorded.
ChickWeight.xlsx: The Chick Weight data as an Excel spreadsheet.
ChickenWeight.data.txt: This is getting confusing. This is the R data set ChickWeight, as distinct from chickwts. Borrowing some words from the R documentation,

These data are from an experiment on the effect of diet on early growth of chicks. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets. The variables are
- weight: Body weight in grams
- Time: Days since birth
- Chick: Identification code
- Diet: 1, 2, 3 or 4
sales.data.txt: Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. This is a constructed data set that illustrates interactions.
sales.data.xlsx: The telephone sales data as an Excel spreadsheet.
bunnies.data.txt: An experiment in dentistry was designed to test the effectiveness of a drug (HEBP) that is supposed to help dental implants become more firmly attached to the jaw bone. This is an initial test on animals. False teeth were implanted into the leg bones of rabbits, and the rabbits were randomly assigned to receive either the drug or a saline solution (placebo). Technicians administering the drug were blind to experimental condition.
Rabbits were also randomly assigned to be "sacrificed" after either 3, 6, 9 or 12 days. At that time, the implants were pulled out of the bone by a machine that measures force in newtons and stiffness in newtons/mm. For both of these measurements, higher values indicate more healing. A measure of "pre-load stiffness" in newtons/mm is also available for each animal. This may be another indicator of how firmly the false tooth was implanted into the bone, but it might even be a covariate. Nobody can seem to remember what "preload" means.
The variables are
1. Identification code
2. Time (3,6,9,12 days of healing)
3. Drug (1=HEBP, 0=saline solution)
4. Stiffness in newtons/mm
5. Force in newtons
6. Preload stiffness in newtons/mm
These are original data. Thanks to Dr. E. Fudd of the University of Transylvania, who has given permission to use these data for non-commercial purposes.
mixedup.data.txt: Data from a balanced 3x4 design -- no content. They are useful for illustrating nested and mixed models.
noise.data.txt: Participants listened to brief political discussions under 5 levels of background noise. "Discrimination score" is a measure of how well they could tell what was being said. There are 5 lines of data per case. The variables are
1. Subject identification code
2. Interest in topic (politics)
3. Sex (0=Male, 1=Female)
4. Age category
5. Noise level
6. Time (Order of noise level presentation)
7. Discrimination score
The noise data is a constructed data set.
Pain.xls: Arthritis patients rated their pain under 3 dosage levels of two different drugs. Every patient tried each combination of drug and dosage level. This is a constructed data set.
Student's Sleep Data: The reference is Student (1908), "The probable error of a mean," Biometrika 6, 1-25. These data are surely in the public domain by now. The data file contains two variables for ten patients suffering from insomnia. Each variable is actually a difference, representing how much extra sleep a patient got when taking a sleeping pill, compared to a baseline measurement. Drug 1 is Dextro-hyoscyamine hydrobromide, while Drug 2 is Laevo-hyoscyamine hydrobromide.
- studentsleep.data.txt: Plain text file
- studentsleep.xlsx: Excel spreadsheet
The mantids data
- mantids.data.txt: Plain text file
- Mantids.xls: Excel spreadsheet
Mantids are insects, kind of like crickets or grasshoppers. When frightened, they emit loud noises that function as alarm calls. I believe they make the sounds by rubbing their hind legs together. The frequency (number of calls per minute) may indicate how alarmed the mantids are.
In this study, caged mantids (either Female or Male) were randomly assigned to be exposed to one of four predators (canaries), and the number of alarm calls per minute was recorded. Each mantid was tested at three distances from the predator: 8 cm, 13 cm and 18 cm. The three distances were presented in different randomly chosen orders. There are three lines of data per insect. The variables are
- Case identification number: Repeated on each line.
- Sex: 0=Male, 1=Female. Repeated on each line.
- Order of presentation: Repeated on each line.
- Predator (bird): Numbered 1-4. Repeated on each line.
- Distance: Distance from the predator in centimeters (1=8cm, 2=13cm, 3=18cm).
- Number of alarm calls per minute
The Mantids data are real, copyright 2008 Stephanie Hill. Dr. Hill gave permission to use them for non-commercial purposes, but did not specifically authorize their protection under the Creative Commons license.
distract.data.txt: In a study of the psychology of attention, subjects attempted to solve word problems while listening to distracting background noise. The distracting material was either music, or spoken words related to the problem they were trying to solve. The distracting material was presented at three different levels of loudness. Each subject attempted 10 problems at each combination of loudness and type of distraction, for a total of 60 problems. Order of presentation was randomized. Data for each subject are number correct in each of the six treatment combinations. This simulated data set is protected by the Creative Commons license.
MFdistract.data.txt: A version of distract.data.txt with gender.
CO2.data.txt: The CO2 uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted. This is an R data set, and may be freely copied, used and distributed.
Raptors06-07.data.txt: These are original data from the Toronto Raptors' 2006-2007 season. For each regular season and playoff game, the following variables were recorded:
- Date
- Home or Away game
- Opponent
- Won or lost
- Days since last game
- Points scored by the Raptors
- Points scored by opponents
- Opponents' won-lost record the preceding year.
Lynch.data.txt: These data come from Hovland and Sears' (1940) classic study of lynchings and the price of cotton in the Southern U.S. The data were hand typed, copied from a journal article published in 1940. I believe they are in the public domain.
tubes.data.txt: Fungus was grown in test tubes for 14 days. Twice a day, the investigators' graduate students measured the length of the fungus in the tube, the fuzzy "leading edge" of the growth, and the number of sclerotia, which are like little seed pods; actually they're spores, not seeds. At the end of the experiment, they also weighed the sclerotia and calculated the slope of the least-squares line relating day to length. There are 14 lines of data per case. On each line, the variables are
- Line number
- Mycelial Compatibility Group (MCG): Genetic type of fungus
- Replication (1-4)
- Day
- Morning length
- Morning number of sclerotia
- Morning leading edge
- Evening length
- Evening number of sclerotia
- Evening leading edge
- Least-squares slope of morning measurements
- Least-squares slope of evening measurements
- Total weight of sclerotia at the end of the experiment
The last 3 measurements are missing except on day 14. Thanks to Linda Kohn, who gives permission for these data to be used for non-commercial purposes.
LittleTubes2.data.txt: This is a subset of the tubes data described above. It has just
- Line number 1-24
- Tube identification number
- Mycelial Compatibility Group (MCG): Genetic type of fungus
- Average of morning and evening length on day 10
- Average of morning and evening number of sclerotia on day 10
- Total weight of sclerotia at the end of the experiment
- Linear growth rate: Average of morning and evening least-squares slopes
Thanks again to Linda Kohn. These data (like the full set) have a nice natural outlier, which the scientists attributed to contamination with material of the wrong MCG.
TV1.data.txt: This file contains data from a 1982 survey conducted in Stevens County in the United States. Well, actually Stevens County is fictitious, and the data were simulated using a program written by Ted Chang of the University of Virginia (see The American Statistician, 46 (1992), 232-237 for more information), but the details are realistic -- or anyway, they were realistic in 1982. The imaginary "Stevens County" is divided into 75 districts including rural, small-town and urban areas. For each of 500 households interviewed, the data file contains district number, household number within district, assessed value of home in US dollars (an indirect measure of income, which was not asked), and answers to 9 questions related to the respondents' interest in getting cable TV. The variables are:
1. District: 1-25 are rural, 26-50 small town, 51-75 city.
2. Household (numbered within district)
3. Assessed value of home in US dollars
4. Number of persons 12 and older in household
5. Number of persons 11 and younger in household
6. Number of TV sets in Household
7. Price willing to pay for cable TV
8. Total TV hours watched last week (add hours for all persons in household)
9. Hours Public Affairs watched last week
10. Hours Sports watched last week
11. Hours Children's programming watched last week
12. Hours Movies watched last week
TV2.data.txt: This is a second independent random sample of size 500, like TV1.data.txt.
BeatTheBlues.data.txt: This R data set contains data from a longitudinal clinical trial of an interactive, multimedia program known as "Beat the Blues" designed to deliver cognitive behavioural therapy to depressed patients via a computer terminal. Patients with depression recruited in primary care were randomised to either the Beating the Blues program, or to "Treatment as Usual" (TAU). The variables are
- id: Patient identification code
- drug: Did the patient take anti-depressant drugs (No or Yes).
- length: The length of the current episode of depression, a factor with values <6m (less than six months) and >6m (more than six months).
- treatment: Treatment group, a factor with levels TAU (treatment as usual) and BtheB (Beat the Blues)
- bdi_pre: Beck Depression Inventory score before treatment.
- bdi_2m: Beck Depression Inventory score after two months
- bdi_4m: Beck Depression Inventory score after four months
- bdi_6m: Beck Depression Inventory score after six months
- bdi_8m: Beck Depression Inventory score after eight months
ToothGrowth.data.txt: The response is the length of odontoblasts (teeth) for 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid). This is the R dataset ToothGrowth.
rock1.data.txt: Lateral support and breaking force for a sample of rock cores. This is a constructed data set based closely on a consulting project, an interesting combination of Geology and Engineering. I believe that the cores were randomly assigned to different values of lateral support.
openSENIC.data.txt: The Study on the Efficacy of Nosocomial Infection Control (SENIC) data are from a study of infections acquired in hospital. That is, patients are admitted to hospital for something, and while in hospital they get infections (such as pneumonia and urinary tract infections) that are unrelated to why they were admitted, and require treatment. This is a partially reconstructed data set based on one in Kutner et al.'s Applied Linear Statistical Models. In this aggregated data set, the cases are 100 U.S. hospitals. The variables are
1. Hospital identification number
2. Geographic region of U.S.
3. Medical school affiliation (Yes or No)
4. Number of patients
5. Number of beds in the hospital
6. Number of nurses
7. Mean length of stay in days
8. Mean age of patient in years
9. xratio: Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100
10. culratio: Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
11. Percentage of patients who acquired an infection while in hospital.
Sometimes mini-epidemics spread through a hospital. The variables xratio and culratio represent special efforts to monitor the health of patients who show no signs of having gotten sick in hospital, yet. They are a kind of early warning system, intended to detect outbreaks of disease in the hospital so they can be dealt with before they get established. My guess is that xratio is primarily for pneumonia, and culratio is primarily for urinary tract infections.
openSENIC0.data.txt: The open SENIC data without missing values.
openSENIC2.data.txt: The open SENIC data with periods and some other strange codes instead of NAs for the missing values. Also for region, 1 = Northeast, 2 = North Central, 3 = South, 4 = West.
HandEar.data.txt: The hand-ear dichotic listening study.
Left-handed and right-handed subjects push a key when they hear their names over background noise. They are wearing stereo headphones. The signal comes in the left ear, the right ear, or both. There are 50 trials in each condition, presented in a different random order for each subject. The response variable is median reaction time in milliseconds. Each subject contributes 3 medians. These are simulated data.
Bball1.data.txt: Right-handed basketball players take right- and left-handed hook shots from the left baseline, the right baseline and the middle. Hit or miss is recorded for each shot. These are simulated data.
Bball10.data.txt: Like Bball1.data.txt, except that the players take 10 shots. Number of hits from each position with each hand is recorded.
TVshows.data.txt: In a study of consumers' opinions of 5 popular TV programmes, 240 consumers who watch all the shows at least once a month completed a computerized interview. On one of the screens, they indicated how much they enjoyed each programme by mouse-clicking on a 10cm line. One end of the line was labelled "Like very much," and the other end was labelled "Dislike very much." So each respondent contributed 5 ratings, on a continuous scale from zero to ten. The study was commissioned by the producers of one of the shows, which will be called "Programme E." Ratings of Programmes A through D were expressed as percentages of the rating for Programme E, and these were described as "Liking indexed to programme E."
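
The indexing arithmetic for the TV shows data can be sketched in Python as follows. This is only an illustration: the function name and the ratings are invented, and the actual layout of TVshows.data.txt is not described here.

```python
def index_to_e(ratings):
    """Express ratings of Programmes A-D as percentages of the rating for E."""
    e = ratings["E"]
    if e == 0:
        # The index is undefined when Programme E is rated exactly zero.
        raise ValueError("Programme E rating is zero; the index is undefined")
    return {show: 100.0 * r / e for show, r in ratings.items() if show != "E"}

# Invented ratings on the 0-10 scale, for illustration only.
example = {"A": 6.0, "B": 8.0, "C": 4.0, "D": 10.0, "E": 8.0}
indexed = index_to_e(example)
print(indexed)
```

An index above 100 means the respondent liked that programme more than Programme E.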
Univariate data sets (for estimation exercises). These are all simulated data.
- cauchy.data.txt
- beta.data.txt
- mystery.data.txt
- gamma.data.txt
- poisson.data.txt
- normal.data.txt
- pareto.data.txt
- inversegamma.data.txt
- Weibull.data1.txt
- Weibull.data2.txt
- bowlhaz.data.txt
- expo.data2.txt
awards.data.txt: Awards received by students at a particular high school are thought to occur according to a Poisson process. That is, the numbers of awards received by students in one year are independent Poisson random variables, with mean λ that may depend on characteristics of the student. The variables are Student identification code, Number of awards, Program (1=General, 2=Academic, 3=Vocational), and Score on a test of general academic knowledge.

bodyfat.data.txt: Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) are recorded for 252 men. Body fat, a measure of health, is estimated through an underwater weighing technique. Fitting body fat to the other measurements using multiple regression provides a convenient way of estimating body fat for men using only a scale and a measuring tape.

These data were generously supplied by Dr. A. Garth Fisher, Human Performance Research Center, Brigham Young University, Provo, Utah 84602, who gave permission to freely distribute the data and use them for non-commercial purposes. See the article "Fitting Percentage of Body Fat to Simple Body Measurements" in the Journal of Statistics Education (Johnson 1996).

VARIABLE DESCRIPTIONS:
Columns
  3 -   5  Case Number
 10 -  13  Percent body fat using Brozek's equation, 
           457/Density - 414.2
 18 -  21  Percent body fat using Siri's equation, 
           495/Density - 450
 24 -  29  Density (gm/cm^3)
 36 -  37  Age (yrs)
 40 -  45  Weight (lbs)
 49 -  53  Height (inches)
 58 -  61  Adiposity index = Weight/Height^2 (kg/m^2)
 65 -  69  Fat Free Weight 
           = (1 - fraction of body fat) * Weight, 
           using Brozek's formula (lbs)
 74 -  77  Neck circumference (cm)
 81 -  85  Chest circumference (cm)
 89 -  93  Abdomen circumference (cm) "at the umbilicus 
           and level with the iliac crest"
 97 - 101  Hip circumference (cm)
106 - 109  Thigh circumference (cm)
114 - 117  Knee circumference (cm)
122 - 125  Ankle circumference (cm)
130 - 133  Extended biceps circumference (cm)
138 - 141  Forearm circumference (cm)
146 - 149  Wrist circumference (cm) "distal to the 
           styloid processes"
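
The column layout and the two density formulas above can be sketched in Python. This is a minimal illustration under stated assumptions: the function names and the sample line are hypothetical, and the column positions are read as 1-based and inclusive.

```python
def field(line, first, last):
    """Return the text in 1-based, inclusive columns first..last."""
    return line[first - 1:last].strip()

def brozek(density):
    """Percent body fat from Brozek's equation: 457/Density - 414.2."""
    return 457.0 / density - 414.2

def siri(density):
    """Percent body fat from Siri's equation: 495/Density - 450."""
    return 495.0 / density - 450.0

# Made-up line for illustration only: a density of 1.1000 in columns 24-29.
sample_line = " " * 23 + "1.1000"
density = float(field(sample_line, 24, 29))
print(brozek(density), siri(density))
```

The same `field` helper could be applied to any of the other column ranges listed above, for example `field(line, 36, 37)` for age.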

Diversity: Employees at Canadian corporations filled out questionnaires about their jobs. Questionnaires employed 5-point scales, where 5 indicates the highest level of the trait or opinion being assessed (like job satisfaction) and 1 indicates the lowest level. The wording of the questions was varied so that sometimes a 1 indicated higher satisfaction (for example, strong disagreement with "I hate my job."), but the numbers were switched around so that in the data file, larger numbers always indicate more. Data consist of answers to
- Ten questions about commitment to the organization, with higher numbers indicating more commitment.
- Five questions about relations with colleagues at work, with higher numbers indicating better relations.
- Twelve questions about relations with management, in particular the respondent's immediate boss. Higher numbers indicate better relations.
- Six questions about fair opportunities for advancement, with higher numbers indicating more fairness.
- Four questions about job satisfaction, with higher numbers indicating more satisfaction.
- Three questions about senior management's commitment to diversity, with higher numbers indicating more commitment. These seem to be on a six-point scale instead of five.
- Gender: 0=Male, 1=Female
- Visible Minority status: 0=No, 1=Yes
- Education level, numbered 1-7. The exact meanings of the numbers are unknown, but surely higher numbers must indicate more education, mostly.
- Marital status: 1=never married, 2=married, 3=divorced or separated, 4=widowed. This is a guess, but I'm fairly confident.
- Age in years
- Born outside Canada: 0=No, 1=Yes
There are two data sets of size n=500, randomly sampled from around 16,000 questionnaires. The idea is to arrive at conclusions and predictions based on the exploratory sample, and then test them out on the replication sample.
- DiversityExplore.xlsx
- DiversityReplic.xlsx
Auto repair: A car rental company randomly assigned automobiles to one of three maintenance programs. Outcome was whether the car needed repairs, specifically repairs not required because of an accident or because of superficial damage like dings and chips to the windshield. Data were recorded in each of 12 successive months (the company only keeps the cars for one year, then sells them). There are 438 cars in the sample. For each month, variables are
- Car identification number
- Maintenance program (1 2 3)
- Month
- Cumulative number of customers who have rented the car
- Cumulative number of kilometers driven
- Repair (0=No, 1=Yes)
The data are available as an Excel spreadsheet and in plain text.
- Spreadsheet
- Plain text
bweight.data.txt: These data come from a sample of mothers who recently had a baby. This is an R data set that comes with the MASS package, so I assume there are no copyright restrictions.
- low: Indicator of birth weight less than 2.5 kg. This is clinically meaningful because babies in that category tend to have health problems.
- age: Mother's age in years.
- lwt: Mother's weight in pounds at last menstrual period.
- race: Mother's race (1 = white, 2 = black, 3 = other).
- smoke: Smoking status during pregnancy.
- ptl: Number of previous premature labours.
- ht: History of hypertension.
- ui: Presence of uterine irritability.
- ftv: Number of physician visits during the first trimester.
- bwt: Baby's birth weight in grams.
Diet.xlsx: These data are from a study of people trying to lose weight. Variables are
- Person: Identification code
- gender: 0=F, 1=M
- Age: In years
- Height: In cm.
- pre.weight: Weight in kg. before starting the diet.
- Diet: 1, 2, or 3, randomly assigned
- weight6weeks: Weight in kg. after 6 weeks on the diet
The diet data are from a University of Sheffield website:
https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv
I assume they were intended to be shared.
HS-Program-Choice.data.txt: Incoming high school students choose their programs of study. Variables are
- Gender: 0=Male, 1=Female
- Socioeconomic status: 1, 2, 3
- Math score
- Reading score
- Science score
- Social studies score
- Writing score
- Program choice: 1=general, 2=academic, 3=vocational
The program choice data are from UCLA: https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression
Rossi.data.txt: The recidivism data are from a paper by Rossi et al. (1980). I got them from the RcmdrPlugin.survival package. Convicts released from prison in Maryland in the 1970s were randomly assigned to either receive financial aid or not. They were followed for one year to see if they were re-arrested, and if so, how soon. The 62 variables are
- week: Week of first arrest after release or censoring; all censored observations are censored at 52 weeks.
- arrest: 1 if arrested, 0 if not arrested.
- fin: Financial aid: 0=no, 1=yes.
- age: Age: in years at time of release.
- race: Black or other.
- wexp: Full-time work experience before incarceration: 0=no, 1=yes.
- mar: Marital status at time of release: married or not married.
- paro: Released on parole? 0=no, 1=yes.
- prio: Number of convictions prior to current incarceration.
- educ: Level of education: 2 = 6th grade or less; 3 = 7th to 9th grade; 4 = 10th to 11th grade; 5 = 12th grade; 6 = some college.
- emp1: Employment status in the first week after release: 0=no, 1=yes.
- emp2: As above.
- . . .
- emp52: As above.
RECID.dat.txt: Rossi's recidivism data without the header, and with periods for missing values. This format is more suitable for analysis with SAS.
Rossi-ss.data.txt: Rossi's recidivism data in a start-stop format suitable for survival analysis with employment status as a time-dependent covariate. There is one line of data per week, for a total of 19,809 data lines.
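
To show what a start-stop layout looks like, here is a minimal Python sketch that expands one wide record (week, arrest, emp1, emp2, ...) into one row per week at risk. The function name, the output field names and the example subject are assumptions made for illustration; the actual columns of Rossi-ss.data.txt are not described here and may differ.

```python
def to_start_stop(subject_id, week, arrest, emp):
    """Expand one wide record into one row per week at risk.

    emp is a list in which emp[k] is the employment status in week k+1;
    only the first `week` entries are used.
    """
    rows = []
    for k in range(1, week + 1):
        # The event indicator is 1 only on the final row of an arrested subject.
        event = 1 if (arrest == 1 and k == week) else 0
        rows.append({"id": subject_id, "start": k - 1, "stop": k,
                     "arrest": event, "emp": emp[k - 1]})
    return rows

# Hypothetical subject arrested in week 3 (values invented for illustration).
rows = to_start_stop(subject_id=1, week=3, arrest=1, emp=[0, 1, 1])
for r in rows:
    print(r)
```

A subject censored at 52 weeks would contribute 52 rows, all with the event indicator equal to zero.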
liver.data.txt: In the liver disease data, patients were randomly assigned to one of two drugs, or to a placebo. The data file includes age and sex (1=F). Blood platelet count was recorded for each patient in each time period. This data set is in a start-stop format suitable for survival analysis with platelet count as a time-dependent covariate. These are simulated data.
xy.data.txt: Simulated data for simple regression through the origin. X is Poisson(3), epsilon is U(-15,15), and true beta = 2.
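Data of this kind could be simulated along the following lines. This is only a sketch in Python using the standard library, not the code that produced the file. The sample size, the seed and the helper names (rpois, simulate_xy, beta_hat_origin) are my own assumptions.

```python
import math
import random


def rpois(lam, rng):
    """Draw one Poisson(lam) value by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_xy(n, beta=2.0, seed=1):
    """Simulate y = beta*x + epsilon, x ~ Poisson(3), epsilon ~ U(-15, 15)."""
    rng = random.Random(seed)
    x = [rpois(3.0, rng) for _ in range(n)]
    y = [beta * xi + rng.uniform(-15.0, 15.0) for xi in x]
    return x, y


def beta_hat_origin(x, y):
    """Least squares slope for a line forced through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


x, y = simulate_xy(n=5000)  # n = 5000 is an arbitrary choice
print(beta_hat_origin(x, y))  # should be close to the true beta = 2
```

The estimator here is the usual one for regression through the origin, the sum of x*y divided by the sum of x^2.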
TinyWLS.data.txt: Another simulated data set like xy.data.txt. This time, true beta = 0.25, x is rounded U(1,5) and epsilon is normal, with variance proportional to x^2. The weighted least squares estimate of beta is mean(y/x).
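To see why the weighted least squares estimate reduces to mean(y/x): if the variance of epsilon is proportional to x^2, the weights are w = 1/x^2, and for regression through the origin the estimate is sum(w*x*y)/sum(w*x^2) = sum(y/x)/n. The Python sketch below checks this numerically on freshly simulated data. The sample size, the seed and the proportionality constant 0.5 are assumptions, not values taken from the file.

```python
import random


def wls_origin(x, y, w):
    """Weighted least squares slope for regression through the origin."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den


rng = random.Random(2)
n = 200  # hypothetical sample size
x = [round(rng.uniform(1.0, 5.0)) for _ in range(n)]
# Standard deviation proportional to x, so the variance is proportional to x^2.
y = [0.25 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]

w = [1.0 / (xi * xi) for xi in x]  # weights = 1/variance, up to a constant
beta_wls = wls_origin(x, y, w)
beta_mean_ratio = sum(yi / xi for xi, yi in zip(x, y)) / n
print(beta_wls, beta_mean_ratio)  # the two agree up to rounding error
```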
titanic.data.txt: The Titanic was a passenger ship that hit an iceberg and sank on its very first voyage in 1912. It was the largest passenger ship in the world at the time, and supposedly unsinkable. More than 1,500 of the roughly 2,200 passengers and crew died. This file is from an R data set. More details are given at the beginning of the file.
selfesteem3.data.txt: This is the selfesteem2 data set from the R datarium package (with cases re-ordered to allow a multivariate data read). "Data are the self esteem score of 12 individuals enrolled in 2 successive short-term trials (4 weeks) - control (placebo) and special diet trials. The self esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of the trials. The same 12 participants are enrolled in the two different trials with enough time between trials." This means that every subject was in all 6 conditions.
airquality.data.txt: This is the R data set airquality. It has daily readings of the following variables from May 1, 1973 to September 30, 1973, in New York City.
- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
- SolarRad: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.
teengamb.data.txt: Teen gambling data set from Faraway's Linear Models with R.
- sex: 0=male, 1=female
- status: Socioeconomic status score based on parents' occupation
- income: in pounds per week
- verbal: verbal score in words out of 12 correctly defined
- gamble: expenditure on gambling in pounds per year
bodymind.data.txt: Educational test scores and physical measurements for a sample of high school students. This is a modified subset of data reported in the journal Human Biology. The reference is
Clark, P. J., Vandenberg, S. G., and Proctor, C. H. (1961), On the relationship of scores on certain psychological tests with a number of anthropometric characters and birth order in twins, Human Biology, 33, 163-180.
The data are used without permission, but I believe they have been modified enough so that the original copyright no longer applies, and they can be protected under a Creative Commons license. The variables are
- sex: F or M
- progmat: Progressive matrices (puzzle) score
- reason: Reasoning score
- verbal: Verbal (reading and vocabulary) score
- headlng: Head Length in mm
- headbrd: Head Breadth in mm
- headcir: Head Circumference in mm
- bizyg: Bizygomatic breadth in mm, basically the width of the face across the cheekbones.
- weight: In pounds
- height: In cm.
Berkeley.data.txt
training.data.txt: Office workers at a large insurance company are randomly assigned to one of 3 computer use training programmes, and their number of calls to IT support during the following month is recorded. Additional information on each worker includes years of experience and score on a computer literacy test (out of 100). It is reasonable to model calls to IT support as a Poisson process, and the question is whether training programme affects the rate of the process.
deathpen3.data.txt: Prisoners who were convicted of murder in Florida were classified as either Black or White, their victims were either Black or White, and they either got the death penalty, or they did not. I have a reference for this -- American Sociological Review or something ...
choice.data.txt: In the Program Choice data, graduating grade eight students were choosing their High School program. The potential choices were Academic, General and Vocational. Predictor variables are gender, socioeconomic status, and scores on reading, writing, math and science standardized tests.
ltc.data.txt: LTC stands for Long Term Care. Operators of long-term care homes are very interested in whether their elderly residents are going to survive, because they need to plan. In one study, the variables for a sample of residents were
- One year survival (1=Yes, 0=No)
- Age in years
- Gender (1=F, 0=M)
- Indicator for dementia (1=Yes, 0=No)
To get a Yes for dementia, it has to be pretty serious, so that the person cannot safely go outside without supervision.
lognorm1.data.txt: Right censored data from a log-normal distribution. These are simulated data, with true μ = 0 and σ² = 1.
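For right-censored data like these, an uncensored observation contributes the log-normal density to the likelihood and a censored observation contributes the survival function P(T > t). The Python sketch below evaluates the resulting log likelihood. The function names and the toy values are hypothetical, and I am not assuming anything about the column layout of the file.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_loglik(mu, sigma, times, uncensored):
    """Log likelihood for right-censored log-normal data.

    times      -- observed times (event or censoring time), all > 0
    uncensored -- 1 if the time is an observed event, 0 if right censored
    """
    total = 0.0
    for t, d in zip(times, uncensored):
        z = (math.log(t) - mu) / sigma
        if d == 1:
            # Log density of the log-normal distribution at t.
            total += (-math.log(t) - math.log(sigma)
                      - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)
        else:
            # Log survival function, log P(T > t).
            total += math.log(1.0 - norm_cdf(z))
    return total


# Toy data (made up, not from the file): two events and one censored time.
times = [0.5, 1.0, 2.0]
uncensored = [1, 1, 0]
print(lognormal_loglik(0.0, 1.0, times, uncensored))
```

Maximizing this function over mu and sigma (numerically) would give the maximum likelihood estimates, which can then be compared with the true values.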
ColonCancer.data.txt: This is a subset of the colon data set from the survival package. A sample of advanced colon cancer patients had surgery that removed all detectable cancer. Patients were randomly assigned to one of three drug treatments:
- Obs: Just observed, without any drug.
- Lev: Levamisole, a low-toxicity compound previously used to treat worm infestations in animals
- Lev+5FU: A combination of levamisole and 5-FU, a "moderately" toxic chemotherapy agent.
The variables are
- rx: Drug treatment group
- sex: 0=Female, 1=Male
- age: Age in years
- nodes: Number of lymph nodes affected
- status: 0=Right censored, 1=Uncensored
- time: Time until censoring or recurrence of the cancer
area51.data.txt: If you spend time on the right social media sites, you will have heard of Area 51, a restricted region in the Nevada desert. It is widely believed that if you go hiking in Area 51, you have a good chance of being kidnapped by space aliens. There are indications that whether you are wearing a hat matters. To test this idea and in the interests of transparency, volunteers (there are plenty) went walking in the desert under controlled conditions.
Each day for up to 30 days, the volunteer was dropped off by helicopter at a random location in Area 51. The volunteer was either wearing a hat or not, determined by a coin toss at the beginning of each day. Area 51 is out of ordinary cell phone range, but the U.S. military has a cell phone tower there, enabling the experimenters to stay in contact with the volunteers, and to determine their exact location.
Kidnapping has a distinct signature. The volunteer's cell phone signal is suddenly interrupted. A helicopter is dispatched immediately to their last location and a search is initiated, but the volunteer is never found.
If a volunteer is not kidnapped on a particular day, then after exactly eight hours, the helicopter picks up the volunteer for transport back to the base. If the volunteer is not kidnapped within 30 days, the observation is censored. Censoring can occur earlier if the volunteer withdraws from the study, or suffers a medical emergency.
The file area51.data.txt contains a subset of the data from 200 volunteers, in a start-stop format. Variables include age, sex, an indicator for wearing a hat (which varies over time), event times, and a binary variable called taken, which equals one if a kidnapping occurred, and zero if there was no kidnapping. Event times are in minutes, with 8*60=480 minutes in an eight-hour day. The clock stops at the end of each eight-hour day, and re-starts when the volunteer is dropped off in the desert the next morning. Thanks to General Buck Turgidson for permission to use these data.
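The arithmetic of the study clock can be sketched as follows in Python. This assumes that event times accumulate across days, so that only minutes spent in the desert count. The function name study_clock is hypothetical, and the sketch says nothing about how the columns of the file are actually laid out.

```python
MINUTES_PER_DAY = 8 * 60  # 480 minutes in an eight-hour day


def study_clock(day, minutes_into_day):
    """Cumulative minutes at risk, for day = 1, 2, ..., 30."""
    if not 0 <= minutes_into_day <= MINUTES_PER_DAY:
        raise ValueError("minutes_into_day must be between 0 and 480")
    return (day - 1) * MINUTES_PER_DAY + minutes_into_day


print(study_clock(1, 480))   # 480: end of the first day
print(study_clock(3, 60))    # 1020: one hour into the third day
print(study_clock(30, 480))  # 14400: a volunteer never taken in 30 days
```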
expseries.data.txt: Three experimental treatments and a single response variable, but the data are collected in order one case at a time, so there might be a time series structure. These are simulated data.