Homework #6: The new statistics

Due: May 23, 2016 at 14:00.

Read and understand the following instructions on submission of homework. If you do not follow them, you will not receive credit.

Write a plain text e-mail to me. This homework does contain special symbols, but you can copy and paste them from the assignment if you need them and don't know how to input them. (It does not use special word processor features. Do not attach a word processor file such as a Microsoft Word document or an Excel spreadsheet. I will not read them.) Give the mail the subject "01CN101 Homework #6 by <your name>" in hankaku romaji (half-width Latin letters) and send it to math-hw@turnbull.sk.tsukuba.ac.jp. (This subject is helpful for automatically sorting incoming mail.)

Make sure that the body of the email contains your name and student ID number.

You should automatically receive an acknowledgment of your submission by email. Keep that mail. If I lose your mail for some reason, the acknowledgment is your proof of homework submission. If you don't receive an acknowledgment, you probably submitted to the wrong address.

Problems

In these problems, unless otherwise specified, you may use the cumulative distribution, the mass function, or the density function, as is convenient.

Problem 1

An airplane regularly flies from Tokyo to Sapporo. It takes off at the same time every day, but it faces varying weather conditions. The airline checks its records and computes that the equation for the time to arrive is t = 92 + ε, where ε is distributed approximately normally with mean 0 and standard deviation 4.5. Time units are minutes.

What is the domain model being used?
What is the statistical model being used?

Explain each model in some detail.
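
To get a feel for the arrival-time equation in this problem, here is a minimal Python sketch that simulates it. Only the values 92 and 4.5 come from the problem statement; the sample size, the fixed seed, and all function and variable names are assumptions made for illustration. It is not an answer to the questions above.

```python
import random
import statistics

MEAN_MINUTES = 92.0  # constant term of t = 92 + ε (from the problem statement)
SD_MINUTES = 4.5     # standard deviation of ε in minutes (from the problem statement)


def simulate_arrival_times(n, seed=0):
    """Draw n flight times t = 92 + ε, with ε normal, mean 0, s.d. 4.5."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return [MEAN_MINUTES + rng.gauss(0.0, SD_MINUTES) for _ in range(n)]


times = simulate_arrival_times(10_000)  # sample size is an assumption
print(round(statistics.mean(times), 1))   # should be close to 92
print(round(statistics.stdev(times), 1))  # should be close to 4.5
```

With a large sample, the simulated mean and standard deviation should land near the values in the equation, though never exactly on them.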

Problem 2

Here is some data about 10 college students:

Person	1	2	3	4	5	6	7	8	9	10
Gender	F	F	F	F	F	M	F	F	F	F
Age	23	20	21	20	19	18	20	22	18	22
Height (cm)	161.7	169.5	159.2	159.4	165.1	166.4	153.8	163.3	164.0	168.3
Weight (kg)	62.0	65.1	57.2	62.9	66.1	64.9	57.4	62.7	63.6	64.0

Compute the joint frequency distribution of Gender and Age (not cumulative).
Compute the cumulative joint distribution of Height and Weight.
Compute a histogram of heights using 5 cm ranges from 150 cm to 170 cm. Draw it as a bar graph.
What is the percentile rank of the man's height?
Compute the covariance of the Height and the Weight. Explain what other statistics you need to compute first.
Compute the correlation coefficient of the Height and the Weight.
The BMI of a person is that person's weight in kilograms divided by the square of their height in meters: BMI = W/H². Describe the correct way to compute the average BMI of this group. Why is it correct?
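
The calculations named in this problem can be sketched in Python as follows. This is an illustration under stated assumptions, not a solution: the four-person data set is hypothetical (it is not the table above), the covariance divides by n (the population form), the histogram bins include their left edge and exclude their right edge, and all function names are my own.

```python
from collections import Counter
from math import sqrt

# Hypothetical data for illustration only; NOT the table from this problem.
gender = ["F", "M", "F", "M"]
age = [20, 21, 20, 20]
height_cm = [160.0, 170.0, 165.0, 175.0]
weight_kg = [55.0, 66.0, 59.0, 70.0]


def joint_frequency(xs, ys):
    """Count how often each (x, y) pair occurs (not cumulative)."""
    return Counter(zip(xs, ys))


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def histogram(values, low, high, width):
    """Count values in bins [low, low+width), [low+width, low+2*width), ..."""
    counts = [0] * int((high - low) / width)
    for v in values:
        if low <= v < high:  # values outside [low, high) are ignored
            counts[int((v - low) // width)] += 1
    return counts


def bmi(weight_in_kg, height_in_cm):
    """BMI = W / H**2 with W in kilograms and H converted to meters."""
    return weight_in_kg / (height_in_cm / 100.0) ** 2


print(joint_frequency(gender, age)[("F", 20)])  # 2
print(covariance(height_cm, weight_kg))         # 32.5
print(correlation(height_cm, weight_kg))        # slightly below 1
print(histogram(height_cm, 160, 180, 5))        # [1, 1, 1, 1]
```

Note that `bmi` converts centimeters to meters before squaring; forgetting the conversion is an easy mistake with heights recorded in centimeters.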

Problem 3

If the covariance of two random variables is zero, the correlation coefficient must be zero. Explain why.

Problem 4

Define outlier as used in statistics.

For the following parts, refer to the distribution depicted below:

|_xxxoxoxxxoooxoooxo___...
a                 b

The left-to-right scale is quantitative and linear, increasing toward the right. "x" and "o" denote the two types of qualitative outcome. The underscore "_" indicates that there was no observation at that level of the quantitative variable. The bar at the extreme left labelled "a" indicates a theoretical lower limit on the quantitative variable (values lower than a are impossible). The observation labelled "b" will be referred to later. The ellipsis at the right indicates that (1) there is no theoretical upper limit, and (2) the absence of observed values continues forever to the right.

Identify an outlier in the distribution above. (Choose the "worst" outlier.)
Based on "looking at the picture," give a procedure to compute the outcome to be expected given a position on the line, based on the observed distribution above. You have to predict either "x" or "o" based on the position on the line. It does not have to be based on any particular statistical concept, but do the best you can.
What does your procedure predict in the case of a new observation at the level "b"?
Explain what over-fitting (also called "over-training") is.
The "1-nearest neighbor" classifier predicts that if an individual is observed at "b", it will be an "x". Explain how this is an example of "over-fitting".
Explain how "over-fitting" can occur in calculating the mean. (Note: This question was not discussed in class, and is intended to challenge those who consider themselves adept at mathematics. Questions like this that are "difficult extensions" of class discussion will not be asked on the exam.)
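
For reference, the "1-nearest neighbor" rule mentioned above can be sketched in Python. The observations below are hypothetical and are not read off the picture. The windowed majority rule is only one of many possible comparison procedures, and its radius is an arbitrary assumption. All function names are my own.

```python
from collections import Counter

# Hypothetical (position, outcome) pairs; NOT the distribution in the picture.
observations = [
    (1.0, "x"), (2.0, "x"), (3.0, "o"),
    (4.0, "x"), (5.0, "o"), (6.0, "o"),
]


def nearest_neighbor_predict(position, data):
    """1-nearest neighbor: copy the outcome of the single closest observation."""
    nearest = min(data, key=lambda obs: abs(obs[0] - position))
    return nearest[1]


def window_majority_predict(position, data, radius):
    """Predict the most common outcome among observations within `radius`.

    Returns None if no observation falls inside the window.
    """
    labels = [label for pos, label in data if abs(pos - position) <= radius]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


print(nearest_neighbor_predict(4.1, observations))      # x
print(window_majority_predict(4.1, observations, 1.5))  # o
```

In this made-up data the two rules disagree near position 4: the nearest-neighbor rule follows the single closest point, while the windowed rule looks at several nearby points.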

Problem 5

Consider collecting data about people.

Give two examples of variables about people that are qualitative (also called categorical) but not ordered or quantitative (cardinal).
Give two examples of variables about people that are ordered (also called ordinal) but not quantitative (cardinal).
Give two examples of variables about people that are quantitative (also called cardinal).
Give the values for yourself of each variable you mentioned in the first three parts (a, b, and c).
For each type of variable (qualitative, ordered, quantitative), give one example of a statistical operation that may be performed on that kind of data, and one example of a statistical operation that should not be performed on that kind of data. (Either answer may be "none": if you believe every statistical operation is valid for that kind of data, there is none that should not be performed; if you believe every operation is invalid, there is none that may be performed.)

Problem 6

Briefly define descriptive statistics and inferential statistics.
The purely mathematical calculations for descriptive statistics and inferential statistics are mostly the same. How can you tell when someone is doing descriptive statistics and when it is inferential statistics?
There are two kinds of empirical variance, the population variance and the sample variance. The difference is that the sample variance uses the number of observations less one to correct for bias. Is the sample variance a descriptive statistic or an inferential statistic? How do you know? (This question requires some knowledge of statistics not discussed in class, although a native speaker of English can probably get the right answer without knowing about statistics. It will not be asked on the test.)
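
The arithmetic difference between the two variances can be seen in a short Python sketch. The data set is hypothetical; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample variance functions. The sketch shows only the calculation and does not answer the question above.

```python
import statistics

# Hypothetical observations, chosen so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n                                 # 5.0
sum_sq_dev = sum((x - mean) ** 2 for x in data)      # 32.0

population_variance = sum_sq_dev / n                 # divide by n
sample_variance = sum_sq_dev / (n - 1)               # divide by n - 1

print(population_variance)  # 4.0
print(sample_variance)      # 32/7, about 4.57

# The standard library agrees with the hand calculation.
print(statistics.pvariance(data))  # 4.0
print(statistics.variance(data))   # 32/7, about 4.57
```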
