Homework #6: The new statistics
Due: May 23, 2016 at 14:00.
Read and understand the following instructions on submission of
homework. If you do not follow them, you will not receive credit.
Write a plain text e-mail to me. This homework does contain special
symbols, but you can copy and paste them from the assignment if you
need them and don't know how to input them. (It does not use special
word processor features. Do not attach a word processor file such as
a Microsoft Word document or an Excel spreadsheet. I will not read them.) Give
the mail the subject "01CN101 Homework #6 by <your name>" in hankaku
romaji and send it to math-hw@turnbull.sk.tsukuba.ac.jp. (This
subject is helpful for automatically sorting incoming mail.)
Make sure that the body of the email contains your name and student
ID number.
You should automatically receive an acknowledgment of your submission
by email. Keep that mail. If I lose your mail for some reason,
it becomes your proof of homework submission. If you don't receive an
acknowledgment, you probably submitted to the wrong address.
Problems
In these problems, unless otherwise specified, you may use the
cumulative distribution, the mass function, or the density function,
as is convenient.
Problem 1
An airplane regularly flies from Tokyo to Sapporo. It takes off at
the same time every day, but it faces varying weather conditions. The
airline checks its records and computes that the equation for the time
to arrive is t = 92 + ε, where ε is distributed approximately
normally with mean 0 and standard deviation 4.5. Time units are
minutes.
- What is the domain model being used?
- What is the statistical model being used?
Explain each model in some detail.
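The arrival-time equation can be explored numerically. The following is
a minimal sketch, assuming ε is exactly normal (the problem says only
"approximately"). The function name prob_arrival_within and the
100-minute threshold are hypothetical, chosen for illustration only.

```python
from statistics import NormalDist

# Values from the problem statement: t = 92 + ε, ε ~ Normal(0, 4.5), in minutes.
MEAN_MINUTES = 92
noise = NormalDist(mu=0, sigma=4.5)

def prob_arrival_within(minutes):
    """Return P(t <= minutes) under the model t = 92 + ε."""
    # t <= minutes is the same event as ε <= minutes - 92.
    return noise.cdf(minutes - MEAN_MINUTES)

# Hypothetical threshold, not part of the assignment.
p_within_100 = prob_arrival_within(100)
print(f"P(t <= 100) = {p_within_100:.3f}")
```

Because the normal distribution is symmetric around its mean,
prob_arrival_within(92) is exactly one half.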
Problem 2
Here is some data about 10 college students:
Person      |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |     9 |    10 |
Gender      |     F |     F |     F |     F |     F |     M |     F |     F |     F |     F |
Age         |    23 |    20 |    21 |    20 |    19 |    18 |    20 |    22 |    18 |    22 |
Height (cm) | 161.7 | 169.5 | 159.2 | 159.4 | 165.1 | 166.4 | 153.8 | 163.3 | 164.0 | 168.3 |
Weight (kg) |  62.0 |  65.1 |  57.2 |  62.9 |  66.1 |  64.9 |  57.4 |  62.7 |  63.6 |  64.0 |
- Compute the joint frequency distribution of Gender and Age (not
cumulative).
- Compute the cumulative joint distribution of Height and Weight.
- Compute a histogram of heights for 5cm ranges from 150 to 170 cm.
Draw it as a bar graph.
- What is the percentile rank of the man's height?
- Compute the covariance of the Height and the Weight.
Explain what other statistics you need to compute first.
- Compute the correlation coefficient of the Height and the Weight.
- The BMI of a person is that person's weight in kilograms divided
by the square of their height in meters: BMI = W/H². Describe the
correct way to compute the average BMI of this group. Why is it
correct?
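The covariance and correlation calculations asked for above can be
sketched as follows. This is only an illustration: the lists xs and ys
are made-up numbers, not the student data, and the function names are
hypothetical. It assumes the population form of the covariance
(dividing by n). Dividing by n - 1 instead changes the covariance but
not the correlation coefficient.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    # The variance of a variable is its covariance with itself.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up illustrative data (NOT the table from this problem).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

print("covariance: ", covariance(xs, ys))
print("correlation:", round(correlation(xs, ys), 4))
```

Note that the means are computed first, then the deviations, and only
then the covariance. The correlation additionally needs both variances.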
Problem 3
If the covariance of two random variables is zero, the correlation
coefficient must be zero. Explain why.
Problem 4
- Define outlier as used in statistics.
For the following parts, refer to the distribution depicted below:
|_xxxoxoxxxoooxoooxo___...
a b
The left-to-right scale is quantitative and linear, increasing toward
the right. "x" and "o" denote the two types of qualitative outcome.
The underscore "_" indicates that there was no observation at that
level of the quantitative variable. The bar at the extreme left
labelled "a" indicates a theoretical lower limit on the quantitative
variable (values lower than a are impossible). The observation
labelled "b" will be referred to later. The ellipsis at the right
indicates that (1) there is no theoretical upper limit, and (2) the
absence of observed values continues forever to the right.
- Identify an outlier in the distribution above. (Choose the
"worst" outlier.)
- Based on "looking at the picture," give a procedure to compute the
outcome to be expected given a position on the line, based on the
observed distribution above. You have to predict either "x" or
"o" based on the position on the line. It does not have to be
based on any particular statistical concept, but do the best you
can.
- What does your procedure predict in the case of a new observation
at the level "b"?
- Explain what over-fitting (also called "over-training") is.
- The "1-nearest neighbor" classifier predicts that if an individual
is observed at "b", it will be an "x". Explain how this is an
example of "over-fitting".
- Explain how "over-fitting" can occur in calculating the mean.
(Note: This question was not discussed in class, and is intended
to challenge those who consider themselves adept at mathematics.
Questions like this that are "difficult extensions" of class
discussion will not be asked on the exam.)
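The nearest-neighbor idea mentioned above can be sketched in a few
lines. This is an illustration under stated assumptions, not a model
answer: it assumes each character in the picture is one unit to the
right of the previous one, the names PATTERN, OBSERVATIONS and
knn_predict are hypothetical, and position 13 is picked only as an
example (it is not claimed to be the position of "b").

```python
from collections import Counter

# The row of observations from the picture, without the bar at "a".
# Assumption: consecutive characters are one unit apart on the scale.
PATTERN = "_xxxoxoxxxoooxoooxo___"
OBSERVATIONS = [(pos, label) for pos, label in enumerate(PATTERN)
                if label in "xo"]

def knn_predict(position, k=1):
    """Predict "x" or "o" by majority vote among the k nearest observations."""
    # Sort by distance; break distance ties by preferring the leftmost point.
    nearest = sorted(OBSERVATIONS,
                     key=lambda obs: (abs(obs[0] - position), obs[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# With k=1 the prediction simply copies the single closest observation;
# with a larger (odd) k it reflects the neighborhood as a whole.
print(knn_predict(13, k=1))
print(knn_predict(13, k=5))
```

Comparing the two predictions at the same position shows how much a
single observation can dominate the result when k is 1.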
Problem 5
Consider collecting data about people.
- Give two examples of variables about people that are qualitative
(also called categorical) but not ordered or quantitative (cardinal).
- Give two examples of variables about people that are ordered
(also called ordinal) but not quantitative (cardinal).
- Give two examples of variables about people that are quantitative
(also called cardinal).
- Give the values for yourself of each variable you mentioned in
the three parts above.
- For each type of variable (qualitative, ordered, quantitative),
give one example of a statistical operation that may be
performed on that kind of data, and one example of a statistical
operation that should not be performed on that kind of data.
(Your answer may be "none" when you believe any statistical
operation is valid, respectively invalid, for that kind of data.)
Problem 6
- Briefly define descriptive statistics and inferential
statistics.
- The purely mathematical calculations for descriptive statistics
and inferential statistics are mostly the same. How can you tell
when someone is doing descriptive statistics and when it is
inferential statistics?
- There are two kinds of empirical variance, the population variance
and the sample variance. The difference is that the sample
variance uses the number of observations less one to correct for
bias. Is the sample variance a descriptive statistic or an
inferential statistic? How do you know? (This question requires
some knowledge of statistics not discussed in class, although a
native speaker of English can probably get the right answer
without knowing about statistics. It will not be asked on the
test.)
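The arithmetic difference between the two kinds of variance can be seen
with Python's standard statistics module. This is a minimal sketch on
made-up numbers (the list data is hypothetical). It shows only the
calculation and does not answer the question above.

```python
from statistics import pvariance, variance

# Made-up observations for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

pop_var = pvariance(data)    # sum of squared deviations divided by n
sample_var = variance(data)  # sum of squared deviations divided by n - 1

print("population variance:", pop_var)
print("sample variance:    ", sample_var)

# The two differ only by the factor n / (n - 1).
print("ratio:", sample_var / pop_var, "vs", n / (n - 1))
```

The factor n / (n - 1) approaches 1 as the number of observations
grows, so the two variances are nearly equal for large data sets.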