データ解析基礎 - Basic Data Analysis

中間試験 - Midterm Examination

May 24, 2012

Problems and answers for Data Set 46

When you are asked to do a calculation, you do not need to compute the decimal equivalent of a fraction or radical (square root). Fractions should be reduced to lowest terms for convenience in grading. Radicals do not need to be reduced.

計算を行うときには分数または根数のままを書いてもよい。小数にする必要はない。ただし、分数の分母と分子は互いに素にすること。

For Problems 1 to 6, use Data Set A. (Each student receives a different data set. Make sure your Data Set ID is correctly entered in the space at the top of the page.) Data Set A is a data set of rice yield (productivity) in metric tonnes per hectare in Japan. Classify productivity according to the following subjective yield scale: x <= 5 is “poor,” 5 < x <= 6.5 is “average,” 6.5 < x <= 8 is “good,” and x > 8 is “excellent.”

問題１〜６にデータセットAを利用してください。（注意：皆に別のデータを用意する。必ずデータセットIDを確認すること。）データセットAは日本の米の生産性（t/ha）である。サイズを「主観的生産性」に以下の表によって区別する：x <= 5は「わるい」、5 < x <= 6.5は「平均」、6.5 < x <= 8は「よい」、そして x > 8は「優秀」。

Copy your data set in the space below:

ここにデータを写してください：

[7.4, 7.0, 7.5, 7.4, 6.8, 6.5, 6.7, 7.5, 6.4, 6.5]

Sort your raw data, and write the sorted data here.

データを順序に並べてここに書くこと。
```
[6.4, 6.5, 6.5, 6.7, 6.8, 7.0, 7.4, 7.4, 7.5, 7.5]
```
Convert your data to “subjective yields,” and enter the absolute, relative, and cumulative relative frequency distributions here. What is the median of your data? データを「主観的生産性」に変換して絶対頻度分布、相対頻度分布、または（相対）累積分布を書け。中央値（メディアン）はどれですか？
```
     Label      Up to      Value  Frequency   Relative        CDF
      poor          5       4.25          0        0.0        0.0
   average        6.5       5.75          3        0.3        0.3
      good          8       7.25          7        0.7        1.0
 excellent   infinity       8.75          0        0.0        1.0
```
The median is "good."
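The frequency table and the median category can be reproduced with a short script. This is only a sketch: the names `SCALE`, `classify`, and `table` are illustrative choices, not part of the exam, and the upper bounds are taken as inclusive, as in the scale definition.

```python
from collections import Counter

# Data Set A (rice yield, t/ha), copied from the exam.
yields = [7.4, 7.0, 7.5, 7.4, 6.8, 6.5, 6.7, 7.5, 6.4, 6.5]

# (label, inclusive upper bound) pairs for the subjective yield scale.
SCALE = [
    ("poor", 5.0),
    ("average", 6.5),
    ("good", 8.0),
    ("excellent", float("inf")),
]


def classify(x):
    """Return the first label whose inclusive upper bound is >= x."""
    for label, upper in SCALE:
        if x <= upper:
            return label


counts = Counter(classify(x) for x in yields)
n = len(yields)

# Rows of (label, absolute frequency, relative frequency, cumulative relative frequency).
table = []
cumulative_count = 0
for label, _ in SCALE:
    cumulative_count += counts[label]
    table.append((label, counts[label], counts[label] / n, cumulative_count / n))

# Median category: the first label whose cumulative relative frequency reaches 0.5.
median_label = next(label for label, _, _, cdf in table if cdf >= 0.5)

for row in table:
    print(row)
print("median category:", median_label)
```

If the cumulative frequency were exactly 0.5 at a category boundary, the median would fall between two categories and this simple rule would report only the lower one.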
For each “subjective yield,” choose a representative numerical yield. Enter the correspondence here. Explain why you chose each numerical value. 各「主観的生産性」に代表的値を選んで、ここに記入。それぞれの値を選んだ理由を説明せよ。
The values are entered in the table above. Reason: For the middle ranges, the middle of the range is chosen. For the end ranges (which are actually open-ended), the distance from the middle values is chosen to make the representative values evenly spaced (all differences are the same).
Using the representative yields you chose in the previous problem, compute the mean, variance, and standard deviation of the distribution of yields. Show your work (e.g., using a table like that in Homework 2's spreadsheet). 前の問題で選んだ代表的生産性と分布を用いてサイズ分布の平均値、分散、と標準偏差を計算せよ。計算方法を表すテーブルなどを含むこと。（たとえ、第２宿題のシートのようなもの。）
```
     Label          x       f(x)      xf(x)     x - mu   (x-mu)^2  (x-mu)^2f(x)
      poor       4.25       0.00     0.0000      -2.55       6.50     0.0000
   average       5.75       0.30     1.7250      -1.05       1.10     0.3307
      good       7.25       0.70     5.0750       0.45       0.20     0.1418
 excellent       8.75       0.00     0.0000       1.95       3.80     0.0000
   moments                           6.8000                           0.4725
```
The mean is μ = 6.8 = 34/5, the variance is σ^2 = 0.4725 = 189/400, and the standard deviation is σ = √0.4725 ≈ 0.6874.
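Since the exam asks for fractions in lowest terms, the moments can be checked with exact rational arithmetic. The sketch below uses the representative values 4.25, 5.75, 7.25, and 8.75 and the relative frequencies from the table; the dictionary name `distribution` is an illustrative choice.

```python
from fractions import Fraction
from math import sqrt

# label -> (representative value x, relative frequency f(x)), as exact fractions.
distribution = {
    "poor": (Fraction(17, 4), Fraction(0)),
    "average": (Fraction(23, 4), Fraction(3, 10)),
    "good": (Fraction(29, 4), Fraction(7, 10)),
    "excellent": (Fraction(35, 4), Fraction(0)),
}

# Mean: sum of x * f(x).
mean = sum(x * f for x, f in distribution.values())

# Variance: sum of (x - mean)^2 * f(x).
variance = sum((x - mean) ** 2 * f for x, f in distribution.values())

# Standard deviation as a float; the exact form is sqrt(189)/20.
std_dev = sqrt(variance)

print(mean, variance, std_dev)
```

Working in `Fraction` avoids the floating-point noise that appears when the same sums are computed with decimals.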
Pick one of the following three cases, and answer the question in the space provided below. 以下の状況説明から一つを選んで、下記の問を答えろ。
1. The data set of yields was derived from the 2010 harvests in 10 different fields in one town operated by a particular farmer using the same seed in each field. 生産性データはある農家の２０１０年の収穫を田んぼ別にした。ただし、全ての田んぼには同じ米の種類を蒔いた。
  Note that your answer to this kind of question should not depend on the data! This case is relatively difficult. One likely "hidden" factor is that the farmer knows about the productivity of the different fields, so even though he uses the same seed, he may put different amounts of effort into cultivating each field according to productivity. Another one is that the fields probably have different levels of exposure to animals, disease, and weather, so the losses from these causes will vary. Both of these factors suggest that the distribution of yields differs for each field.
2. The data set of yields was derived from the 2010 harvest for one representative farm in each of 10 prefectures in eastern Japan. 生産性データは関東の１０ヵ都県の代表農家の２０１０年の収穫で計算した。
  In this case, although the problem states the farms are representative for each prefecture, 2010 might not be a representative year. For major weather disasters such as typhoons, even though the farms are in ten different prefectures it seems likely that losses from weather conditions would be correlated (the observations are not independent).
3. The yield data was compiled for a particular farm in Tsukuba for each year from 2001 to 2010. 生産性データはあるつくば市にある農家の２００１〜２０１０年度の収穫データで計算した。
  In this case, we don't know that the farm was representative of other farms, or what aspects of management may have changed. If the farm is unrepresentative, its time series data might not be useful to assess trends at other farms. If management has changed, such as an older farmer retiring and passing on management to a successor, that would likely change productivity according to the skill of the new manager in ways that would not apply to other farms.
For your chosen case, give one example of a “hidden factor” relating different observations in the data set that could affect the way you interpret your statistics. Explain why this matters. 選択状況には「隠された要因」により観察間関係が現れ、統計量の解釈に影響を及ぼすことがある。その要因・関係をひとつを選んで説明せよ。
Sample answers are provided above with each variant.
For Problems 6 to 7, use Data Set B. (Each student receives a different data set.) Data Set B is a data set of examination scores on a 0-100 scale. 問題6〜7にデータセットBを利用してください。（注意：皆に別のデータを用意する。必ずデータセットIDを確認すること。）データセットBはある試験の点数データで、0〜100の範囲である。

Copy your data set here: ここにデータを写してください：
```
[67, 66, 74, 73, 92, 62, 60, 47, 65, 81]
```
Give the definition of median. Find the median of the raw data from Data Set B. Now, convert Data Set B to letter grades according to the usual scale, and enter a table containing the letter grade, the scale interval, the absolute frequency, the relative frequency, and the cumulative frequency distribution. What is the median of the distribution of letter grades? Compare it to the raw (point score) median. 中央値（メディアン）の定義を書け。データセットBの中央値を記入せよ。データセットBを普通のスケールでレターグレードに変換し、レターグレード、スケール範囲、絶対度数、相対度数、と（相対）累積頻度分布を表に記入すること。レターグレード分布の中央値を求めよ。点数のメディアンと比較せよ。
The median is the 50th-percentile value, that is, the value such that 50% of the distribution is larger than this value and 50% is smaller. Sorting the data set gives [47, 60, 62, 65, 66, 67, 73, 74, 81, 92], so the median is some number between 66 and 67. Some statisticians report both (the median values are 66 and 67), others take the average (the median value is 66.5). The letter grade distribution is

     Label      Up to      Value  Frequency   Relative        CDF
         D         60         55          2        0.2        0.2
         C         70         65          4        0.4        0.6
         B         80         75          2        0.2        0.8
         A        100         85          2        0.2        1.0

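The median of the letter grades is "C." Converting the raw median (66.5) to a letter grade also gives C, so the two agree at the grade level. However, the letter grade locates the median only to within its interval (60 < x <= 70, representative value 65), so it does not reproduce the point-score median exactly.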
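Both medians can be checked with the standard library. This is a sketch under the assumption that the grade boundaries are the inclusive upper bounds from the table (D up to 60, C up to 70, B up to 80, A above that); `letter_grade` is an illustrative helper name.

```python
from bisect import bisect_left
from statistics import median, median_high, median_low

# Data Set B (examination scores), copied from the exam.
scores = [67, 66, 74, 73, 92, 62, 60, 47, 65, 81]

# Inclusive upper bounds for D, C, B; anything above 80 is an A.
UPPER_BOUNDS = [60, 70, 80]
LETTERS = "DCBA"


def letter_grade(score):
    """Map a point score to a letter grade using inclusive upper bounds."""
    return LETTERS[bisect_left(UPPER_BOUNDS, score)]


# The two conventions mentioned in the answer: report both middle values, or average them.
low, high = median_low(scores), median_high(scores)
averaged = median(scores)

# Median of the letter grades: grade the sorted scores and take the two middle entries.
grades = [letter_grade(s) for s in sorted(scores)]
n = len(grades)
middle_grades = (grades[(n - 1) // 2], grades[n // 2])

print(low, high, averaged)
print(middle_grades, letter_grade(averaged))
```

Because both middle entries are C, the letter-grade median is unambiguous here even though the point-score median is not.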
Draw a histogram for the raw data set. Drawing a histogram involves a choice of division into cells of values. (Recall that a cell is a group of values that are close to each other.) Explain why you chose the cells you did. 点数データのヒストグラムを描け。ヒストグラムの作成には値の区間（セル、仕切り）の選択が必要だ。（区間は値の範囲だ。）区間の選択の理由を説明せよ。
```
     |XXX|          
     |XXX|          
|XXX||XXX||XXX||XXX|
|XXX||XXX||XXX||XXX|
--------------------
```
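The bar heights in the histogram above (2, 4, 2, 2) match the letter-grade cells, so the sketch below assumes those cells; a different choice of cells would be equally legitimate as long as it is explained. The variable names are illustrative.

```python
from bisect import bisect_left

# Data Set B (examination scores), copied from the exam.
scores = [67, 66, 74, 73, 92, 62, 60, 47, 65, 81]

# Assumed cells: inclusive upper bounds matching the letter-grade scale.
upper_bounds = [60, 70, 80, 100]

# Count how many scores fall into each cell.
counts = [0] * len(upper_bounds)
for s in scores:
    counts[bisect_left(upper_bounds, s)] += 1

# Build the text histogram from the top row down, one 5-character column per cell.
rows = []
for level in range(max(counts), 0, -1):
    rows.append("".join("|XXX|" if c >= level else "     " for c in counts))
rows.append("-" * (5 * len(counts)))

print("\n".join(rows))
```

Note that these cells have unequal widths (the first and last are wider), which a careful histogram would reflect by labelling the axis or scaling the bars.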
The Public Health Service studied the effects of smoking on health, in a large representative sample of households. They split the sample by gender and by age groups, then compared health of individuals within the same group who had different smoking histories. For both men and women, in all age groups, they found that those who had never smoked were on average somewhat healthier than current smokers. But the current smokers were much healthier than those who had recently stopped smoking. 厚生省が喫煙の健康への影響を調査するために多くの代表的家計を選び、男女・年齢別のグループに分けた。各グループのメンバーの喫煙状況を調べ、健康状態を比べた。男性でも、女性でも、各年齢でも、同じ結果が出た。平均的に喫煙歴史のない人は現在喫煙する人よりやや健康がよかったが、現在喫煙する人は最近タバコを止めた人より明らかに健康がよかったと言う。
1. Is this an observational study or an experimental study? Briefly explain how you know. この調査は観察的調査であるか、実験的調査ですか。その理由を簡単に説明せよ。
  This is purely an observational study. An experiment would require telling some people to smoke and others to stop smoking. That was not done.
2. Why did they study men and women, and the different age groups, separately? 男女・年令別で調査を行った理由を説明せよ。
  Probably because these variables modify the effect that smoking has on health. By studying each group separately, the study was able to compare similar people who differed only in their smoking history. (Note that this does not help control for unrecorded factors that might be correlated with health risks of smoking, such as obesity.)
3. The lesson seems to be that you shouldn't start smoking, but once you've started, you shouldn't stop. What do you think? この研究が教えることは「タバコを始めない方がよいが、吸う習慣になった場合には止めてはだめだ」でしょう。あなたの考えかたを説明しろ。
  That is not a valid inference from this data. The problem is that it is very hard to stop smoking; one needs a strong incentive. Being told by your doctor that smoking has destroyed your health and you must stop or you will die is much more motivating than if you are healthy and your doctor advises you to stop. Probably many of those who recently stopped had such motivation, and therefore were less healthy than those who "happily" continue to smoke.
A coin is tossed six times. Two (of the many) possible sequences of results are
```
    (i) H T T H T H      (ii) H H H H H H
```
(The coin must land H or T in the order given; H = heads, T = tails.) Which of the following statements is correct? Explain briefly. 硬貨を６回なげる。数多くの結果の順序の中で
```
    (i) H T T H T H      (ii) H H H H H H
```
の２つがある。（硬貨は書いた通りＨまたはＴにならなければならない。Ｈ＝表、Ｔ＝裏。）以下のケースの中から正しいのはどれであるか。その理由を簡単に説明せよ。
1. Sequence (i) is more likely. 順序 (i) の確率が高い。
2. Sequence (ii) is more likely. 順序 (ii) の確率が高い。
3. Both sequences are equally likely. 両順序の確率が等しい。
The correct answer is "both sequences are equally likely" because the tosses are independent. Since both heads and tails have the same probability of 1/2, each sequence has probability 1/(2*2*2*2*2*2) = 1/64.
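The claim can be verified by enumerating all ordered outcomes of six tosses of a fair coin. The sketch also counts the outcomes with exactly three heads in any order, which may be why sequence (i) "looks" more likely: it resembles a larger class of outcomes, even though the specific sequence is no more probable.

```python
from fractions import Fraction
from itertools import product

# All 2^6 ordered outcomes of six tosses; each is equally likely for a fair coin.
outcomes = list(product("HT", repeat=6))
p_each = Fraction(1, len(outcomes))

seq_i = tuple("HTTHTH")
seq_ii = tuple("HHHHHH")

# Each specific sequence appears exactly once among the outcomes.
p_i = p_each * outcomes.count(seq_i)
p_ii = p_each * outcomes.count(seq_ii)

# By contrast, "exactly three heads in any order" covers many sequences.
p_three_heads = p_each * sum(1 for o in outcomes if o.count("H") == 3)

print(p_i, p_ii, p_three_heads)
```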
Suppose you pick a child at random from an elementary school. Are the events “the child is in 2nd grade” and “the child is female” independent? Are they mutually exclusive? Explain. 小学校の１人の生徒をランダムに選ぶ。「２年生である」と「女性である」という事象を定義する。２つの事象は「independent」ですか。「mutually exclusive」ですか。その理由を説明すること。
The precise answer is, "We don't know, because it depends on the school." Let A = the event "the child is in 2d grade" and B = the event "the child is female". The two events are independent if P(A and B) = P(A)P(B). Since the child is picked at random, P(A) = (number of 2d grade students) / (number of all students), and P(B) = (number of female students) / (number of all students). Now consider the case (unusual, but I know of such a school!) where all of the second grade students are male (but there are some girls in other grades). Then P(A and B) = 0, but P(A)P(B) > 0. So they are not independent. On the other hand, in this case A and B are mutually exclusive. Of course normally we expect male and female to be half and half in all grades, so A and B should be "close" to independent, and similarly they would not be mutually exclusive. But the example shows that we can't be sure.
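The dependence on the school can be illustrated with two made-up enrollment tables. Both schools below are hypothetical, as is the helper name `analyze`; the point is only that the same definitions give different answers for different schools.

```python
from fractions import Fraction


def analyze(counts):
    """counts maps (grade, sex) -> number of students; returns the two properties."""
    total = sum(counts.values())
    p_a = Fraction(sum(n for (g, _), n in counts.items() if g == 2), total)
    p_b = Fraction(sum(n for (_, s), n in counts.items() if s == "F"), total)
    p_ab = Fraction(counts.get((2, "F"), 0), total)
    return {
        "independent": p_ab == p_a * p_b,   # P(A and B) = P(A)P(B)
        "mutually_exclusive": p_ab == 0,    # P(A and B) = 0
    }


# Hypothetical school 1: every grade has 10 boys and 10 girls.
balanced = {(g, s): 10 for g in (1, 2, 3) for s in ("M", "F")}

# Hypothetical school 2: the second grade is all male.
all_male_second = dict(balanced)
all_male_second[(2, "M")] = 20
all_male_second[(2, "F")] = 0

print(analyze(balanced))
print(analyze(all_male_second))
```

In the balanced school the events are independent but not mutually exclusive; in the second school the reverse holds.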
The leading general trading companies are generally considered to hire some of the best students in Japan. Take a new employee at random, and consider the events A = “the student's school was the University of Tsukuba,” and B = “the student has a TOEIC score above 750.” State as many facts as you can about Pr({}), Pr(A), Pr(B), Pr(A ∩ B), Pr(A ∪ B), and Pr(Ω), including comparing the probabilities of two events (e.g., Pr(A) < Pr(Ω)). 総合商社は日本のトップランナーを雇うと言われている。ランダムに新社員のひとりを選出し、A=「筑波大学出身」とB=「TOEICが７５０点以上」という事象を考察しよう。Pr({})、Pr(A)、Pr(B)、 Pr(A ∩ B)、Pr(A ∪ B)、Pr(Ω) についてできるだけ多くの事実を書け。事象の確率の比較を含む。（例： Pr(A) < Pr(Ω)。）
Pr({}) <= Pr(A) <= Pr(A ∪ B) <= Pr(Ω)
Pr({}) <= Pr(B) <= Pr(A ∪ B) <= Pr(Ω)
Pr(A ∩ B) <= Pr(A) <= Pr(Ω)
Pr(A ∩ B) <= Pr(B) <= Pr(Ω)
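These inequalities follow from set inclusion alone, so they hold for any choice of A and B. The sketch below checks them on a made-up group of 20 new employees; the membership of `A` and `B` is purely hypothetical and carries no information about real hiring.

```python
from fractions import Fraction

# Hypothetical sample space: 20 new employees, identified by number.
omega = set(range(1, 21))
A = {1, 2, 3}                    # hypothetical: University of Tsukuba graduates
B = {2, 3, 4, 5, 6, 7, 8, 9}     # hypothetical: TOEIC score above 750


def pr(event):
    """Probability of an event when one employee is picked uniformly at random."""
    return Fraction(len(event & omega), len(omega))


# Chains from the answer, extended with the intersection at the lower end.
assert pr(set()) <= pr(A & B) <= pr(A) <= pr(A | B) <= pr(omega)
assert pr(set()) <= pr(A & B) <= pr(B) <= pr(A | B) <= pr(omega)

# Two further standard facts: the endpoints and inclusion-exclusion.
assert pr(set()) == 0 and pr(omega) == 1
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)

print(pr(A), pr(B), pr(A & B), pr(A | B))
```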