Homework 1 Solutions

${\bf (1)}$ Define the following terms:

(a) Population - the collection of all possible persons, events, or objects of interest
(b) Sample - the subset of the population that we actually observe
(c) Parameter - A numerical characteristic of a population (alternatively: A function of the entire population)
(d) Statistic - A numerical characteristic of a sample (alternatively: A function of the sample)

${\bf (2)}$ Explain the difference between a qualitative (categorical) and a quantitative variable

A qualitative variable is a non-numeric characteristic, such as a name or label, while a quantitative variable is a numerical characteristic, such as a measurement or count

${\bf (3)}$ Explain the difference between a discrete and a continuous variable and give an example of each

A discrete variable is a quantitative variable that takes on only distinct whole number (integer) values, such as counts, whereas a continuous variable is a quantitative variable that can assume any value within a certain interval, such as temperature or height

${\bf (4)}$ Explain the difference between a nominal and an ordinal variable and give an example of each

A nominal variable is a qualitative variable whose categories have no inherent ordering, such as the names of cities or brands of shoes. An ordinal variable is a qualitative variable whose categories have an inherent ordering, such as a Likert score or education level.

${\bf (5)}$ At what age did women marry? A historian wants to estimate the average age at marriage of women in New England in the early 19th century. Within her state archives, she finds marriage records for the years $1800 - 1820$, which she treats as a sample of all marriage records from the early 19th century. The average age of the women in the records is $24.1$ years. Using the appropriate statistical method, she estimates that the average age of brides in early 19th-century New England was between $23.5$ and $24.7$ years.

(a) Which part of this example gives a descriptive summary of the data? When she computes a mean of $24.1$ years of age from the marriage records
(b) Which part of this example draws an inference about a population? The part where she estimates that the average age at marriage of all women in early 19th-century New England was between $23.5$ and $24.7$ years
(c) What population is the historian studying? All women who married in New England in the early 19th century
(d) The average age the historian computed from the historical records was 24.1 years of age. Is 24.1 years of age a statistic or a parameter? Why? A statistic. The quantity $24.1$ was computed from a sample of marriage records spanning the first $20$ years of the century

${\bf (6)}$ Consider the following data obtained from flipping a coin 15 times. A value of $H$ denotes the coin was heads whereas a value of $T$ denotes the coin flip was tails.

Coin flip result	H	H	H	T	H	T	T	T	H	H	T	T	T	H	T

Compute the frequency table for the variable “coin flip result” and answer the following questions:

Result	Frequency	Relative Frequency
H	7	7/15 = 0.47
T	8	8/15 = 0.53

(a) What type of variable is “Coin Flip Result”? Qualitative nominal variable
(b) What proportion of the coin flips were heads? $0.47$ or $47\%$
(c) Why do we not compute the cumulative relative frequency for this variable? The variable is nominal and therefore has no inherent ordering for summing up the relative frequencies
(d) What is the best graphical display for this variable and why? Since there are only two categories (heads or tails), a pie chart or bar graph is sufficient to show the distribution
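The frequency table above can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic; the variable names (`flips`, `counts`, `rel_freq`) are illustrative and not part of the assignment.

```python
from collections import Counter

# The 15 coin-flip results from the problem statement.
flips = "H H H T H T T T H H T T T H T".split()

counts = Counter(flips)
n = len(flips)

# Relative frequency = category count / total number of flips.
rel_freq = {result: counts[result] / n for result in sorted(counts)}

print("Result\tFrequency\tRelative Frequency")
for result in sorted(counts):
    print(f"{result}\t{counts[result]}\t{rel_freq[result]:.2f}")
```

The two relative frequencies sum to 1, which is a quick sanity check on any frequency table.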

${\bf (7)}$ A survey about color preferences reported the age distribution of the people who responded. Below are the results


Age Group (Years)	1-18	19-24	25-35	36-50	51-69	70 and over
Counts	10	97	70	36	14	5
Relative Frequency	0.04	0.42	0.30	0.16	0.06	0.02

Use this table to answer parts a - d

(a) Compute the relative frequency for each age group
(b) Make a bar graph where the heights of the bars are relative frequencies
(c) Describe the distribution. The distribution is slightly right skewed, with most respondents between 19 and 35 years of age
(d) Explain why your bar graph is not a histogram. A histogram has intervals of equal length, whereas the age groups here have unequal widths
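The relative frequencies in part (a) can be checked by dividing each count by the total number of respondents. The sketch below uses only the counts from the table, in the order given; the names `age_counts` and `rel_freq` are illustrative.

```python
# Counts for the six age groups, in the order listed in the table.
age_counts = [10, 97, 70, 36, 14, 5]

total = sum(age_counts)

# Relative frequency of each group, rounded to two decimals as in the table.
rel_freq = [round(c / total, 2) for c in age_counts]

print("Total respondents:", total)
print("Relative frequencies:", rel_freq)
```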

${\bf (8)}$ Email spam is the curse of the internet. The table below gives a compilation of the most common types of spam

Type of Spam	Percentage
Adult	14.5
Financial	16.2
Health	7.3
Leisure	7.8
Products	21.0
Scams	14.2

Use the table to answer the questions a and b

(a) Report the modal spam category. The modal category is “Products” because it has the highest percentage
(b) Construct a Pareto chart using the table above
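A Pareto chart is a bar chart whose categories are sorted from most to least frequent. As a sketch of the ordering step only (no plotting), the categories from the table can be sorted like this; the names `spam`, `ordered`, and `cumulative` are illustrative.

```python
# Percentages from the spam table in the problem statement.
spam = {
    "Adult": 14.5,
    "Financial": 16.2,
    "Health": 7.3,
    "Leisure": 7.8,
    "Products": 21.0,
    "Scams": 14.2,
}

# Pareto order: largest percentage first.
ordered = sorted(spam.items(), key=lambda item: item[1], reverse=True)

# Running total of the listed percentages, in Pareto order.
cumulative = []
running = 0.0
for category, pct in ordered:
    running += pct
    cumulative.append(round(running, 1))
    print(f"{category}\t{pct}\t{cumulative[-1]}")
```

The bars would be drawn left to right in this order. The listed percentages sum to 81, so a cumulative line based on this table alone tops out below 100.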

${\bf (9)}$ A farmer in Idaho is interested in the number of rainy days in a given year, so he records the number of rainy ($R$), cloudy ($C$), and sunny ($S$) days over two weeks in May. His observations are \[\{R,R,C,C,C,S,S,C,R,R,C,R,S,S\}\] Use the farmer's observations to answer the following questions.

(a) Based on the farmer's observations, what proportion of days were rainy? \[\hat{p}_{\text{rainy}} = \frac{5}{14} \approx 0.36\]
(b) Is the proportion in part (a) a statistic or a parameter? Why? It is a statistic because it was computed from a sample of 14 observations of the weather
(c) What population is the farmer studying? Is it a finite or infinite population? The population is all 365 days in a given year. It is finite
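The sample proportion in part (a) is the number of rainy days divided by the number of days observed. A minimal check, with illustrative variable names:

```python
# The farmer's 14 daily observations: R = rainy, C = cloudy, S = sunny.
observations = list("RRCCCSSCRRCRSS")

n_days = len(observations)
n_rainy = observations.count("R")

# Sample proportion of rainy days.
p_hat_rainy = n_rainy / n_days

print(f"{n_rainy}/{n_days} = {p_hat_rainy:.2f}")
```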

${\bf (10)}$ Consider the following data from a survey conducted on college students in the state of Florida. As part of their research, surveyors recorded the high school performance (measured in grade point average - GPA) of 35 college students from across the state. For your convenience the data have been sorted from least to greatest:

\[2.0, 2.1, 2.3, 2.8, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0,\] \[3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.3, 3.4, 3.4, 3.4,\] \[3.4, 3.5, 3.5, 3.5, 3.5, 3.6, 3.6, 3.7, 3.7, 3.8,\] \[3.8, 3.8, 3.8, 4.0, 4.0\]

The following frequency table gives the distribution of the variable high school GPA (with values rounded to nearest 0.5). Fill in the table and answer the following questions:

GPA	Frequency	Relative frequency	Cumulative RF
2.0	2	0.06	0.06
2.5	1	0.03	0.09
3.0	12	0.34	0.43
3.5	14	0.40	0.83
4.0	6	0.17	1.00

(a) What type of variable is GPA? Quantitative continuous
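(b) What proportion of college students had a high school GPA $> 3.0$? Using the rounded values in the frequency table, $0.40 + 0.17 = 0.57$
(c) What proportion of college students had a high school GPA $< 3.0$? Using the rounded values in the frequency table, $0.06 + 0.03 = 0.09$
(d) What is the mean GPA in this sample? \[\bar{x} = \frac{2.0+2.1+\dots+4.0+4.0}{35} \approx \frac{2(2.0)+1(2.5)+\dots+6(4.0)}{35} = 3.3\]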
(e) What is the median GPA in this sample? median $= 3.4$
(f) What is the mode of GPA in this sample?

Using the rounded values in the frequency table, the mode is the value with the highest relative frequency: $3.5$

Using the raw data, the mode is the most frequently observed value: $3.0$

(g) Construct a dot plot for the variable GPA (hint use the frequency table)
1. Dot plot using the rounded GPA values from the frequency table
2. Dot plot using the raw GPA values
(h) Construct a histogram for the variable GPA (hint use the frequency table)
1. Histogram using the rounded GPA values (e.g., 4 total bins) from the frequency table
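The frequency table and the summary statistics for this question can be checked with the sketch below. It assumes "rounded to nearest 0.5" means `round(2 * x) / 2`, which reproduces the table; all variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median, mode

# The 35 sorted GPA values from the problem statement.
gpa = [2.0, 2.1, 2.3, 2.8, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0,
       3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.3, 3.4, 3.4, 3.4,
       3.4, 3.5, 3.5, 3.5, 3.5, 3.6, 3.6, 3.7, 3.7, 3.8,
       3.8, 3.8, 3.8, 4.0, 4.0]
n = len(gpa)

# Round each value to the nearest 0.5 (no value here falls exactly on a tie).
rounded = [round(2 * x) / 2 for x in gpa]
freq = Counter(rounded)

# Frequency, relative frequency, and cumulative relative frequency.
print("GPA\tFrequency\tRelative frequency\tCumulative RF")
running = 0
for value in sorted(freq):
    running += freq[value]
    print(f"{value}\t{freq[value]}\t{freq[value] / n:.2f}\t{running / n:.2f}")

# Parts (b) and (c), computed from the rounded values.
prop_above_3 = sum(r > 3.0 for r in rounded) / n
prop_below_3 = sum(r < 3.0 for r in rounded) / n

# Parts (d) through (f).
raw_mean = mean(gpa)          # mean of the raw data
grouped_mean = mean(rounded)  # mean of the rounded values
raw_median = median(gpa)
raw_mode = mode(gpa)
rounded_mode = mode(rounded)

print(f"mean (raw) = {raw_mean:.2f}, mean (rounded) = {grouped_mean:.2f}")
print(f"median = {raw_median}, mode (raw) = {raw_mode}, mode (rounded) = {rounded_mode}")
```

The raw-data mean (about 3.27) and the mean of the rounded values (3.3) differ slightly because rounding moves individual observations, so the two expressions for the mean agree only approximately.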

${\bf (11)}$ Which statistic is more resistant to outliers, the mean or median? Why?

The median is more resistant to outliers because it depends only on the middle value (or the middle two values if $n$ is even). Therefore, how far an extreme value lies from the center does not influence the median. The mean, however, is computed using every value in the sample. This causes the mean to be pulled in the direction of extreme values.
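A small numerical illustration of this point, using a made-up five-value sample (hypothetical data, not from the assignment): replacing the largest value with an outlier changes the mean but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical sample, chosen only for illustration.
sample = [1, 2, 3, 4, 5]

# Same sample with the largest value replaced by an extreme value.
with_outlier = [1, 2, 3, 4, 50]

print("mean:  ", mean(sample), "->", mean(with_outlier))      # 3 -> 12
print("median:", median(sample), "->", median(with_outlier))  # 3 -> 3
```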

${\bf (12)}$ Describe the shape of the following distributions and for each distribution identify if the mean will be larger, smaller or the same as the median.

(a) Skewed right. The mean will be greater than the median because the mean is pulled toward the long right tail of the distribution

(c) Skewed left. The mean will be less than the median because the mean is pulled toward the long left tail of the distribution

${\bf (13)}$ Consider the following four sets of observations of a quantitative variable $x$. For your convenience the observations have been sorted in increasing order. Match datasets $1-4$ with the correct histogram (labeled $A - D$)

\[\text{Dataset 1} = \{0.1, 1.1, 2.6, 2.7, 3.4, 3.4, 4.1, 4.4, 8.8, 9.6\}\] \[\text{Dataset 2}= \{0.1, 0.3, 1.2, 2.4, 4.4, 4.5, 8.0, 8.9, 9.3, 9.3\}\] \[\text{Dataset 3} = \{1.1, 3.8, 5.3, 6.0, 6.2, 6.9, 7.9, 7.9, 8.1, 8.7\}\] \[\text{Dataset 4} = \{3.4, 4.5, 5.4, 5.6, 7.0, 8.5, 8.9, 9.2, 9.7, 9.7\}\]
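One way to approach the matching is to tally each dataset into bins and compare the bar heights against each histogram. The sketch below assumes five left-closed bins of width 2 on $[0, 10)$; the bin width and the helper `bin_counts` are assumptions for illustration, and the histograms in the assignment may use different bins.

```python
# The four datasets from the problem statement.
datasets = {
    1: [0.1, 1.1, 2.6, 2.7, 3.4, 3.4, 4.1, 4.4, 8.8, 9.6],
    2: [0.1, 0.3, 1.2, 2.4, 4.4, 4.5, 8.0, 8.9, 9.3, 9.3],
    3: [1.1, 3.8, 5.3, 6.0, 6.2, 6.9, 7.9, 7.9, 8.1, 8.7],
    4: [3.4, 4.5, 5.4, 5.6, 7.0, 8.5, 8.9, 9.2, 9.7, 9.7],
}

BIN_WIDTH = 2  # assumed bin width, not given in the assignment
N_BINS = 5     # bins [0, 2), [2, 4), [4, 6), [6, 8), [8, 10)


def bin_counts(values, width=BIN_WIDTH, n_bins=N_BINS):
    """Count how many values fall in each left-closed bin of the given width."""
    counts = [0] * n_bins
    for v in values:
        counts[int(v // width)] += 1
    return counts


for label, values in datasets.items():
    print(f"Dataset {label}: {bin_counts(values)}")
```

Under these assumed bins, the tallest bar sits in a different place for each dataset, which is usually enough to pair each dataset with its histogram.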

${\bf (14)}$ Consider the following set of 10 observations of a variable $X$ sorted from least to greatest: \[3.3, 3.8, 4.0, 4.8, 4.8, 5.1, 5.2, 5.6, 5.7, 6.9\] Use the data to answer parts a-b

${\bf (15)}$ Consider the following $n = 20$ observations of the sugar and sodium content of several popular cereal brands and answer questions a - g:

Brand	Sodium (mg)	Sugar (g)	Type
Frosted Mini Wheats	0	11	A
Raisin Bran	340	18	A
All Bran	70	5	A
Apple Jacks	140	14	C
Cap’n Crunch	200	12	C
Cheerios	180	1	C
Cinnamon Toast Crunch	210	10	C
Crackling Oat Bran	150	16	A
Fiber One	100	0	A
Frosted Flakes	130	12	C
Froot Loops	140	14	C
Honey Bunches of Oats	180	7	A
Honey Nut Cheerios	190	9	C
Life	160	6	C
Rice Krispies	290	3	C
Honey Smacks	50	15	A
Special K	220	4	A
Wheaties	180	4	A
Corn Flakes	200	3	A
Honeycomb	210	11	C
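The sugar frequency table for this question can be tallied directly from the Sugar (g) column above. The sketch below assumes the intervals are right-closed (for example, a cereal with exactly 4 g falls in "2 - 4"), which is the convention that reproduces the counts in the table; the variable names are illustrative.

```python
from bisect import bisect_left

# Sugar (g) for the 20 cereals, in the order they appear in the table.
sugar = [11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
         14, 7, 9, 6, 3, 15, 4, 4, 3, 11]
n = len(sugar)

# Upper edges of the right-closed bins; values above 16 go in the last bin.
edges = [2, 4, 6, 8, 10, 12, 14, 16]
labels = ["<= 2", "2 - 4", "4 - 6", "6 - 8", "8 - 10",
          "10 - 12", "12 - 14", "14 - 16", "> 16"]

# bisect_left returns the index of the first edge >= x, i.e. the bin (lo, hi].
counts = [0] * len(labels)
for x in sugar:
    counts[bisect_left(edges, x)] += 1

# Relative and cumulative relative frequencies.
rf = [c / n for c in counts]
crf = []
running = 0
for c in counts:
    running += c
    crf.append(running / n)

for label, c, r, cr in zip(labels, counts, rf, crf):
    print(f"{label} (g)\t{c}\t{r:.2f}\t{cr:.2f}")
```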

Sugar (g)	Frequency	RF(x)	CRF(x)
<= 2 (g)	2	0.10	0.10
2 - 4 (g)	4	0.20	0.30
4 - 6 (g)	2	0.10	0.40
6 - 8 (g)	1	0.05	0.45
8 - 10 (g)	2	0.10	0.55
10 - 12 (g)	4	0.20	0.75
12 - 14 (g)	2	0.10	0.85
14 - 16 (g)	2	0.10	0.95
> 16 (g)	1	0.05	1.00