Define the following terms/concepts
a.) Describe the central limit theorem and its significance to the sampling distribution for a sample proportion or mean - The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean or sample proportion approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided the population variance is finite). Its significance is that, for a sufficiently large sample, normal-based confidence intervals and significance tests for a mean or proportion remain approximately valid even when the population itself is not normal.
b.) Margin of error - The margin of error is a measure of the uncertainty or precision associated with estimating a population parameter (such as a population mean or proportion). It quantifies the amount by which a sample estimate might deviate from the true population parameter.
c.) Describe the “basic form” of a confidence interval - Point estimate \(\pm\) Standard score \(\times\) Standard error which can also be written as Point estimate \(\pm\) Margin of error
d.) Confidence level - the probability that the interval will contain the true parameter value before the data are gathered
e.) Critical value - The point on the distribution of the test statistic that defines the set of values that call for rejecting the null hypothesis, i.e., the value that marks the boundary of the rejection region of a test
f.) When are \(t\)-distribution and standard normal distribution approximately the same shape? - The \(t\)-distribution and the standard normal distribution are approximately the same shape when the sample size is large, typically considered to be greater than 30.
g.) How are the degrees of freedom of a \(t\)-distribution related to its shape? - As the degrees of freedom increase, the \(t\)-distribution becomes more similar to the standard normal distribution, with lighter tails.
h.) Significance level - The significance level is a predetermined threshold used in statistical hypothesis testing to determine the level of evidence required to reject the null hypothesis
i.) Rejection region - The area in the tail (or tails) of the distribution of the test statistic for which a result is deemed statistically significant
\(\bf (1)\) Use the appropriate table to find the probability of each \(z\) or \(t\) score.
a.) \(P(Z > 1.3)\) \(=1-P(Z\leq 1.3) = 0.097\)
b.) \(P(Z \leq 0.33)\) \(=0.629\)
c.) \(P(Z \leq -2.1)\) \(=0.018\)
d.) \(P(t > 1.97)\) with 5 degrees of freedom \(= 1 - P(t \leq 1.97) = 0.053\)
e.) \(P(t \leq 2.9)\) with 2 degrees of freedom \(= 0.949\)
f.) \(P(t \geq -0.98)\) with 8 degrees of freedom \(= 1 - P(t < -0.98) = 0.822\)
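The table lookups above can be cross-checked in base R. This is only a sketch using `pnorm()` and `pt()` from the `stats` package rather than the printed tables the problem asks for; the variable names are illustrative.

```r
# Standard normal probabilities: pnorm(z) returns P(Z <= z)
p_a <- 1 - pnorm(1.3)         # a.) P(Z > 1.3)
p_b <- pnorm(0.33)            # b.) P(Z <= 0.33)
p_c <- pnorm(-2.1)            # c.) P(Z <= -2.1)

# t probabilities: pt(x, df) returns P(t <= x) for the given degrees of freedom
p_d <- 1 - pt(1.97, df = 5)   # d.) P(t > 1.97), 5 degrees of freedom
p_e <- pt(2.9, df = 2)        # e.) P(t <= 2.9), 2 degrees of freedom
p_f <- 1 - pt(-0.98, df = 8)  # f.) P(t >= -0.98), 8 degrees of freedom

# Round to three decimals for comparison with the table-based answers
print(round(c(a = p_a, b = p_b, c = p_c, d = p_d, e = p_e, f = p_f), 3))
```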
\(\bf (2)\) Fill in the table below with the corresponding \(\alpha\) level and critical value for each confidence interval for a population proportion.
Confidence Level | \(\alpha\) | critical value: \(z_{1-\alpha/2}\) |
---|---|---|
\(90\%\) | \(\color{red}{0.1}\) | \(\color{red}{1.645}\) |
\(88\%\) | \(\color{red}{0.12}\) | \(\color{red}{1.555}\) |
\(97\%\) | \(\color{red}{0.03}\) | \(\color{red}{2.17}\) |
\(96\%\) | \(\color{red}{0.04}\) | \(\color{red}{2.054}\) |
\(78\%\) | \(\color{red}{0.22}\) | \(\color{red}{1.227}\) |
\(99\%\) | \(\color{red}{0.01}\) | \(\color{red}{2.576}\) |
\(\bf (3)\) Fill in the table below with the corresponding \(\alpha\) level and critical value for each confidence interval for a population mean.
Confidence Level | Sample size | \(\alpha\) | critical value: \(t_{n-1, 1-\alpha/2}\) |
---|---|---|---|
\(92\%\) | \(n = 5\) | \(\color{red}{0.08}\) | \(\color{red}{2.333}\) |
\(86\%\) | \(n = 8\) | \(\color{red}{0.14}\) | \(\color{red}{1.664}\) |
\(98\%\) | \(n = 11\) | \(\color{red}{0.02}\) | \(\color{red}{2.764}\) |
\(99.9\%\) | \(n = 14\) | \(\color{red}{0.001}\) | \(\color{red}{4.221}\) |
\(82\%\) | \(n = 40\) | \(\color{red}{0.18}\) | \(\color{red}{1.365}\) |
\(95\%\) | \(n = 13\) | \(\color{red}{0.05}\) | \(\color{red}{2.179}\) |
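The critical values in the tables for problems (2) and (3) can be reproduced with the quantile functions `qnorm()` and `qt()`. The sketch below assumes two-sided intervals, so each critical value is the \(1-\alpha/2\) quantile; the vector names are illustrative.

```r
# Problem (2): z critical values z_{1 - alpha/2} for a population proportion
conf_z  <- c(0.90, 0.88, 0.97, 0.96, 0.78, 0.99)
alpha_z <- 1 - conf_z
z_crit  <- qnorm(1 - alpha_z / 2)

# Problem (3): t critical values t_{n-1, 1 - alpha/2} for a population mean
conf_t  <- c(0.92, 0.86, 0.98, 0.999, 0.82, 0.95)
n       <- c(5, 8, 11, 14, 40, 13)
alpha_t <- 1 - conf_t
t_crit  <- qt(1 - alpha_t / 2, df = n - 1)

print(data.frame(conf_z, alpha = round(alpha_z, 3), z_crit = round(z_crit, 3)))
print(data.frame(conf_t, n, alpha = round(alpha_t, 3), t_crit = round(t_crit, 3)))
```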
\(\bf (4)\) A survey conducted by the state of Georgia surveyed college students as part of a larger effort to characterize student behavior (Agresti and Franklin 2007). One of the questions in the survey asked students to report their political party affiliation. The table below summarizes the proportion of students who reported being Republican, Democrat, or Independent.
Political Party Affiliation | Count | \(\hat{p}\) |
---|---|---|
Democrat | 8 | 0.14 |
Republican | 36 | 0.61 |
Independent | 15 | 0.25 |
Using the table above, construct a \(90\%\) confidence interval for the proportion of Independent college student voters in the state of Georgia.
sample.size = (8+36+15)
estimate = 15/sample.size
confidence.lvl = 0.9
CI = one.sample.prop.CI(phat = estimate,
n = sample.size,
alpha = 1-confidence.lvl,
verbose = T)
## [1] --------------------------------------------------
## [1] Estimate = 0.2542
## [1] Critical Value = 1.6449
## [1] Estimated Standard Error = 0.0567
## [1] Margin of error = 0.0932
## [1] 90 % CI = [0.161,0.3475]
## [1] --------------------------------------------------
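The helper `one.sample.prop.CI` is not defined in this document. The sketch below shows one way the same numbers could be obtained in base R, assuming the usual large-sample interval \(\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\). The function name `prop_ci_sketch` is hypothetical and is not the course's implementation.

```r
# Hypothetical stand-in for one.sample.prop.CI (assumed large-sample interval)
prop_ci_sketch <- function(phat, n, alpha) {
  z  <- qnorm(1 - alpha / 2)          # critical value z_{1 - alpha/2}
  se <- sqrt(phat * (1 - phat) / n)   # estimated standard error of phat
  me <- z * se                        # margin of error
  c(estimate = phat, critical = z, se = se, margin = me,
    lower = phat - me, upper = phat + me)
}

# Problem (4): 15 Independents out of 59 respondents, 90% confidence
ci4 <- prop_ci_sketch(phat = 15 / 59, n = 59, alpha = 0.10)
print(round(ci4, 4))
```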
\(\bf (5)\) As part of an effort to improve health services, a health club in the United Kingdom asked club members to complete an online survey (de Vries 2023). One of the survey questions asked customers to rate their satisfaction level on a Likert scale. The results of this survey question are summarized below.
Response | Count | \(\hat{p}\) |
---|---|---|
Completely dissatisfied | 0 | 0.00 |
Very dissatisfied | 0 | 0.00 |
Dissatisfied | 3 | 0.01 |
Neutral | 20 | 0.09 |
Satisfied | 64 | 0.30 |
Very satisfied | 109 | 0.51 |
Completely satisfied | 19 | 0.09 |
Using the table above, construct a \(95\%\) confidence interval for the proportion of customers that are “very satisfied” or more.
sample.size = (3+20+64+109+19)
very.satisfied = 109
completely.satisfied = 19
estimate = (very.satisfied+completely.satisfied)/sample.size
confidence.lvl = 0.95
CI = one.sample.prop.CI(phat = estimate,
n = sample.size,
alpha = 1-confidence.lvl,
verbose = T)
## [1] --------------------------------------------------
## [1] Estimate = 0.5953
## [1] Critical Value = 1.96
## [1] Estimated Standard Error = 0.0335
## [1] Margin of error = 0.0656
## [1] 95 % CI = [0.5297,0.661]
## [1] --------------------------------------------------
\(\bf (6)\) The following table summarizes variables from a random sample of \(25\) observations from a dataset concerning housing in California taken from the 1990 census. The full dataset is featured in Aurélien Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and TensorFlow’ (Géron 2022). Use the table to answer parts (a) and (b).
Variable | Sample Mean | Sample Standard Deviation |
---|---|---|
Median Age of House Owner | 24.28 | 13.56 |
Total Rooms (per block) | 3263.72 | 2105.99 |
Total Bedrooms (per block) | 640.52 | 380.55 |
Population of District | 1861.68 | 1193.30 |
Number of Households (per block) | 584.24 | 330.38 |
Median Family Income (in tens of thousands of USD) | 3.74 | 1.54 |
Median House value (USD) | 179320.04 | 109450.56 |
sample.size = 25
sample.mean = 3.74*10000 #converting estimate to USD
sample.sd = 1.54*10000 #converting standard deviation to USD
confidence.lvl = 0.95
CI = one.sample.t.CI(xbar = sample.mean,
s = sample.sd,
n = sample.size,
alpha = 1 - confidence.lvl,
verbose = T)
## [1] --------------------------------------------------
## [1] Estimate = 37400
## [1] Critical Value = 2.0639
## [1] Estimated Standard Error = 3080
## [1] Margin of error = 6356.8076
## [1] 95 % CI = [31043.1924,43756.8076]
## [1] --------------------------------------------------
# At the 95% confidence level, we estimate the average median family income in California in 1990 to be no less than $31,043 and no more than $43,757.
sample.size = 25
sample.mean = 640.52
sample.sd = 380.55
confidence.lvl = 0.99
CI = one.sample.t.CI(xbar = sample.mean,
s = sample.sd,
n = sample.size,
alpha = 1 - confidence.lvl,
verbose = T)
## [1] --------------------------------------------------
## [1] Estimate = 640.52
## [1] Critical Value = 2.7969
## [1] Estimated Standard Error = 76.11
## [1] Margin of error = 212.8751
## [1] 99 % CI = [427.6449,853.3951]
## [1] --------------------------------------------------
# At the 99% confidence level, we estimate the average number of bedrooms per block in California in 1990 to be no less than 427.6 and no more than 853.4
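The helper `one.sample.t.CI` is likewise not defined here. A minimal base R sketch, assuming the standard interval \(\bar{x} \pm t_{n-1,\,1-\alpha/2}\, s/\sqrt{n}\), reproduces both intervals above; `t_ci_sketch` is a hypothetical name, not the course's function.

```r
# Hypothetical stand-in for one.sample.t.CI (assumed one-sample t interval)
t_ci_sketch <- function(xbar, s, n, alpha) {
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # critical value t_{n-1, 1 - alpha/2}
  se     <- s / sqrt(n)                    # estimated standard error of xbar
  me     <- t_crit * se                    # margin of error
  c(critical = t_crit, se = se, margin = me,
    lower = xbar - me, upper = xbar + me)
}

# 95% interval for median family income (converted to USD)
ci_income <- t_ci_sketch(xbar = 3.74 * 10000, s = 1.54 * 10000, n = 25, alpha = 0.05)
# 99% interval for total bedrooms per block
ci_bedrooms <- t_ci_sketch(xbar = 640.52, s = 380.55, n = 25, alpha = 0.01)

print(round(ci_income, 4))
print(round(ci_bedrooms, 4))
```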
\(\bf (7)\) According to a 2023 study published by the Saudi Heart Association, a survey of college students from Taibah University in Medina, Saudi Arabia, found that \(24\%\) of students reported using e-cigarettes (Alzahrani et al. 2023). Assuming a comparable rate of e-cigarette use in the U.S., what sample size is required to estimate the proportion of college students who use e-cigarettes at the \(95\%\) confidence level and within a margin of error of \(1.5\%\)?
m = 0.015 #desired margin of error
p = 0.24 #expected proportion of e-cigarette use based on Saudi Study
confidence.lvl = 0.95
alpha = 1 - confidence.lvl
z = qnorm(1 - alpha/2) #standard score for confidence level of 95%
n = round((z^2*p*(1-p))/m^2) # compute sample size - round to nearest whole number
# a sample size of approximately
print(paste0('n = ', n, ' observations'))
## [1] "n = 3114 observations"
# are needed to achieve the desired confidence level and margin of error
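The formula used in the code comes from setting the margin of error of the large-sample interval for a proportion equal to the target \(m\) and solving for \(n\): \[ m = z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} \quad \Longrightarrow \quad n = \frac{z_{1-\alpha/2}^{2}\, p(1-p)}{m^{2}} \] With \(z_{0.975} \approx 1.96\), \(p = 0.24\), and \(m = 0.015\), this gives \(n \approx 3114.1\). The code rounds to the nearest whole number; a common, more conservative convention is to round up so that the margin of error is guaranteed not to exceed \(m\), which here would give \(3115\).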
\(\bf (8)\) Recent concerns over the rise in global infertility have drawn significant attention to the harmful effects of microplastics, PFAS, and highly processed foods (Zhang et al. 2022; Agarwal et al. 2015). A study published in 2002 found that approximately \(15\%\) of couples worldwide experienced infertility (Sharlip et al. 2002). Assuming a similar rate of infertility in the U.S., compute the sample size needed to estimate the proportion of infertile U.S. couples at the \(99\%\) confidence level and within a margin of error of \(4\%\).
m = 0.04 #desired margin of error
p = 0.15 #expected proportion of infertile couples based on previous studies
confidence.lvl = 0.99
alpha = 1 - confidence.lvl
z = qnorm(1 - alpha/2) #standard score for confidence level of 99%
n = round((z^2*p*(1-p))/m^2) # compute sample size - round to nearest whole number
# a sample size of approximately
print(paste0('n = ', n, ' observations'))
## [1] "n = 529 observations"
# are needed to achieve the desired confidence level and margin of error
\(\bf (9)\) Describe the assumptions associated with estimating \(p\) with \(\hat{p}\)
Random Sampling: The sample must be selected randomly from the population of interest. Random sampling ensures that each member of the population has an equal chance of being included in the sample, allowing for unbiased estimation of the population proportion.
Independence: The observations in the sample must be independent of each other. Independence means that the occurrence or value of one observation does not influence the occurrence or value of another observation in the sample.
Large Sample Size (for Asymptotic Normality): For certain statistical methods, such as constructing confidence intervals or conducting hypothesis tests, the sample size should be sufficiently large. While there is no fixed threshold for what constitutes “large enough,” a common rule of thumb is that \(np\geq 15\) and \(n(1-p)\geq 15\) to ensure the validity of results.
\(\bf (10)\) Describe the assumptions associated with estimating \(\mu\) with \(\bar{x}\)
Random Sampling: The sample must be selected randomly from the population of interest. Random sampling ensures that each member of the population has an equal chance of being included in the sample, allowing for unbiased estimation of the population mean.
Independence: The observations in the sample must be independent of each other. Independence means that the occurrence or value of one observation does not influence the occurrence or value of another observation in the sample.
Normality of Population Distribution (for Small Sample Sizes): For small sample sizes (typically less than 30), it is assumed that the population from which the sample is drawn follows a normal distribution. This assumption is necessary for the validity of certain statistical inference methods, such as constructing confidence intervals or conducting hypothesis tests. However, the Central Limit Theorem suggests that for larger sample sizes, the distribution of the sample mean approaches normality regardless of the shape of the population distribution.
\(\bf (11)\) A pharmaceutical company produces a generic version of the pain reliever ibuprofen, marketing a tablet with a 200 milligram dose. Concerned about the accuracy of the dosing process, the manufacturer suspects that the machine filling the tablets may be malfunctioning leading to a smaller dose in each tablet. The manufacturer wishes to conduct a significance test to determine if the dose is significantly lower than 200 milligrams. State the null and alternative hypotheses for this test
The null hypothesis is that the mean dose of a tablet is 200 mg \[ H_0: \mu_0 = 200\] The alternative hypothesis is that the mean dose of a tablet is less than 200 mg \[ H_A: \mu < \mu_0 \]
\(\bf (12)\) A financial institution invests in a pharmaceutical company, which produces a widely prescribed antidepressant medication with a market value of $100 per share. Worried about potential discrepancies in the market valuation, the institution suspects that recent market trends may have led to an underestimation of the stock’s value. The institution plans to conduct a significance test to determine if the stock’s value has significantly increased. State the null and alternative hypotheses for this test.
The null hypothesis is that the mean share value is \(\$100\) \[ H_0: \mu_0 = 100\] The alternative hypothesis is that the mean share value is greater than \(\$100\) \[ H_A: \mu > \mu_0 \]
\(\bf (13)\) A semiconductor manufacturing company produces microchips with a target defect rate of 0.1% per batch. Concerned about the quality control process, the company suspects that recent changes in manufacturing procedures may have resulted in a change in the defect rate. The company intends to conduct a significance test to determine if the defect rate is significantly different than the target rate of 0.1%. State the null and alternative hypotheses for this test.
The null hypothesis is that the proportion of defective semi-conductors is \(0.001\) \[ H_0: p_0 = 0.001\] The alternative hypothesis is that the proportion of defective semi-conductors is not equal to \(0.001\) \[ H_A: p \neq p_0 \]
\(\bf (14)\) For each set of hypotheses, significance level, and \(p\)-value, state whether the test rejects or fails to reject the null hypothesis.
a.) \(H_0: \mu = 0\); \(H_A: \mu \neq 0\); \(\alpha = 0.01\); \(p\)-value \(= 0.0098\) Decision: reject \(H_0\) because \(p\)-value \(<\alpha\)
b.) \(H_0: p = 0.5\); \(H_A: p > 0.5\); \(\alpha = 0.05\); \(p\)-value \(= 0.086\) Decision: fail to reject \(H_0\) because \(p\)-value \(>\alpha\)
c.) \(H_0: \mu = 100\); \(H_A: \mu < 100\); \(\alpha = 0.001\); \(p\)-value \(= 0.0015\) Decision: fail to reject \(H_0\) because \(p\)-value \(>\alpha\)
d.) \(H_0: p = 0.9\); \(H_A: p \neq 0.9\); \(\alpha = 0.1\); \(p\)-value \(= 0.053\) Decision: reject \(H_0\) because \(p\)-value \(<\alpha\)
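All four decisions follow the same rule: reject \(H_0\) when the \(p\)-value is less than \(\alpha\). A short R sketch of that rule applied to parts (a) through (d), with illustrative variable names:

```r
# Significance levels and p-values from parts a.) through d.)
alpha   <- c(a = 0.01,   b = 0.05,  c = 0.001,  d = 0.10)
p_value <- c(a = 0.0098, b = 0.086, c = 0.0015, d = 0.053)

# Decision rule: reject H0 exactly when the p-value is below alpha
decision <- ifelse(p_value < alpha, "reject H0", "fail to reject H0")
print(decision)
```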
\(\bf (15)\) For each set of hypotheses, significance level, sample size, and test statistic, give the critical value and state whether the test rejects or fails to reject the null hypothesis.
a.) \(H_0: \mu = 15\); \(H_A: \mu > 15\); \(n = 33\); \(\alpha = 0.05\); \(t_{obs} = 2.13\) critical value: \(t_{32, 0.95 } = 1.69\); Decision: reject \(H_0\) because \(t_{obs}>1.69\)
b.) \(H_0: \mu = 0\); \(H_A: \mu \neq 0\); \(n = 14\); \(\alpha = 0.01\) ; \(t_{obs} = -4.1\) critical value: \(t_{13, 0.995} = 3.01\); Decision: reject \(H_0\) because \(| t_{obs}| > 3.01\)
c.) \(H_0: p = 0.5\); \(H_A: p < 0.5\); \(n = 120\); \(\alpha = 0.1\); \(Z_{obs} = -1.5\) critical value: \(Z_{0.1} = -1.28\); Decision: reject \(H_0\) because \(Z_{obs} < -1.28\)
d.) \(H_0: p = 0.05\); \(H_A: p \neq 0.05\); \(n = 60\); \(\alpha = 0.03\); \(Z_{obs} = 1.95\) critical value: \(Z_{0.985}= 2.17\); Decision: fail to reject \(H_0\) because \(|Z_{obs}|\) is not greater than \(2.17\)
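The critical values and decisions in parts (a) through (d) can be checked in base R. In this sketch the quantile is \(1-\alpha\) for an upper-tail test, \(\alpha\) for a lower-tail test, and \(1-\alpha/2\) for a two-tailed test; the variable names are illustrative.

```r
# a.) upper-tail t test: n = 33, alpha = 0.05, t_obs = 2.13
crit_a   <- qt(1 - 0.05, df = 33 - 1)
reject_a <- 2.13 > crit_a

# b.) two-tailed t test: n = 14, alpha = 0.01, t_obs = -4.1
crit_b   <- qt(1 - 0.01 / 2, df = 14 - 1)
reject_b <- abs(-4.1) > crit_b

# c.) lower-tail z test: alpha = 0.1, Z_obs = -1.5
crit_c   <- qnorm(0.1)
reject_c <- -1.5 < crit_c

# d.) two-tailed z test: alpha = 0.03, Z_obs = 1.95
crit_d   <- qnorm(1 - 0.03 / 2)
reject_d <- abs(1.95) > crit_d

print(round(c(a = crit_a, b = crit_b, c = crit_c, d = crit_d), 2))
print(c(a = reject_a, b = reject_b, c = reject_c, d = reject_d))
```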
\(\bf (16)\) A toy manufacturer claims that \(50\%\) of their toy robots are defect-free. However, a quality control inspector suspects that the actual proportion of defect-free robots is different from what the manufacturer claims. To investigate, the inspector randomly selects \(100\) toy robots from a production batch and examines them for defects. After the inspection, the inspector finds that \(40\) out of the \(100\) toy robots are defect-free. To test their suspicion, they set up a two-tailed hypothesis test with a significance level \(\alpha = 0.1\) and hypotheses \(H_0: p_0 = 0.5\) (manufacturer’s claim) \(H_A: p \neq p_0\).
test.type = 'two.tail'
null.value = 0.5
sample.size = 100
estimated.proportion = 40/100
significance.lvl = 0.1
one.sample.prop.test(p0 = null.value,
phat = estimated.proportion,
n = sample.size,
alpha = significance.lvl,
test = test.type,
verbose = T)
## [1] ==================== test results ====================
## [1] test type = two.tail
## [1] H0: p0 = 0.5
## [1] HA: p != 0.5
## [1] estimate = 0.4
## [1] Estimated Standard Error = 0.05
## [1] Critical Value = 1.6449
## [1] 90% CI = [0.3194,0.4806]
## [1] Test statistic = -2
## [1] Pvalue = 0.0455
## [1] Decision: reject H0
## [1] ======================================================
At the \(10\%\) significance level, we reject the null hypothesis because the \(p\)-value \(0.0455\) is less than \(\alpha = 0.1\), and conclude that the proportion of defect-free toy robots is significantly different from \(0.5\).
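The helper `one.sample.prop.test` is not defined in this document. The sketch below is one plausible base R version of a one-sample \(z\)-test for a proportion, assuming the standard error is computed under \(H_0\) as \(\sqrt{p_0(1-p_0)/n}\) (which matches the reported value of \(0.05\)). The name `prop_z_test_sketch` is hypothetical.

```r
# Hypothetical stand-in for one.sample.prop.test (assumed one-sample z-test)
prop_z_test_sketch <- function(p0, phat, n, alpha,
                               test = c("two.tail", "lower.tail", "upper.tail")) {
  test <- match.arg(test)
  se0  <- sqrt(p0 * (1 - p0) / n)     # standard error under the null hypothesis
  z    <- (phat - p0) / se0           # observed test statistic
  p_value <- switch(test,
                    two.tail   = 2 * pnorm(-abs(z)),
                    lower.tail = pnorm(z),
                    upper.tail = 1 - pnorm(z))
  list(statistic = z, p.value = p_value, reject = p_value < alpha)
}

# Problem (16): 40 defect-free robots out of 100, two-tailed test at alpha = 0.1
res16 <- prop_z_test_sketch(p0 = 0.5, phat = 40 / 100, n = 100,
                            alpha = 0.1, test = "two.tail")
print(res16)
```

Because the \(p\)-value is below \(\alpha = 0.1\), `res16$reject` is `TRUE`, in agreement with the "reject H0" decision printed in the output above.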
\(\bf (17)\) A pharmaceutical company has developed a new drug designed to cure a specific bacterial infection. They claim that their drug is effective, with a cure rate of \(20\%\). However, a group of independent researchers believes that the cure rate is less than what the pharmaceutical company claims. To investigate, the researchers conduct a clinical trial on \(300\) patients suffering from the bacterial infection. After the trial, they find that \(70\) out of the \(300\) patients were cured using the new drug. To test their suspicion, they set up a hypothesis test with a significance level of \(\alpha = 0.05\) with \(H_0: p_0 = 0.2\), \(H_A: p < p_0\)
test.type = 'lower.tail'
null.value = 0.2
sample.size = 300
estimated.proportion = 70/300
significance.lvl = 0.05
one.sample.prop.test(p0 = null.value,
phat = estimated.proportion,
n = sample.size,
alpha = significance.lvl,
test = test.type,
verbose = T)
## [1] ==================== test results ====================
## [1] test type = lower.tail
## [1] H0: p0 = 0.2
## [1] HA: p < 0.2
## [1] estimate = 0.2333
## [1] Estimated Standard Error = 0.0231
## [1] Critical Value = -1.6449
## [1] 95% CI = [0.1855,0.2812]
## [1] Test statistic = 1.4434
## [1] Pvalue = 0.9255
## [1] Decision: fail to reject H0
## [1] ======================================================
a.) Compute the critical value for this test
the critical value is \(Z_{\alpha} = -1.64\)
b.) Compute the test statistic \(Z_{obs}\)
the test statistic is \[Z_{obs} = \frac{0.233 - 0.2}{\sqrt{\frac{0.2(1-0.2)}{300}}} = 1.44 \]
c.) Compute the \(p\)-value
the \(p\)-value is \[P(Z < Z_{obs} | H_0 \ \text{True}) = 0.9255 \]
At the \(5\%\) significance level, we fail to reject the null hypothesis and conclude that the proportion of patients cured by the drug is not significantly less than \(0.2\)
\(\bf (18)\) A soft drink company claims that their new “Zero Sugar” soda contains zero grams of added sugar per 12-ounce can. To test this claim, a random sample of \(25\) cans of this soda is selected. The sample mean is found to be \(1.3\) grams of added sugar per can, with a standard deviation of \(0.5\) grams. To determine if there is enough evidence to reject the company’s claim and conclude that the soda does not contain zero grams of added sugar per can, a group of researchers are interested in testing the following hypotheses: \(H_0: \mu_0 = 0\), \(H_A: \mu \neq \mu_0\) at the \(\alpha = 0.01\) significance level.
#The null hypothesis is H_0: m0 = 0 (no added sugar)
#The alternative hypothesis is H_A: m > m0 - this is the best option since it does not
#make sense to test for less than 0 grams of added sugar.
test.type = 'upper.tail'
null.value = 0
sample.mean = 1.3
sample.sd = 0.5
sample.size = 25
significance.lvl = 0.01
one.sample.t.test(m0 = null.value,
xbar = sample.mean,
s = sample.sd,
n = sample.size,
alpha = significance.lvl,
test = test.type,
verbose = T)
## [1] ==================== test results ====================
## [1] test type = upper.tail
## [1] H0: m0 = 0
## [1] HA: m > 0
## [1] Estimate = 1.3
## [1] Estimated Standard Error = 0.1
## [1] Critical Value = 2.4922
## [1] 99% CI = [1.0203,1.5797]
## [1] Test statistic = 13
## [1] Degrees of Freedom = 24
## [1] Pvalue = 0
## [1] Decision: reject H0
## [1] ======================================================
At the \(1\%\) significance level, we reject the null hypothesis and conclude that the mean amount of added sugar per can is significantly greater than zero.
\(\bf (19)\) A group of botanists is studying the growth rate of Venus fly traps in a controlled greenhouse environment. Based on historical data, the typical growth rate of Venus fly traps is believed to be \(100\) millimeters per year. The botanists suspect that the current growth rate in their greenhouse is higher than this historical value. To investigate, they take a random sample of \(40\) Venus fly traps and measure their growth rates over a year. After the study, they find that the sample mean growth rate for the \(40\) Venus fly traps is \(105\) millimeters with a standard deviation of \(12\)mm. To test their suspicion, they set up a hypothesis test with a significance level of \(\alpha = 0.05\)
test.type = 'upper.tail'
null.value = 100
sample.mean = 105
sample.sd = 12
sample.size = 40
significance.lvl = 0.05
one.sample.t.test(m0 = null.value,
xbar = sample.mean,
s = sample.sd,
n = sample.size,
alpha = significance.lvl,
test = test.type,
verbose = T)
## [1] ==================== test results ====================
## [1] test type = upper.tail
## [1] H0: m0 = 100
## [1] HA: m > 100
## [1] Estimate = 105
## [1] Estimated Standard Error = 1.8974
## [1] Critical Value = 1.6849
## [1] 95% CI = [101.1622,108.8378]
## [1] Test statistic = 2.6352
## [1] Degrees of Freedom = 39
## [1] Pvalue = 0.006
## [1] Decision: reject H0
## [1] ======================================================
The null hypothesis is that the mean growth rate of Venus fly traps is \(100\)mm per year. \[H_0: \mu_0 = 100 \] The alternative hypothesis is that the growth rate is higher than \(100\)mm per year \[H_A: \mu > \mu_0\]
The critical value is \(t_{39, 0.95} = 1.68\)
\[t_{obs} = \frac{105 - 100}{12/\sqrt{40}} \approx 2.64 \]
The \(p\)-value is given by \[P(t > t_{obs}| H_0 \ \text{true}) = 1 - P(t \leq t_{obs}| H_0 \ \text{true}) = 1 - 0.994 = 0.006\]
At the \(5\%\) significance level, we reject the null hypothesis and conclude that the current growth rate is significantly higher than 100 mm per year.
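The three quantities worked out by hand above (critical value, test statistic, and \(p\)-value) can be cross-checked in base R. This sketch covers only the upper-tail case used in this problem, and `upper_t_test_sketch` is a hypothetical name, not the course's `one.sample.t.test`.

```r
# Hypothetical upper-tail one-sample t-test from summary statistics
upper_t_test_sketch <- function(m0, xbar, s, n, alpha) {
  df     <- n - 1
  se     <- s / sqrt(n)                        # estimated standard error
  t_obs  <- (xbar - m0) / se                   # observed test statistic
  t_crit <- qt(1 - alpha, df = df)             # critical value t_{n-1, 1 - alpha}
  p_val  <- pt(t_obs, df = df, lower.tail = FALSE)  # P(t > t_obs | H0 true)
  list(statistic = t_obs, critical = t_crit, p.value = p_val,
       reject = t_obs > t_crit)
}

# Problem (19): n = 40, sample mean 105 mm, standard deviation 12 mm, alpha = 0.05
res19 <- upper_t_test_sketch(m0 = 100, xbar = 105, s = 12, n = 40, alpha = 0.05)
print(res19)
```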