Statistics is a scientific field that deals with the collection, description, analysis, interpretation, and presentation of data. We see and use data in nearly all aspects of our everyday lives: politics, medicine, forecasting, finance, marketing, etc. In short, we use statistics to ask and answer scientific questions and make predictions. To put it another way, statistics is the science of data.
Data are pieces of factual information, such as measurements, which are recorded and used for the purpose of analysis. In summary, data are the raw information from which statistics are created. Statistics are the results of data analysis: its interpretation and presentation.
Typically data are represented in a rectangular table called a data table. The fundamental unit of data is an observation: a single row of a data table representing a set of measurements on one or more variables.
A variable is a characteristic of an observation. In a typical data table, the variables are represented by the columns of the data table.
The following example shows fictional data consisting of 20 observations of the duty posting, height, age, blaster accuracy, and rank of stormtroopers graduating from the Imperial Stormtrooper Academy.
Identification Number | Duty Posting | Height (cm) | Age | Blaster Accuracy | Rank |
---|---|---|---|---|---|
FN-2414 | Berchest Station | 184.9 | 19 | 0.62 | PV1 |
FN-2462 | Death Star | 193.3 | 20 | 0.66 | PV2 |
FN-2178 | Death Star | 191.0 | 20 | 0.77 | CPL |
FN-2525 | Lothal | 186.7 | 23 | 0.61 | PFC |
FN-2194 | Corellia | 194.6 | 21 | 0.66 | PV1 |
FN-2937 | Fondor Ship Yard | 191.9 | 22 | 0.75 | PV2 |
FN-2817 | Fondor Ship Yard | 189.5 | 21 | 0.59 | CPL |
FN-2117 | Death Star | 193.5 | 21 | 0.66 | PFC |
FN-2298 | Corellia | 193.4 | 24 | 0.66 | PV1 |
FN-2228 | Berchest Station | 193.2 | 21 | 0.71 | PV2 |
FN-2243 | Death Star | 192.8 | 24 | 0.69 | CPL |
FN-2013 | Corellia | 192.3 | 18 | 0.62 | PFC |
FN-2373 | Lothal | 190.3 | 22 | 0.60 | PV1 |
FN-2664 | Berchest Station | 189.5 | 21 | 0.72 | PV2 |
FN-2601 | Fondor Ship Yard | 189.2 | 21 | 0.73 | CPL |
FN-2602 | Lothal | 188.2 | 22 | 0.62 | PFC |
FN-2767 | Death Star | 189.8 | 20 | 0.76 | PV1 |
FN-2708 | Death Star | 186.3 | 20 | 0.61 | PV2 |
FN-2090 | Fondor Ship Yard | 197.7 | 19 | 0.64 | CPL |
FN-2952 | Corellia | 194.5 | 19 | 0.57 | PFC |
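As a rough sketch of how a data table maps onto observations and variables, the first three rows of the stormtrooper table can be written in Python as a list of dictionaries. The key names (`id`, `posting`, `height_cm`, and so on) are illustrative choices, not part of the original data.

```python
# First three observations from the stormtrooper table.
# Each dictionary is one observation (a row); each key is one variable (a column).
# The key names are illustrative assumptions.
troopers = [
    {"id": "FN-2414", "posting": "Berchest Station", "height_cm": 184.9,
     "age": 19, "accuracy": 0.62, "rank": "PV1"},
    {"id": "FN-2462", "posting": "Death Star", "height_cm": 193.3,
     "age": 20, "accuracy": 0.66, "rank": "PV2"},
    {"id": "FN-2178", "posting": "Death Star", "height_cm": 191.0,
     "age": 20, "accuracy": 0.77, "rank": "CPL"},
]

# One observation: every measurement recorded for a single stormtrooper.
first_observation = troopers[0]

# One variable: the same characteristic read across all observations.
heights = [row["height_cm"] for row in troopers]

print(first_observation["id"])  # FN-2414
print(heights)                  # [184.9, 193.3, 191.0]
```

Reading across a dictionary gives an observation, while reading the same key down the list gives a variable.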
The following are 15 observations from a study of the morphology of Egyptian skulls from 5 epochs of Egyptian history. For each skull, the epoch and several measurements characterizing the shape of the skull are recorded.
Epoch | maximal breadth | basibregmatic height | basialveolar length | nasal height |
---|---|---|---|---|
c200BC | 139 | 130 | 94 | 53 |
c3300BC | 131 | 134 | 96 | 50 |
c4000BC | 131 | 138 | 89 | 49 |
c4000BC | 124 | 138 | 101 | 46 |
c200BC | 135 | 131 | 99 | 51 |
c4000BC | 131 | 134 | 97 | 54 |
c1850BC | 133 | 131 | 96 | 49 |
c3300BC | 129 | 126 | 91 | 50 |
c4000BC | 132 | 131 | 101 | 49 |
c200BC | 133 | 136 | 95 | 52 |
c200BC | 141 | 130 | 87 | 49 |
c3300BC | 126 | 131 | 100 | 48 |
c200BC | 129 | 135 | 95 | 47 |
c4000BC | 135 | 135 | 103 | 47 |
c3300BC | 131 | 139 | 98 | 51 |
Variables are distinguished by the type of information they represent. One basic distinction is between qualitative variables and quantitative variables.
qualitative variables - also called categorical variables; they represent non-numeric qualities or characteristics that can be placed in distinct categories. What are the qualitative variables in the stormtrooper data above?
quantitative variables - sometimes referred to as numeric variables; they are numerical characteristics which have an inherent order or ranking. In the Egyptian skull data, which variables are quantitative variables?
Qualitative variables can be further divided between nominal and ordinal variables.
nominal variables - non-numeric qualities or characteristics that can be placed in distinct categories that do not have a natural ordering. Examples of variables that cannot be ordered are gender, race, eye color, and political party. In the stormtrooper data, which variables would be considered nominal variables?
ordinal variables - non-numeric qualities or characteristics that can be placed in distinct categories with an inherent ordering. Examples include education level (e.g., bachelor's, master's, PhD) or temperature (e.g., cool, warm, hot).
Quantitative variables can also be further divided between two sub-categories: discrete and continuous variables.
discrete variables - quantitative variables that take on distinct, countable values (e.g., whole numbers or integers) such as \(0, 1, 2, 3, \ldots\). Any quantitative variable that represents a count of objects or items is discrete. For example, age in years is a count of the number of years someone or something has been alive.
continuous variables - quantitative variables that can take on an infinite number of values within the interval between any two specific values (e.g., temperature in \(^{\circ}C\) or \(^{\circ}F\), height in inches, speed in miles per hour).
Statistics is generally concerned with studying properties of a population. You can think of a population as a collection of all possible persons, things, or objects that you are studying. Another way to think of a population is as a collection of all possible observations of a variable - both observed and unobserved.
For many populations, it is very difficult or even impossible to observe all members of the population. To tackle this challenge, statistics uses samples to learn about the properties of a population. A sample is a subset of the population that is actually observed. You can think of a sample as the set of observations that were actually recorded.
The idea of sampling is to select a portion of individuals or objects that are representative of the population.
To illustrate the distinction between a sample and a population, let’s take the example of a wildlife biologist researching the paw size of mountain lions in the state of Idaho. In this scenario, the population being examined encompasses all the mountain lions residing in the state, totaling approximately 2,000 individuals. However, due to the considerable challenges and expenses associated with tracking and observing every existing member, the biologist opts to document the paw size of a more manageable subset of 20 mountain lions. The 20 mountain lions for which the biologist records paw sizes constitute a sample from the larger population of mountain lions.
In statistics we typically represent the size (i.e., the number of observations) of a population with the letter \(N\) and the size of a sample with the letter \(n\). In the above example of the wildlife biologist studying mountain lion paw size, the population has size \(N = 2{,}000\) and the sample the biologist took has size \(n = 20\).
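The relationship between \(N\) and \(n\) can be sketched in Python. This is only an illustration: the ID numbers standing in for mountain lions are hypothetical, and the text does not say how the biologist chose the 20 animals, so the simple random sample drawn here is an assumption.

```python
import random

# Hypothetical population: ID numbers standing in for the roughly 2,000 mountain lions.
population = list(range(1, 2001))
N = len(population)

# Assumed sampling scheme: a simple random sample without replacement.
# The seed only makes the run repeatable.
rng = random.Random(0)
sample = rng.sample(population, k=20)
n = len(sample)

print(N, n)  # 2000 20
```

Every member of `sample` is also a member of `population`, which is what makes the sample a subset of the population.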
Statistics primarily deals with estimation: the process of inferring an unknown quantity about a population using a set of sample data. The unknown population quantity is called a parameter, and a quantity computed from sample data is called a statistic. An estimator is a mathematical function of the observed data that is used to estimate a given parameter.
We will use Greek letters such as \(\theta, \sigma, \mu\) to represent parameters. The same symbols with “hats”, like \(\hat{\theta}, \hat{\sigma}, \hat{\mu}\), are used to represent estimators/statistics.
Two statistics that we will deal with in this course are the sample mean and sample proportion.
A proportion describes the fraction of a whole that has some property or belongs to some category. It can be expressed as a value between \(0\) and \(1\) or as a percentage. The sample proportion \(\hat{p}\) is the fraction of observations in the sample that belong to a particular category. The population proportion \(p\) is the fraction of observations in the population that belong to a particular category. The sample proportion is defined mathematically as \[ \hat{p} = \frac{\text{Number of sample observations in category}}{n}\] The population proportion is defined mathematically as \[ p = \frac{\text{Number of population observations in category}}{N}\] The sample proportion \(\hat{p}\) is used to estimate the population proportion \(p\).
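As a small worked example, the sample proportion formula can be applied to the duty postings in the stormtrooper table, treating those 20 rows as the sample (an assumption made for illustration). The variable names are arbitrary.

```python
# Duty postings of the 20 stormtroopers, in table order.
postings = [
    "Berchest Station", "Death Star", "Death Star", "Lothal", "Corellia",
    "Fondor Ship Yard", "Fondor Ship Yard", "Death Star", "Corellia",
    "Berchest Station", "Death Star", "Corellia", "Lothal",
    "Berchest Station", "Fondor Ship Yard", "Lothal", "Death Star",
    "Death Star", "Fondor Ship Yard", "Corellia",
]

n = len(postings)                            # sample size
in_category = postings.count("Death Star")   # observations in the category
p_hat = in_category / n                      # sample proportion

print(in_category, n, p_hat)  # 6 20 0.3
```

Here 6 of the 20 observations are posted to the Death Star, so the sample proportion is 0.3, or 30%.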
The arithmetic mean (sometimes called the average) describes the center of a population or set of data. The sample mean \(\bar{x}\) describes the central tendency of the observations in the sample. The population mean \(\mu\) describes the central tendency of the entire population. The sample mean is defined mathematically as \[\bar{x} = \sum_{i = 1}^{n} \frac{x_i}{n}\] The population mean is defined mathematically as \[ \mu = \sum_{i = 1}^{N} \frac{x_i}{N}\] The sample mean \(\bar{x}\) is used to estimate the population mean \(\mu\).
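The sample mean formula can be checked the same way, using the ages from the stormtrooper table and again treating the 20 rows as the sample (an assumption made for illustration).

```python
# Ages of the 20 stormtroopers, in table order.
ages = [19, 20, 20, 23, 21, 22, 21, 21, 24, 21,
        24, 18, 22, 21, 21, 22, 20, 20, 19, 19]

n = len(ages)        # sample size
total = sum(ages)    # sum of x_i for i = 1..n
x_bar = total / n    # sample mean

print(total, n, x_bar)  # 418 20 20.9
```

Dividing the sum by \(n\) gives the same result as summing each \(x_i / n\), which is the form used in the definition.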
The tools for estimation allow us to approximate almost everything about populations using only samples. With estimates we can:
Recall that statistics is a science that deals with the collection, description, analysis, interpretation, and presentation of data. The collection of data is referred to as the design. In statistics, design concerns the formulation of statistical questions and the process/method in which we plan to collect data to answer our statistical question.
Descriptive statistics - refers to an organizational step that occurs before we do any analysis or make any decisions/predictions. It includes describing the observations in a sample using statistics or describing a population using parameters. It can also include graphical summaries of a sample or population.
Inferential statistics - refers to using a sample (usually a statistic) to answer a question about a population such as estimating the value of a parameter. It includes estimating parameters, making decisions, and prediction.
Key Point: Descriptive statistics can be applied to samples or populations. Inferential statistics are only applied to samples in order to make inferences about a population.
As we learned in the previous lecture, descriptive statistics is the first step of statistical analysis and involves the summary and organization of data. One of the main ways we can summarize data is to describe the characteristics of the distribution of the variable(s) in the data. A distribution tells us something about what kinds of values a variable can have and how often they occur.
Student | Age | Response |
---|---|---|
1 | 13 | Never |
2 | 13 | Sometimes |
3 | 15 | Never |
4 | 17 | Often |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
743 | 16 | Rarely |
Formally, a distribution is a function that gives (a) the possible values of a variable and (b) how often each value occurs. The how often is usually described using frequency or relative frequency.
The frequency of a given value is simply a count of the number of times that the given value occurs.
The relative frequency of a given value is the proportion of times that the given value occurs.
We can also use the cumulative relative frequency to describe how often the values of a variable occur. The cumulative relative frequency is the proportion of values that are less than or equal to a given value in a distribution.
Consider the following hypothetical set of \(10\) exam scores (out of 10 points) for students in a statistics course \[\{5, 6, 6, 7, 7, 7, 8, 9, 9, 10\}\]
The frequency for an exam score of \(6\) is \(\text{freq}(6) = 2\) because \(2\) of the \(10\) students had an exam score of \(6\).
Similarly, the relative frequency for an exam score of \(6\) is \(2/10 = 0.2\).
The cumulative relative frequency for an exam score of \(6\) is the proportion of values in the data that are less than or equal to \(6\). This is the frequency of \(5\)’s plus the frequency of \(6\)’s, divided by the number of scores: \((1 + 2)/10 = 3/10\) or \(0.3\). Alternatively, we could simply add the relative frequency for an exam score of \(5\) to the relative frequency for an exam score of \(6\): \(0.1 + 0.2 = 0.3\).
Score | Frequency | Relative Frequency | Cumulative Relative Frequency |
---|---|---|---|
5 | 1 | 0.1 | 0.1 |
6 | 2 | 0.2 | 0.3 |
7 | 3 | 0.3 | 0.6 |
8 | 1 | 0.1 | 0.7 |
9 | 2 | 0.2 | 0.9 |
10 | 1 | 0.1 | 1.0 |
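The table above can be reproduced with a short Python sketch. It uses only the standard library; the variable names are arbitrary, and `Fraction` is used for the running total so that the cumulative values are not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

scores = [5, 6, 6, 7, 7, 7, 8, 9, 9, 10]
n = len(scores)

# Frequency: how many times each score occurs.
freq = Counter(scores)

# Build one row per distinct score, in increasing order.
cumulative = Fraction(0)
table = []
for value in sorted(freq):
    rel = Fraction(freq[value], n)   # relative frequency
    cumulative += rel                # cumulative relative frequency
    table.append((value, freq[value], float(rel), float(cumulative)))

for row in table:
    print(row)
```

Each printed tuple holds the score, its frequency, its relative frequency, and its cumulative relative frequency, matching the corresponding row of the table.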