NO CLASS: Monday, Jan. 15th 2024




CLASS CANCELED: Wednesday, Jan. 17th 2024




Lecture 3: Friday, Jan. 19th 2024

Describing and Visualizing Distributions (Continued)

Recall that descriptive statistics is a preliminary step to statistical inference and is concerned with summarizing and characterizing a set of data from a sample or population . Creating a frequency table can be a useful first step for describing the variables in a set of data. However, there are many other ways to describe or visualize data. This week we will introduce some different ways to visualize and describe a set of data and what we can learn from these descriptions.

More on frequency tables

Recall that a frequency table shows the distribution of a variable organized as a table. The first column represents the possible values of the variable and the remaining columns describe how often each value occurs using frequency, relative frequency, cumulative relative frequency (if applicable) or potentially all three.

We can create a frequency table for any type of variable. However, when dealing with quantitative continuous variables we cannot list all possible values because the number of possible values is infinite. Therefore, we usually have to make a decision about how we want to group values of the variable, a process called binning. Binning is a way to group numbers of more-or-less continuous values into a smaller number of “bins”. These bins represent a collection of values that fall into a certain range and make listing the possible values a lot easier.

Consider the following trivial example using the quantitative continuous variable \[X = \{1.23, \ 1.38, \ 1.76, \ 2.60, \ 2.60, \ 2.79, \ 2.89, \ 3.12, \ 3.15, \ 3.18, \ 3.65, \ 3.77, \ 3.87, \ 4.12, \ 4.41\}\]

Each observation of the variable \(X\) is measured with a precision of two decimal places, and no two observations in the sample are identical. As a result if we simply listed the possible values we would have \(15\) unique values each with a frequency of \(1\) which would produce a frequency table that wasn’t very informative. We can, instead, use binning to extract more meaningful information about the distribution of \(X\). Now consider the frequency table using binning:

Bin Frequency Relative Frequency Cumulative Relative Frequency
\(1 < X \leq 1.5\) 2 0.13 0.13
\(1.5 < X \leq 2\) 1 0.07 0.20
\(2 < X \leq 2.5\) 0 0.00 0.20
\(2.5 < X \leq 3\) 4 0.27 0.47
\(3 < X \leq 3.5\) 3 0.20 0.67
\(3.5 < X \leq 4\) 3 0.20 0.87
\(4 < X \leq 4.5\) 2 0.13 1.00

The following is another example using \(n = 272\) observations collected on the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Duration (min) Waiting Time (min)
3.6 79
1.8 54
3.333 74
2.283 62
4.533 85
2.883 55
4.7 88
3.6 85
\(\vdots\) \(\vdots\)

Consider a frequency table to describe the distribution of the variable waiting time (min)

Waiting Time Frequency Relative Frequency Cumulative Relative Frequency
\(40 < X \leq 45\) 4 0.015 0.015
\(45 < X \leq 50\) 22 0.081 0.096
\(50 < X \leq 55\) 33 0.121 0.217
\(55 < X \leq 60\) 24 0.088 0.305
\(60 < X \leq 65\) 14 0.051 0.357
\(65 < X \leq 70\) 10 0.037 0.393
\(70 < X \leq 75\) 27 0.099 0.493
\(75 < X \leq 80\) 54 0.199 0.691
\(80 < X \leq 85\) 55 0.202 0.893
\(85 < X \leq 90\) 23 0.085 0.978
\(90 < X \leq 95\) 5 0.018 0.996
\(95 < X \leq 100\) 1 0.004 1.000
  • How long does a visitor typically have to wait before seeing old faithful erupt?
  • Is there more than one “typical” value in this distribution?

IMPORTANT: one thing to consider is that grouping continuous data into discrete intervals leads to a loss of precision. Sometimes this can result in the loss of valuable information present in the original data. In addition, how we choose to bin the data (i.e how wide we make the intervals) may lead to different, sometimes erroneous, conclusions about the distribution and patterns within the data.

Consider the frequency table for the waiting time (min) with only two bins

Waiting Time Frequency Relative Frequency Cumulative Relative Frequency
\(40 < X \leq 70\) 107 0.393 0.393
\(70 < X \leq 100\) 165 0.607 1.000
  • How does using only two bins/intervals alter your conclusion regarding the typical time a visitor must wait before witnessing Old Faithful’s eruption?

Features of a distribution

Descriptive statistics or graphical summaries can tell us a lot about the variables in our data. Graphs help us visualize trends in the data, identify problematic observations, and can help inform the types of analyses we should use. We may be interested in certain features related to the distribution(s) of the variable(s) in our data, such as the shape, center, and variability .

  • shape refers to how observations cluster across the range of a variable. Certain aspects to note would be where observations are densely packed (values with high frequencies) and where are observations sparser (values with low frequencies). This can give us a feel for where a typical value would appear and what values may be considered “rare” or “extreme”.

  • Center refers to the ``middle point” on the distribution and also supplies information about where does a typical value falls.

  • Variability refers to how tightly observations cluster around the center of a distribution. When values are tightly packed they are said to have low variability vs when they are spread out they are said to have high variability

  • modal category is a feature of qualitative data only and refers to the most frequent category of a qualitative variable.

Graphs for qualitative data

There are several graphs that can be used to display qualitative data. The most common are bar graphs and pie charts.

  • A Bar graph is simple and widely used plot for displaying the distribution of categorical variables. Each category is represented by a bar, and the height of the bar corresponds to the frequency or proportion of observations in that category. When the bars are sorted in decreasing frequency it is called a Pareto chart.

  • A Pie chart is useful for illustrating the proportional contribution of each category to the whole. Beyond this, it is not a particularly useful plot when describing the features of a distribution.

  • Ex. Recall the example from Week 1 regarding the Pew Research survey asking teens to rate how distracted they are by their cell phones. The following are a bar graph and pie chart of student responses

  • what is the modal category of the variable Response ?

Graphs for quantitative data

There are also many plots that can be used with quantitative data. Some of the most commonly used graphs are listed below.

  • A dot plot shows a dot for each observation placed just above the value on the number line for that observation. To construct a dot plot:
      1. Draw a horizontal line, label it with the name of variable, and mark regular values of the variable on the line.
      1. For each observation, place a dot above its value on the number line.
  • A stem and leaf plot is similar to a dot plot in that it displays individual observations of a variable. Each observation in the plot consists of a stem and a leaf. The stem is all of the digits except fo the last digit. The leaf is the last digit. For example, take the observation \(12.3\) the stem for this observation is \(12\) and the leaf is \(3\). To construct a stem and leaf plot:
      1. Draw a table with two columns and label the left column “Stem” and the right column “leaf”.
      1. Next, sort the observations from smallest to largest and place the stems in the left column of the your table in increasing order.
      1. in the left column, fill in the leaves of each stem in increasing order.

Example: A dot plot constructed for the \(n = 272\) observations collected on the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

  • Think about the shape of this distribution and how it relates to the time that a typical visitor can expect to wait before the geyser erupts. Is the result consistent with the frequency table we constructed earlier?

Example: A stem and leaf plot constructed for the \(n = 272\) observations collected on the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   4 | 3
##   4 | 55566666777788899999
##   5 | 00000111111222223333333444444444
##   5 | 555555666677788889999999
##   6 | 00000022223334444
##   6 | 555667899
##   7 | 00001111123333333444444
##   7 | 555555556666666667777777777778888888888888889999999999
##   8 | 000000001111111111111222222222222333333333333334444444444
##   8 | 55555566666677888888999
##   9 | 00000012334
##   9 | 6

Both dot plots and stem and leaf plots are excellent choices for discrete variables with a small to moderate number of observations. However, when a variable consists of many observations, has low variability, or is continuous, it can be difficult to constructs these two plots.

  • When dealing with continuous data, it is often necessary to round observations to the nearest tenth before constructing either plot.

The The following are alternatives that work for variables of any number of observations.

More graphs for quantitative data

  • A Histogram is a specialized type of bar graph. It uses bars to portray the frequencies or relative frequencies of the possible outcomes for a quantitative variable of either type. The histogram uses binning to condense the view of a quantitative variable into a smaller set of intervals which makes it particularly useful when the number of observations is large. The width of the bins is arbitrary but all bins should be the same length and cover all possible values of the data. The steps for constructing a histogram are:

      1. Divide the data into non-overlapping intervals (bins) of equal length.
      1. Compute the frequency of each interval (this is the same as constructing the frequency table for the chosen bins)
      1. Label the x-axis with the midpoints or endpoints of each interval.
      1. Draw a bar over each value or interval with height equal to its frequency or relative frequency
  • In general, you can choose the the number of bins \(k\) you wish to use and set the bin width \(w\) accordingly or alternatively set the bin width you want and use the corresponding number of bins.

    • formula for choosing the number of bins based on bin width \(w\)

    \[ k = \frac{\max \ x - \min x}{w}\]

    • formula for choosing the width based on the number of bins \(k\)

    \[ w = \frac{\max x - \min x}{k}\]

    where \(\min\) and \(\max\) denote the smallest and larges values of the variable \(x\), respectively.

  • How many bins should I use? This is actually a question that has received a lot of attention in the scientific literature. Typically, you want to be sure that you don’t use “too few” bins as this can causing misleading conclusions about the features of the distribution. There are several rules that are based on the coverage and number of observations of the variable:

    • square root method is defined as \(k = \text{round}(\sqrt{n})\). This is a fairly easy and safe rule for choosing the number of bins to use. However, for large sample sizes it can lead to a very large number of bins.

    • Sturges Rule is defined as \(k = \text{round}(\log_2 n)+1\). This method is not great for a small number of observations (i.e \(n < 30\))

    • Rices Rule is defined as \(k = 2\sqrt[3]{n}\). This rule is a nice intermediate between Sturges Rule and the square root method.

Consider the following example of histograms constructed for the \(n = 272\) observations collected on the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Try it out: Construct a histogram for the following 10 observations of the variable \(X\) \[X = \{ -1.49,-0.65,-0.6,-0.54,-0.45,0.01,0.17,0.27,0.51,1.34 \}\] using \(k = 4\) bins/intervals

  • In this example, we are given the desired number of bins, so we can use our equation to find the width of the bins \(w\).

\[ w = \frac{1.34 - (-1.49)}{4} \approx \frac{3}{4} \ \text{or} \ 0.75 \]

  • now that we know about how wide each interval should be we can easily comupute the intervals. First, let \(a\) be the minimum value of \(X\) and \(b\) be the maximum value such that \(\min(X) = a, \max(X) = b\). The first interval is defined as \([a \leq X < a+w]\) which gives \([-1.5 \leq X < -0.75]\). The second interval is given by \([-0.75 \leq X < -0.75 + w]\) or \([-0.75 \leq X < 0]\). Continuing this pattern gives the remaining bins yields the following frequency table:
X Frequency Relative Frequency Cumulative Relative Frequency
\(-1.5 < X \leq -0.75\) 1 0.1 0.1
\(-0.75 < X \leq 0\) 4 0.4 0.5
\(0 < X \leq 0.75\) 4 0.4 0.9
\(0.75 < X \leq 1.5\) 1 0.1 1.0
  • From the frequency table above, we can now easily construct the histogram of \(X\) by simply drawing the appropriate bars over each interval/bin:

If you’d like to read more about scaling interval widths you can check out these references:

Sturges, Herbert A.”The choice of a class interval.” Journal of the american statistical association 21.153 (1926): 65-66.

Lane, David. Online statistics education: A multimedia course of study. Association for the Advancement of Computing in Education (AACE), 2003. – Chapter 2 “Graphing Distributions