
In this example R script, I will demonstrate how you can use R to construct some of the graphs and descriptive statistics that we have learned so far. As an example dataset, we will analyze data from a survey of college students in Georgia. The data are given below:

Height Gender Haircut Job Studytime Smokecig Dated HSGPA CGPA HomeDist BrowseInternet WatchTV Exercise ReadNewsP Vegan PoliticalDegree PoliticalAff
65 1 25 1 7.0 0 0 3.90 3.30 1.00 60 120 0.0 4 0 4 2
71 0 12 0 2.0 0 1 3.79 3.13 5.00 20 120 1.0 0 0 2 3
68 1 4 0 4.0 0 1 3.00 3.60 15.00 60 90 1.0 1 0 2 3
64 1 0 1 3.5 0 1 3.90 3.50 0.75 10 120 15.0 1 0 4 2
64 1 50 0 4.5 0 1 3.60 3.50 40.00 30 120 6.0 2 0 5 2
66 0 10 1 3.0 0 3 3.20 3.75 5.00 30 120 7.0 5 0 4 1
74 0 0 0 3.0 0 0 3.78 3.47 1.00 0 0 7.0 4 0 6 2
69 1 18 0 8.0 1 1 3.20 2.80 40.00 10 60 4.0 2 0 6 2
65 1 55 0 5.5 0 0 3.50 2.88 7.00 15 90 2.5 3 0 7 2
68 1 41 1 5.5 0 0 3.80 3.28 3.00 10 60 3.0 0 0 5 2
65 1 40 1 5.0 0 1 3.89 3.53 2.90 15 60 4.0 1 0 6 2
66 1 14 0 7.0 0 1 3.80 3.50 4.00 30 180 5.0 3 0 2 2
62 1 35 1 9.0 0 0 4.00 3.98 1.20 15 45 7.5 4 0 6 2
72 0 10 0 4.0 0 1 3.30 2.60 2.00 120 180 60.0 5 0 4 3
69 1 45 1 10.0 0 0 3.70 3.50 30.00 10 60 0.0 0 0 6 2
69 1 0 0 3.0 0 0 3.90 3.98 40.00 30 20 1.0 1 0 7 2
68 0 0 0 3.0 0 0 3.50 3.75 2.50 30 60 8.0 5 0 5 2
72 0 0 0 3.5 0 1 3.90 3.67 2.50 75 90 7.0 0 0 5 2
66 1 20 0 4.5 0 1 4.00 3.75 8.00 40 120 3.0 5 0 4 2
69 0 12 1 3.5 0 0 3.50 3.90 40.00 30 120 5.0 1 0 6 2
68 0 7 0 2.0 0 1 3.79 3.10 4.00 5 60 14.0 2 0 4 3
60 0 0 1 6.0 0 1 3.90 3.14 7.00 40 120 15.0 7 0 6 2
69 1 40 0 6.0 0 2 3.50 3.80 3.00 10 60 7.0 2 1 2 1
72 0 12 1 4.0 0 1 3.80 3.70 10.00 60 30 3.0 2 0 5 2
65 1 25 0 3.0 0 0 4.00 3.87 0.70 5 0 3.5 0 0 7 2
72 0 8 0 4.0 0 1 3.50 3.31 10.00 30 120 25.0 0 0 6 2
74 0 0 1 10.0 0 0 2.55 3.14 0.85 70 120 6.0 0 0 4 3
67 1 150 0 4.5 0 0 3.80 2.98 5.00 180 90 2.0 1 0 4 1
69 1 40 1 7.0 0 1 4.00 4.00 0.50 120 90 5.0 0 0 6 2
71 1 18 0 5.0 0 1 4.00 3.77 1.00 35 15 10.0 4 0 6 2
62 1 10 0 3.0 0 0 4.00 4.00 0.75 60 60 2.0 3 0 2 1
68 1 25 0 8.0 0 1 3.80 3.49 3.00 30 60 3.5 5 0 5 2
67 1 15 0 5.0 0 1 4.00 3.99 70.00 15 60 5.0 5 0 3 3
63 1 15 1 8.0 0 1 4.00 3.78 1.00 30 0 6.0 5 0 3 1
63 1 40 0 4.5 0 1 4.00 3.92 1.50 60 60 5.0 6 0 4 3
64 1 77 0 2.0 0 1 4.00 3.77 5.00 45 0 5.0 4 0 2 3
65 1 15 0 3.0 0 1 4.00 3.83 2.00 30 30 3.0 1 1 6 2
69 1 31 0 6.0 0 1 3.86 3.86 2.00 30 60 3.0 3 0 6 2
68 1 35 1 10.0 0 1 4.00 3.86 2.50 10 30 5.0 0 0 6 2
65 1 150 0 2.0 0 0 4.00 3.93 2.00 60 180 3.0 2 0 6 2
64 1 20 0 12.0 0 0 3.90 3.91 1.00 30 30 5.0 7 0 2 1
67 1 12 0 5.0 0 0 4.00 4.00 1.00 180 30 4.0 2 0 4 3
72 0 8 0 4.0 0 0 4.00 3.73 7.00 60 180 2.5 5 0 4 3
68 1 20 1 3.0 0 0 4.00 3.75 2.50 30 180 2.0 2 0 3 1
65 1 20 1 4.0 0 0 4.00 3.99 1.00 60 30 9.0 7 0 6 2
68 1 80 0 4.0 1 1 4.00 3.80 0.75 60 0 6.5 3 1 1 1
70 1 75 0 4.0 0 1 4.00 3.77 1.50 45 60 2.0 3 0 4 3
62 1 10 0 5.0 0 0 4.00 3.95 1.50 90 0 3.0 0 1 5 3
64 1 25 1 15.0 0 1 4.00 3.74 42.00 60 0 10.0 0 0 6 2
69 1 45 1 4.5 0 0 3.70 3.65 5.00 15 30 2.0 3 0 3 3
72 0 12 1 4.0 0 1 3.75 3.83 0.75 90 10 3.0 5 0 6 2
70 0 20 0 4.0 0 0 3.94 4.00 12.00 15 25 3.0 7 0 6 2
62 1 11 1 5.0 0 0 3.90 3.20 0.60 150 60 3.0 1 1 4 3
70 1 40 0 5.0 0 1 3.90 3.60 1.00 10 60 3.0 1 1 4 2
74 0 14 0 6.0 0 0 4.00 3.75 1.00 60 30 6.0 3 0 5 2
65 1 60 0 3.0 0 1 3.97 3.77 0.75 45 0 4.0 0 0 3 3
65 1 20 1 3.0 0 1 4.00 3.83 2.00 30 20 3.0 1 0 6 2
66 0 11 0 2.0 1 3 4.00 3.70 2.00 120 0 0.0 0 0 5 2
62 1 30 1 2.3 0 1 3.60 2.50 6.00 30 60 2.0 4 0 5 2

You can download the GA Student Survey dataset here

Variable Description

#One of the first things we can do to investigate a dataset is to use the str() and summary() commands. 

# --> The str() command returns the type of each variable as interpreted by R. The four types you will meet most often are character, integer, numeric, and factor. A character variable holds text and is usually qualitative. A factor is a qualitative variable whose categories (levels) are stored internally as integer codes, which makes it the natural type for a variable coded as 0 or 1. Integers are used for quantitative discrete variables and numerics for quantitative continuous variables.

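# --> The summary() command reports the minimum, first quartile, median, mean, third quartile, and maximum (the five-number summary plus the mean) for every numeric variable in the dataset. Be careful with qualitative variables recorded as integers, such as Gender or Job: R summarizes the 0/1 codes as if they were measurements, so the summaries for these variables must be interpreted carefully.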
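
# The commands below assume the survey has already been read into a data frame called gss. A minimal sketch of that step follows, assuming the table above is saved as whitespace-separated text with a header row. The object name gss_demo and the file path in the comment are only for illustration, and just the first three rows are typed in so that the example runs on its own.

```r
# First three rows of the survey table, typed inline so the sketch is self-contained
survey_text <- "Height Gender Haircut Job Studytime Smokecig Dated HSGPA CGPA HomeDist BrowseInternet WatchTV Exercise ReadNewsP Vegan PoliticalDegree PoliticalAff
65 1 25 1 7.0 0 0 3.90 3.30 1.00 60 120 0.0 4 0 4 2
71 0 12 0 2.0 0 1 3.79 3.13 5.00 20 120 1.0 0 0 2 3
68 1 4 0 4.0 0 1 3.00 3.60 15.00 60 90 1.0 1 0 2 3"

# header = TRUE takes the variable names from the first line
gss_demo <- read.table(text = survey_text, header = TRUE)

# For the full dataset, pass the path of the downloaded file instead, for example:
# gss <- read.table("path/to/survey_file.txt", header = TRUE)   # hypothetical path

# inspect the variable types
str(gss_demo)
```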


# Using str()
str(gss)
## 'data.frame':    59 obs. of  17 variables:
##  $ Height         : int  65 71 68 64 64 66 74 69 65 68 ...
##  $ Gender         : int  1 0 1 1 1 0 0 1 1 1 ...
##  $ Haircut        : int  25 12 4 0 50 10 0 18 55 41 ...
##  $ Job            : int  1 0 0 1 0 1 0 0 0 1 ...
##  $ Studytime      : num  7 2 4 3.5 4.5 3 3 8 5.5 5.5 ...
##  $ Smokecig       : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Dated          : int  0 1 1 1 1 3 0 1 0 0 ...
##  $ HSGPA          : num  3.9 3.79 3 3.9 3.6 3.2 3.78 3.2 3.5 3.8 ...
##  $ CGPA           : num  3.3 3.13 3.6 3.5 3.5 3.75 3.47 2.8 2.88 3.28 ...
##  $ HomeDist       : num  1 5 15 0.75 40 5 1 40 7 3 ...
##  $ BrowseInternet : int  60 20 60 10 30 30 0 10 15 10 ...
##  $ WatchTV        : int  120 120 90 120 120 120 0 60 90 60 ...
##  $ Exercise       : num  0 1 1 15 6 7 7 4 2.5 3 ...
##  $ ReadNewsP      : int  4 0 1 1 2 5 4 2 3 0 ...
##  $ Vegan          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoliticalDegree: int  4 2 2 4 5 4 6 6 7 5 ...
##  $ PoliticalAff   : int  2 3 3 2 2 1 2 2 2 2 ...
# using summary()
summary(gss)
##      Height          Gender          Haircut            Job        
##  Min.   :60.00   Min.   :0.0000   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:65.00   1st Qu.:0.0000   1st Qu.: 10.50   1st Qu.:0.0000  
##  Median :68.00   Median :1.0000   Median : 20.00   Median :0.0000  
##  Mean   :67.25   Mean   :0.7119   Mean   : 27.75   Mean   :0.3729  
##  3rd Qu.:69.00   3rd Qu.:1.0000   3rd Qu.: 40.00   3rd Qu.:1.0000  
##  Max.   :74.00   Max.   :1.0000   Max.   :150.00   Max.   :1.0000  
##    Studytime         Smokecig           Dated           HSGPA      
##  Min.   : 2.000   Min.   :0.00000   Min.   :0.000   Min.   :2.550  
##  1st Qu.: 3.000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:3.765  
##  Median : 4.500   Median :0.00000   Median :1.000   Median :3.900  
##  Mean   : 5.022   Mean   :0.05085   Mean   :0.678   Mean   :3.802  
##  3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.:1.000   3rd Qu.:4.000  
##  Max.   :15.000   Max.   :1.00000   Max.   :3.000   Max.   :4.000  
##       CGPA          HomeDist    BrowseInternet      WatchTV      
##  Min.   :2.500   Min.   : 0.5   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:3.495   1st Qu.: 1.0   1st Qu.: 15.00   1st Qu.: 30.00  
##  Median :3.750   Median : 2.5   Median : 30.00   Median : 60.00  
##  Mean   :3.612   Mean   : 8.0   Mean   : 46.53   Mean   : 65.85  
##  3rd Qu.:3.860   3rd Qu.: 6.5   3rd Qu.: 60.00   3rd Qu.:105.00  
##  Max.   :4.000   Max.   :70.0   Max.   :180.00   Max.   :180.00  
##     Exercise        ReadNewsP         Vegan        PoliticalDegree
##  Min.   : 0.000   Min.   :0.000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 3.000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:4.000  
##  Median : 4.000   Median :2.000   Median :0.0000   Median :5.000  
##  Mean   : 5.949   Mean   :2.593   Mean   :0.1017   Mean   :4.593  
##  3rd Qu.: 6.250   3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:6.000  
##  Max.   :60.000   Max.   :7.000   Max.   :1.0000   Max.   :7.000  
##   PoliticalAff  
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.119  
##  3rd Qu.:2.500  
##  Max.   :3.000
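
# One way to read the summary of a 0/1 variable: the mean of the codes equals the proportion of observations coded 1. A small sketch using the Gender counts in this dataset (17 zeros and 42 ones), which reproduces the mean of 0.7119 shown for Gender above. The vector name gender_codes is only for illustration.

```r
# 17 observations coded 0 and 42 coded 1, matching the Gender counts in this dataset
gender_codes <- c(rep(0L, 17), rep(1L, 42))

# the mean of a 0/1 variable is the share of 1s
mean(gender_codes)

# the same number computed directly as a proportion
sum(gender_codes == 1) / length(gender_codes)
```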
#======================================================================
#-----------------Description of a qualitative variable----------------
#======================================================================

#To demonstrate how to use R to create descriptive plots for qualitative variables, we will analyze the variable Gender. Several packages in R can be used for plotting. Base R has a set of built-in graphics functions for bar plots, boxplots, scatterplots, and so on. However, the ggplot2 package is the most popular choice for high-quality graphics. ggplot2 has a bit of a learning curve, but it offers a lot of flexibility for editing and customizing plots. The examples provided here use both base R plotting and ggplot2.

# the following packages are used in this demonstration:
library(ggplot2)
library(ggthemes)
library(ggpubr)

#to tell R that this is a qualitative variable we must convert it to a "factor". This tells R to treat the 0 and 1 values as categories

gss$Gender = as.factor(gss$Gender)
# now the variable will be reported as a factor with two levels corresponding to the two genders in the data
str(gss$Gender)
##  Factor w/ 2 levels "0","1": 2 1 2 2 2 1 1 2 2 2 ...
# applying the summary() command to a factor will count the number of observations in each factor level
summary(gss$Gender)
##  0  1 
## 17 42
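
# as.factor() keeps the raw codes "0" and "1" as the level names. If you prefer descriptive level names, factor() lets you attach them while converting. A minimal sketch on a toy vector, with placeholder labels ("group0", "group1") because the actual meaning of each code should be taken from the Variable Description:

```r
# toy vector of 0/1 codes (not the survey data)
codes <- c(1, 0, 1, 1, 0)

# levels lists the codes in order; labels names each code, in the same order
g <- factor(codes, levels = c(0, 1), labels = c("group0", "group1"))

levels(g)   # the level names are now the labels
table(g)    # counts per labelled level
```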
#For both the bar chart and the pie chart we need to supply the unique values and their proportions. We can compute the proportions (relative frequencies) by first computing the frequencies using the summary() command
#frequency
freq = summary(gss$Gender)

#relative frequency is then the frequency divided by the sample size (here rounded to the nearest hundredth)
relative.freq = round(freq/sum(freq), 2)

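#now we will repackage the variable Gender as a frequency table. The labels are matched to the factor levels in order ("0" first, then "1"), so check the Variable Description to confirm which code corresponds to which gender:
ft = cbind.data.frame(Gender = c('Female', 'Male'), Proportion = relative.freq)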
print(ft)
##   Gender Proportion
## 0 Female       0.29
## 1   Male       0.71
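
# The same relative frequencies can also be obtained with table() and prop.table(). A short sketch on a stand-in factor built from the counts shown above (17 and 42), so that it runs without the gss data frame:

```r
# stand-in for gss$Gender: 17 observations at level "0" and 42 at level "1"
gender <- factor(c(rep(0, 17), rep(1, 42)))

freq <- table(gender)                          # frequencies
relative.freq <- round(prop.table(freq), 2)    # relative frequencies
relative.freq
```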
# We will first demonstrate a pie chart and barplot using the basic plotting mechanics in R. You can use ?pie and ?barplot to read the documentation for these functions and learn more about the arguments

# first argument is a set of non-negative values such as counts or proportions
pie(ft$Proportion, 
    #supply labels
    labels = ft$Gender, 
    #edit colors
    col = c('red','orange'))

#first argument is the height of the bars
barplot(height = ft$Proportion, 
        #supply the legend labels
        legend = ft$Gender, 
        #color the bars
        col = c('red','orange'),
        #additional arguments to change the position of the legend 
        args.legend=list(
      x=1,
      y=1,
      bty = "n"
    ))

#Any ggplot is built up from adding together different plot components. The general recipe is a basic plot frame ggplot() plus geometry elements that correspond to the custom plot components you want. Examples include:
#geom_bar()
#geom_boxplot()
#geom_histogram()
#geom_point()
#geom_line()
#geom_dotplot()
#geom_step()
# and many many more...
#many of these components can be combined in different ways.

# every plot needs an aes() call, which is the aesthetic mapping. It tells ggplot which variables supply the x and y coordinates, as well as the colors, fills, and sizes of points or lines.

#Pie chart using the ggplot2 package
pie1 = ggplot(ft, aes(x = "", y = Proportion, fill = Gender)) +
  # geom_col is another type of barplot offered by ggplot2
  geom_col(color = "black") +
  #label the proportions onto plot
  geom_text(aes(label = Proportion), 
            position = position_stack(vjust = 0.6), size = 5) +
  #convert bar plot into pie chart
  coord_polar(theta = "y") +
  # use reds palette
  scale_fill_brewer(palette = 'Reds') +
  #use void theme to remove all axes from plot area
  theme_void()+
  # change legend text size
  theme(legend.text = element_text(size = 12))+
  # Add a title 
  ggtitle('Gender')

plot(pie1)

# Similarly, ggplot2 can make bar plots from the same frequency table we created above:

#bar chart
BP1 = ggplot(data = ft, aes(x = Gender, y = Proportion, fill = Gender))+
  #apply bar chart
  geom_bar(stat = 'identity', color = 'black', size = 0.8)+
  #use a base plotting theme - HC theme
  theme_hc()+
  #change the tick points of the y axis (the largest proportion is 0.71, so the breaks run to 0.8)
  scale_y_continuous(breaks = seq(0, 0.8, 0.1))+
  #use reds color palette
  scale_fill_brewer(palette = 'Reds')+
  #set plot legend to top of plot
  theme(legend.position = 'top')+
  #the theme command controls various aesthetics related to the plot area, margins, and text
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14))+
  # Add a title to the plot
  ggtitle("Gender")


plot(BP1)

#======================================================================
#----------------Description of a quantitative variable----------------
#======================================================================

#For a description of a quantitative variable we will analyze the variable Studytime, which is the self-reported amount of time (in hours) that Georgia college students spent studying each week.


#We will start by making a simple stem plot. R has a nice built in function called stem() which makes this process very simple and easy.

print(gss$Studytime)
##  [1]  7.0  2.0  4.0  3.5  4.5  3.0  3.0  8.0  5.5  5.5  5.0  7.0  9.0  4.0 10.0
## [16]  3.0  3.0  3.5  4.5  3.5  2.0  6.0  6.0  4.0  3.0  4.0 10.0  4.5  7.0  5.0
## [31]  3.0  8.0  5.0  8.0  4.5  2.0  3.0  6.0 10.0  2.0 12.0  5.0  4.0  3.0  4.0
## [46]  4.0  4.0  5.0 15.0  4.5  4.0  4.0  5.0  5.0  6.0  3.0  3.0  2.0  2.3
#Looking at the values of the data, we can see that almost all times are recorded to the nearest half hour (one student reported 2.3 hours).

#Stem plot
stem(gss$Studytime)
## 
##   The decimal point is at the |
## 
##    2 | 0000030000000000555
##    4 | 000000000055555000000055
##    6 | 0000000
##    8 | 0000
##   10 | 000
##   12 | 0
##   14 | 0
#by default stem() chooses the stems automatically. Here each stem covers two hours (2-3, 4-5, and so on), so on the first row the leaf 3 is the value 2.3 and the leaves 5 are the values 3.5

#to control this you can set the "scale" argument. We will remake the stem plot using a scale value of 2, which doubles the number of stems so that each stem covers a single hour
stem(gss$Studytime, scale = 2) 
## 
##   The decimal point is at the |
## 
##    2 | 000003
##    3 | 0000000000555
##    4 | 000000000055555
##    5 | 000000055
##    6 | 0000
##    7 | 000
##    8 | 000
##    9 | 0
##   10 | 000
##   11 | 
##   12 | 0
##   13 | 
##   14 | 
##   15 | 0
#Base R graphics offers only stripchart() for this kind of plot, so we will make the dot plot with ggplot2, which gives more control over the result.
#Dot plot using ggplot
ggplot(data = gss, aes(x = Studytime))+
  geom_dotplot(dotsize = 0.5)+
  theme_classic()+
  #remove y label
  ylab("")+
  #set x label
  xlab("Study Time (hrs)")+
  #give plot a title and subtitle
  ggtitle('Dotplot of Student Study Time (Hrs)', subtitle = 'Georgia Student Survey')+
  #rescale x axis to show more tick marks
  scale_x_continuous(limits = c(min(gss$Studytime), max(gss$Studytime)), 
                     breaks = seq(min(gss$Studytime), max(gss$Studytime), 1))+
  theme(axis.text.x = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        #remove y axis and background
        panel.background = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line.y = element_blank())
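
# For comparison, base R can draw a stacked dot plot with stripchart(). A minimal sketch, with the Studytime values copied from the print(gss$Studytime) output above so that it runs on its own (the name studytime is only for illustration):

```r
# Studytime values copied from the printed output above (59 students)
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

# method = "stack" piles repeated values on top of each other instead of overplotting them
stripchart(studytime,
           method = "stack",
           # use filled circles
           pch = 16,
           xlab = "Study Time (hrs)",
           main = "Dotplot of Student Study Time (Hrs)")
```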

#Next we will demonstrate how to construct histograms using base R and ggplot2. Note that both will choose the number of bins automatically unless told otherwise. The hist() function in base R uses Sturges' rule by default, which is one of the reasonable choices for the number of bins that we discussed in class. You can use ?hist to find out more about the input arguments for this function.
#Histogram of studytime using Base R graphics
hist(gss$Studytime, 
     xlab = 'Study Time (hrs)',
     ylab = 'Frequency',
     main = 'College Student Weekly Time Spent Studying')
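
# Sturges' rule suggests ceiling(log2(n) + 1) bins for n observations. A quick check for the 59 students in this dataset. nclass.Sturges() is the helper from the grDevices package that hist() relies on by default; hist() may still adjust the final count because it places the break points at "pretty" values.

```r
# sample size reported by str(gss)
n <- 59

# Sturges' rule computed by hand
k_sturges <- ceiling(log2(n) + 1)
k_sturges

# nclass.Sturges() only looks at the number of values, so a placeholder vector is enough
nclass.Sturges(numeric(n))
```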

#Note that geom_histogram uses 30 bins by default. If you want to choose a custom number of bins or a custom bin width you can use the bins or binwidth arguments. Use ?geom_histogram to see all arguments
#Histogram using ggplot2: 
ggplot(data = gss, aes(x = Studytime))+
  geom_histogram(color = 'black', fill = 'lightblue')+
  theme_classic2()+
  ylab("Frequency")+
  xlab("Study Time (hrs)")+
  theme(axis.text.x = element_text(size = 12),
        axis.title.x = element_text(size = 14))

# Histogram using ggplot2 with the square-root rule for the number of bins
#find the number of observations
n = length(gss$Studytime)
k = ceiling(sqrt(n)) # rounded up square root of the sample size

ggplot(data = gss, aes(x = Studytime))+
  geom_histogram(color = 'black', fill = 'lightblue', bins = k)+
  theme_classic2()+
  ylab("Frequency")+
  xlab("Study Time (hrs)")+
  theme(axis.text.x = element_text(size = 12),
        axis.title.x = element_text(size = 14))
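
# In base R, hist() treats a single number passed to breaks as a suggestion only. To get exactly k bins you can pass the break points themselves. A sketch using the square-root rule, with the Studytime values copied from the printed output above (the names studytime, bin_edges, and h are only for illustration):

```r
# Studytime values copied from the printed output above (59 students)
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

n <- length(studytime)
k <- ceiling(sqrt(n))   # rounded up square root of the sample size

# k bins need k + 1 equally spaced edges between the minimum and the maximum
bin_edges <- seq(min(studytime), max(studytime), length.out = k + 1)

h <- hist(studytime,
          breaks = bin_edges,
          xlab = 'Study Time (hrs)',
          ylab = 'Frequency',
          main = 'College Student Weekly Time Spent Studying')

# number of bins actually drawn
length(h$counts)
```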

#Next we demonstrate the boxplot using Base R graphics and ggplot2
#Boxplot using base R

boxplot(gss$Studytime, 
        # make the boxplot horizontal instead of vertical
        horizontal = TRUE,
        # set axis label
        xlab = 'Study Time (Hrs)',
        # set plot title
        main = 'College Student Weekly Study Time')

#if you want to label the boxplot you can use the text() function (this function is actually applicable to all base R plots)
#find min, Q1, median, Q3, and max
minss = min(gss$Studytime)
Q1 = quantile(gss$Studytime, 0.25, type = 2)
medss = median(gss$Studytime)
Q3 = quantile(gss$Studytime, 0.75, type = 2)
maxss = max(gss$Studytime)

#now group all estimates into a vector
estimates = c(minss, Q1, medss, Q3, maxss)

boxplot(gss$Studytime, 
        # make the boxplot horizontal instead of vertical
        horizontal = TRUE,
        # set axis label
        xlab = 'Study Time (Hrs)',
        # set plot title
        main = 'College Student Weekly Study Time')
# the estimates will serve as the x-coords on the plot
text(x=estimates, 
     # create the labels using the paste function 
     labels =paste(c("Min", "Q1","Median", "Q3", "Max"), estimates, sep = "="),
     # set the y-positions of the labels - labels will be offset so they do not overlap
     y=c(1.3,1.25,1.3,1.25,1.3), cex = 0.6)
#also add the IQR
text(x=medss, labels =paste("IQR", Q3 - Q1, sep = " = "), y=1.5, cex = 0.7)
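
# The points drawn beyond the whiskers are the observations that lie more than 1.5 times the box length outside the box, which is the default rule in boxplot(). A sketch of that calculation using the quartiles computed above, with the Studytime values copied from the printed output so that it runs on its own. The names lower_fence, upper_fence, and outliers are only for illustration.

```r
# Studytime values copied from the printed output above (59 students)
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

# quartiles with the same quantile type as in the script; unname() drops the "25%" / "75%" names
Q1 <- unname(quantile(studytime, 0.25, type = 2))
Q3 <- unname(quantile(studytime, 0.75, type = 2))
iqr <- Q3 - Q1

# observations beyond these fences are flagged as outliers
lower_fence <- Q1 - 1.5 * iqr
upper_fence <- Q3 + 1.5 * iqr

outliers <- sort(studytime[studytime < lower_fence | studytime > upper_fence])
outliers
```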

#Now using ggplot2 - Note that adding labels on ggplot2 is a bit more complicated but it can be done using the geom_label or geom_text functions
minss = min(gss$Studytime)
Q1 = quantile(gss$Studytime, 0.25, type = 2)
medss = median(gss$Studytime)
Q3 = quantile(gss$Studytime, 0.75, type = 2)
maxss = max(gss$Studytime)

ggplot()+
  geom_boxplot(aes(x = gss$Studytime),
               fill = 'lightgrey',
               #choose shape of outlier points
               outlier.shape = 23,
               #choose fill color 
               outlier.fill = 'green',
               # choose size
               outlier.size = 5)+
        # add min and max values as points
        geom_point(aes(x = c(minss, maxss), y = c(0, 0)), 
                   shape = 21, 
                   fill = 'red', 
                   size = 4)+
  
  # label the min, Q1, median, Q3, and max
        geom_label(aes(x = minss, y = 0.05), label = paste('min =', minss, collapse = ''), 
                   label.size = 0.3)+
        geom_label(aes(x = maxss, y = 0.05), label = paste('max =', maxss, collapse = ''),
                   label.size = 0.3)+
        geom_label(aes(x = medss, y = 0.45), label = paste('median =', medss, collapse = ''),
                   label.size = 0.3)+
        geom_label(aes(x = Q1, y = 0.405), label = paste('Q1 =', Q1, collapse = ''),
                   label.size = 0.3)+
        geom_label(aes(x = Q3, y = 0.405), label = paste('Q3 =', Q3, collapse = ''),
                   label.size = 0.3)+
        geom_label(aes(x = medss, y = 0.495), label = paste('IQR =', Q3-Q1, collapse = ''),
                   label.size = 0.3)+
        scale_x_continuous(limits = c(minss-0.5, 
                                      maxss+0.5),
                           n.breaks = 10)+
        theme_classic()+
        #remove y axis and change x-axis text size
        theme(axis.text.y = element_blank(),
              axis.ticks.y = element_blank(),
              axis.text.x = element_text(size = 12),
              axis.title.y = element_blank(),
              axis.line.y = element_blank())+
        xlab('Study Time (Hrs)')+
        ggtitle('College Student Weekly Study Time')