In this example R script, I will demonstrate how you can use R to construct some of the graphs and descriptive statistics that we have learned so far. As an example dataset, we will analyze data from a survey of college students in Georgia. The data are given below:
Height | Gender | Haircut | Job | Studytime | Smokecig | Dated | HSGPA | CGPA | HomeDist | BrowseInternet | WatchTV | Exercise | ReadNewsP | Vegan | PoliticalDegree | PoliticalAff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | 1 | 25 | 1 | 7.0 | 0 | 0 | 3.90 | 3.30 | 1.00 | 60 | 120 | 0.0 | 4 | 0 | 4 | 2 |
71 | 0 | 12 | 0 | 2.0 | 0 | 1 | 3.79 | 3.13 | 5.00 | 20 | 120 | 1.0 | 0 | 0 | 2 | 3 |
68 | 1 | 4 | 0 | 4.0 | 0 | 1 | 3.00 | 3.60 | 15.00 | 60 | 90 | 1.0 | 1 | 0 | 2 | 3 |
64 | 1 | 0 | 1 | 3.5 | 0 | 1 | 3.90 | 3.50 | 0.75 | 10 | 120 | 15.0 | 1 | 0 | 4 | 2 |
64 | 1 | 50 | 0 | 4.5 | 0 | 1 | 3.60 | 3.50 | 40.00 | 30 | 120 | 6.0 | 2 | 0 | 5 | 2 |
66 | 0 | 10 | 1 | 3.0 | 0 | 3 | 3.20 | 3.75 | 5.00 | 30 | 120 | 7.0 | 5 | 0 | 4 | 1 |
74 | 0 | 0 | 0 | 3.0 | 0 | 0 | 3.78 | 3.47 | 1.00 | 0 | 0 | 7.0 | 4 | 0 | 6 | 2 |
69 | 1 | 18 | 0 | 8.0 | 1 | 1 | 3.20 | 2.80 | 40.00 | 10 | 60 | 4.0 | 2 | 0 | 6 | 2 |
65 | 1 | 55 | 0 | 5.5 | 0 | 0 | 3.50 | 2.88 | 7.00 | 15 | 90 | 2.5 | 3 | 0 | 7 | 2 |
68 | 1 | 41 | 1 | 5.5 | 0 | 0 | 3.80 | 3.28 | 3.00 | 10 | 60 | 3.0 | 0 | 0 | 5 | 2 |
65 | 1 | 40 | 1 | 5.0 | 0 | 1 | 3.89 | 3.53 | 2.90 | 15 | 60 | 4.0 | 1 | 0 | 6 | 2 |
66 | 1 | 14 | 0 | 7.0 | 0 | 1 | 3.80 | 3.50 | 4.00 | 30 | 180 | 5.0 | 3 | 0 | 2 | 2 |
62 | 1 | 35 | 1 | 9.0 | 0 | 0 | 4.00 | 3.98 | 1.20 | 15 | 45 | 7.5 | 4 | 0 | 6 | 2 |
72 | 0 | 10 | 0 | 4.0 | 0 | 1 | 3.30 | 2.60 | 2.00 | 120 | 180 | 60.0 | 5 | 0 | 4 | 3 |
69 | 1 | 45 | 1 | 10.0 | 0 | 0 | 3.70 | 3.50 | 30.00 | 10 | 60 | 0.0 | 0 | 0 | 6 | 2 |
69 | 1 | 0 | 0 | 3.0 | 0 | 0 | 3.90 | 3.98 | 40.00 | 30 | 20 | 1.0 | 1 | 0 | 7 | 2 |
68 | 0 | 0 | 0 | 3.0 | 0 | 0 | 3.50 | 3.75 | 2.50 | 30 | 60 | 8.0 | 5 | 0 | 5 | 2 |
72 | 0 | 0 | 0 | 3.5 | 0 | 1 | 3.90 | 3.67 | 2.50 | 75 | 90 | 7.0 | 0 | 0 | 5 | 2 |
66 | 1 | 20 | 0 | 4.5 | 0 | 1 | 4.00 | 3.75 | 8.00 | 40 | 120 | 3.0 | 5 | 0 | 4 | 2 |
69 | 0 | 12 | 1 | 3.5 | 0 | 0 | 3.50 | 3.90 | 40.00 | 30 | 120 | 5.0 | 1 | 0 | 6 | 2 |
68 | 0 | 7 | 0 | 2.0 | 0 | 1 | 3.79 | 3.10 | 4.00 | 5 | 60 | 14.0 | 2 | 0 | 4 | 3 |
60 | 0 | 0 | 1 | 6.0 | 0 | 1 | 3.90 | 3.14 | 7.00 | 40 | 120 | 15.0 | 7 | 0 | 6 | 2 |
69 | 1 | 40 | 0 | 6.0 | 0 | 2 | 3.50 | 3.80 | 3.00 | 10 | 60 | 7.0 | 2 | 1 | 2 | 1 |
72 | 0 | 12 | 1 | 4.0 | 0 | 1 | 3.80 | 3.70 | 10.00 | 60 | 30 | 3.0 | 2 | 0 | 5 | 2 |
65 | 1 | 25 | 0 | 3.0 | 0 | 0 | 4.00 | 3.87 | 0.70 | 5 | 0 | 3.5 | 0 | 0 | 7 | 2 |
72 | 0 | 8 | 0 | 4.0 | 0 | 1 | 3.50 | 3.31 | 10.00 | 30 | 120 | 25.0 | 0 | 0 | 6 | 2 |
74 | 0 | 0 | 1 | 10.0 | 0 | 0 | 2.55 | 3.14 | 0.85 | 70 | 120 | 6.0 | 0 | 0 | 4 | 3 |
67 | 1 | 150 | 0 | 4.5 | 0 | 0 | 3.80 | 2.98 | 5.00 | 180 | 90 | 2.0 | 1 | 0 | 4 | 1 |
69 | 1 | 40 | 1 | 7.0 | 0 | 1 | 4.00 | 4.00 | 0.50 | 120 | 90 | 5.0 | 0 | 0 | 6 | 2 |
71 | 1 | 18 | 0 | 5.0 | 0 | 1 | 4.00 | 3.77 | 1.00 | 35 | 15 | 10.0 | 4 | 0 | 6 | 2 |
62 | 1 | 10 | 0 | 3.0 | 0 | 0 | 4.00 | 4.00 | 0.75 | 60 | 60 | 2.0 | 3 | 0 | 2 | 1 |
68 | 1 | 25 | 0 | 8.0 | 0 | 1 | 3.80 | 3.49 | 3.00 | 30 | 60 | 3.5 | 5 | 0 | 5 | 2 |
67 | 1 | 15 | 0 | 5.0 | 0 | 1 | 4.00 | 3.99 | 70.00 | 15 | 60 | 5.0 | 5 | 0 | 3 | 3 |
63 | 1 | 15 | 1 | 8.0 | 0 | 1 | 4.00 | 3.78 | 1.00 | 30 | 0 | 6.0 | 5 | 0 | 3 | 1 |
63 | 1 | 40 | 0 | 4.5 | 0 | 1 | 4.00 | 3.92 | 1.50 | 60 | 60 | 5.0 | 6 | 0 | 4 | 3 |
64 | 1 | 77 | 0 | 2.0 | 0 | 1 | 4.00 | 3.77 | 5.00 | 45 | 0 | 5.0 | 4 | 0 | 2 | 3 |
65 | 1 | 15 | 0 | 3.0 | 0 | 1 | 4.00 | 3.83 | 2.00 | 30 | 30 | 3.0 | 1 | 1 | 6 | 2 |
69 | 1 | 31 | 0 | 6.0 | 0 | 1 | 3.86 | 3.86 | 2.00 | 30 | 60 | 3.0 | 3 | 0 | 6 | 2 |
68 | 1 | 35 | 1 | 10.0 | 0 | 1 | 4.00 | 3.86 | 2.50 | 10 | 30 | 5.0 | 0 | 0 | 6 | 2 |
65 | 1 | 150 | 0 | 2.0 | 0 | 0 | 4.00 | 3.93 | 2.00 | 60 | 180 | 3.0 | 2 | 0 | 6 | 2 |
64 | 1 | 20 | 0 | 12.0 | 0 | 0 | 3.90 | 3.91 | 1.00 | 30 | 30 | 5.0 | 7 | 0 | 2 | 1 |
67 | 1 | 12 | 0 | 5.0 | 0 | 0 | 4.00 | 4.00 | 1.00 | 180 | 30 | 4.0 | 2 | 0 | 4 | 3 |
72 | 0 | 8 | 0 | 4.0 | 0 | 0 | 4.00 | 3.73 | 7.00 | 60 | 180 | 2.5 | 5 | 0 | 4 | 3 |
68 | 1 | 20 | 1 | 3.0 | 0 | 0 | 4.00 | 3.75 | 2.50 | 30 | 180 | 2.0 | 2 | 0 | 3 | 1 |
65 | 1 | 20 | 1 | 4.0 | 0 | 0 | 4.00 | 3.99 | 1.00 | 60 | 30 | 9.0 | 7 | 0 | 6 | 2 |
68 | 1 | 80 | 0 | 4.0 | 1 | 1 | 4.00 | 3.80 | 0.75 | 60 | 0 | 6.5 | 3 | 1 | 1 | 1 |
70 | 1 | 75 | 0 | 4.0 | 0 | 1 | 4.00 | 3.77 | 1.50 | 45 | 60 | 2.0 | 3 | 0 | 4 | 3 |
62 | 1 | 10 | 0 | 5.0 | 0 | 0 | 4.00 | 3.95 | 1.50 | 90 | 0 | 3.0 | 0 | 1 | 5 | 3 |
64 | 1 | 25 | 1 | 15.0 | 0 | 1 | 4.00 | 3.74 | 42.00 | 60 | 0 | 10.0 | 0 | 0 | 6 | 2 |
69 | 1 | 45 | 1 | 4.5 | 0 | 0 | 3.70 | 3.65 | 5.00 | 15 | 30 | 2.0 | 3 | 0 | 3 | 3 |
72 | 0 | 12 | 1 | 4.0 | 0 | 1 | 3.75 | 3.83 | 0.75 | 90 | 10 | 3.0 | 5 | 0 | 6 | 2 |
70 | 0 | 20 | 0 | 4.0 | 0 | 0 | 3.94 | 4.00 | 12.00 | 15 | 25 | 3.0 | 7 | 0 | 6 | 2 |
62 | 1 | 11 | 1 | 5.0 | 0 | 0 | 3.90 | 3.20 | 0.60 | 150 | 60 | 3.0 | 1 | 1 | 4 | 3 |
70 | 1 | 40 | 0 | 5.0 | 0 | 1 | 3.90 | 3.60 | 1.00 | 10 | 60 | 3.0 | 1 | 1 | 4 | 2 |
74 | 0 | 14 | 0 | 6.0 | 0 | 0 | 4.00 | 3.75 | 1.00 | 60 | 30 | 6.0 | 3 | 0 | 5 | 2 |
65 | 1 | 60 | 0 | 3.0 | 0 | 1 | 3.97 | 3.77 | 0.75 | 45 | 0 | 4.0 | 0 | 0 | 3 | 3 |
65 | 1 | 20 | 1 | 3.0 | 0 | 1 | 4.00 | 3.83 | 2.00 | 30 | 20 | 3.0 | 1 | 0 | 6 | 2 |
66 | 0 | 11 | 0 | 2.0 | 1 | 3 | 4.00 | 3.70 | 2.00 | 120 | 0 | 0.0 | 0 | 0 | 5 | 2 |
62 | 1 | 30 | 1 | 2.3 | 0 | 1 | 3.60 | 2.50 | 6.00 | 30 | 60 | 2.0 | 4 | 0 | 5 | 2 |
You can download the GA Student Survey dataset here
Variable Description
Height - height measured in inches
Gender - Gender recorded as female = 0, male = 1
Haircut - Cost (in whole dollars) willing to spend on a haircut
Job - If the student is employed (Yes = 1, No = 0)
Studytime - Average time spent studying over a week (recorded in hours)
Smokecig - If the student smokes recorded as No = 0 and Yes = 1
Dated - Number of romantic dates in a typical week
HSGPA - High school Grade Point Average (GPA)
CGPA - College Grade Point Average (GPA)
HomeDist - Distance from home to college (in miles)
BrowseInternet - Time spent each day browsing the web (recorded in minutes)
WatchTV - Time spent watching TV each day (recorded in minutes)
Exercise - Average time spent exercising over a week (recorded in hours)
ReadNewsP - Number of days a week the student reads the newspaper
Vegan - Whether the student is a vegan (Yes = 1, No = 0)
PoliticalDegree - Degree of conservatism or liberalism (1 = extremely liberal, 2 = liberal, 3 = slightly liberal, 4 = moderate, 5 = slightly conservative, 6 = conservative, 7 = extremely conservative)
PoliticalAff - The self-reported political affiliation of the student (1 = Democrat, 2 = Republican, 3 = Independent)
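The code below refers to the data as a data frame named gss, but the step that loads it is not shown. A minimal sketch of that step, assuming the download is a comma-separated file with the column names listed above (the file format and path are assumptions, so a temporary file holding the first three rows of the table stands in for the real download):

```r
# Hypothetical stand-in for the downloaded file: a temporary CSV that holds
# the first three rows of the table above. Replace csv_path with your own path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Height,Gender,Haircut,Job,Studytime,Smokecig,Dated,HSGPA,CGPA,HomeDist,BrowseInternet,WatchTV,Exercise,ReadNewsP,Vegan,PoliticalDegree,PoliticalAff",
  "65,1,25,1,7.0,0,0,3.90,3.30,1.00,60,120,0.0,4,0,4,2",
  "71,0,12,0,2.0,0,1,3.79,3.13,5.00,20,120,1.0,0,0,2,3",
  "68,1,4,0,4.0,0,1,3.00,3.60,15.00,60,90,1.0,1,0,2,3"
), csv_path)

# read.csv() returns a data frame with one column per variable
gss_demo <- read.csv(csv_path)

# rows and columns that were read
dim(gss_demo)
```

With the full file, the same read.csv() call assigned to gss would give the 59 observations of 17 variables reported by str() below.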
#One of the first things we can do to investigate a dataset is to use the str() and summary() commands.
# --> The str() command returns the type of each variable as interpreted by R. In R, there are four main types: character, integer, numeric, and factor. A character is a qualitative variable stored as text. A factor is a qualitative variable whose categories are represented with numeric values like 0 or 1. Integers are quantitative discrete variables and numerics are quantitative continuous variables.
# --> The summary() command will give a five-number summary, plus the mean, for every numeric variable in the dataset. Be careful with qualitative variables recorded as integers: R treats their codes as numbers, so the summary for these variables must be interpreted carefully.
# Using str()
str(gss)
## 'data.frame': 59 obs. of 17 variables:
## $ Height : int 65 71 68 64 64 66 74 69 65 68 ...
## $ Gender : int 1 0 1 1 1 0 0 1 1 1 ...
## $ Haircut : int 25 12 4 0 50 10 0 18 55 41 ...
## $ Job : int 1 0 0 1 0 1 0 0 0 1 ...
## $ Studytime : num 7 2 4 3.5 4.5 3 3 8 5.5 5.5 ...
## $ Smokecig : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Dated : int 0 1 1 1 1 3 0 1 0 0 ...
## $ HSGPA : num 3.9 3.79 3 3.9 3.6 3.2 3.78 3.2 3.5 3.8 ...
## $ CGPA : num 3.3 3.13 3.6 3.5 3.5 3.75 3.47 2.8 2.88 3.28 ...
## $ HomeDist : num 1 5 15 0.75 40 5 1 40 7 3 ...
## $ BrowseInternet : int 60 20 60 10 30 30 0 10 15 10 ...
## $ WatchTV : int 120 120 90 120 120 120 0 60 90 60 ...
## $ Exercise : num 0 1 1 15 6 7 7 4 2.5 3 ...
## $ ReadNewsP : int 4 0 1 1 2 5 4 2 3 0 ...
## $ Vegan : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoliticalDegree: int 4 2 2 4 5 4 6 6 7 5 ...
## $ PoliticalAff : int 2 3 3 2 2 1 2 2 2 2 ...
# using summary()
summary(gss)
## Height Gender Haircut Job
## Min. :60.00 Min. :0.0000 Min. : 0.00 Min. :0.0000
## 1st Qu.:65.00 1st Qu.:0.0000 1st Qu.: 10.50 1st Qu.:0.0000
## Median :68.00 Median :1.0000 Median : 20.00 Median :0.0000
## Mean :67.25 Mean :0.7119 Mean : 27.75 Mean :0.3729
## 3rd Qu.:69.00 3rd Qu.:1.0000 3rd Qu.: 40.00 3rd Qu.:1.0000
## Max. :74.00 Max. :1.0000 Max. :150.00 Max. :1.0000
## Studytime Smokecig Dated HSGPA
## Min. : 2.000 Min. :0.00000 Min. :0.000 Min. :2.550
## 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:3.765
## Median : 4.500 Median :0.00000 Median :1.000 Median :3.900
## Mean : 5.022 Mean :0.05085 Mean :0.678 Mean :3.802
## 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :15.000 Max. :1.00000 Max. :3.000 Max. :4.000
## CGPA HomeDist BrowseInternet WatchTV
## Min. :2.500 Min. : 0.5 Min. : 0.00 Min. : 0.00
## 1st Qu.:3.495 1st Qu.: 1.0 1st Qu.: 15.00 1st Qu.: 30.00
## Median :3.750 Median : 2.5 Median : 30.00 Median : 60.00
## Mean :3.612 Mean : 8.0 Mean : 46.53 Mean : 65.85
## 3rd Qu.:3.860 3rd Qu.: 6.5 3rd Qu.: 60.00 3rd Qu.:105.00
## Max. :4.000 Max. :70.0 Max. :180.00 Max. :180.00
## Exercise ReadNewsP Vegan PoliticalDegree
## Min. : 0.000 Min. :0.000 Min. :0.0000 Min. :1.000
## 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:4.000
## Median : 4.000 Median :2.000 Median :0.0000 Median :5.000
## Mean : 5.949 Mean :2.593 Mean :0.1017 Mean :4.593
## 3rd Qu.: 6.250 3rd Qu.:4.000 3rd Qu.:0.0000 3rd Qu.:6.000
## Max. :60.000 Max. :7.000 Max. :1.0000 Max. :7.000
## PoliticalAff
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.119
## 3rd Qu.:2.500
## Max. :3.000
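# As a small worked example of why 0/1-coded variables need care: the mean of a 0/1 variable is just the proportion of 1s. The sketch below rebuilds a Gender-like vector from the counts reported later in this script (17 zeros and 42 ones); the vector name is an assumption used only for illustration.

```r
# Hypothetical stand-in for gss$Gender: 17 zeros and 42 ones
gender_codes <- c(rep(0L, 17), rep(1L, 42))

# The "mean" of a 0/1 variable is the proportion of ones (42 / 59)
mean(gender_codes)

# Converting to a factor makes summary() report category counts instead
summary(factor(gender_codes))
```

# The value 42/59 rounds to 0.7119, which matches the Mean entry for Gender in the summary() output above.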
#======================================================================
#-----------------Description of a qualitative variable----------------
#======================================================================
#To demonstrate how to use R to create descriptive plots for qualitative variables, we will analyze the variable Gender. Note that there are several packages in R that can be used for plotting. Base R has a set of basic graphics functions for bar plots, boxplots, scatterplots and so on. However, the ggplot2 package is the most popular choice for high-quality graphics. ggplot2 has a bit of a learning curve, but it offers a lot of flexibility for editing and customizing plots. The examples provided here will be done with both base R plotting and ggplot2.
# the following packages are used in this demonstration:
library(ggplot2)
library(ggthemes)
library(ggpubr)
#to tell R that this is a qualitative variable we must convert it to a "factor". This tells R to treat the 0 and 1 values as categories
gss$Gender = as.factor(gss$Gender)
# now the variable will be reported as a factor with two levels corresponding to the two genders in the data
str(gss$Gender)
## Factor w/ 2 levels "0","1": 2 1 2 2 2 1 1 2 2 2 ...
# applying the summary() command to a factor will count the number of observations in each factor level
summary(gss$Gender)
## 0 1
## 17 42
#For both the bar chart and the pie chart we need to supply the unique values and their proportions. We can compute the proportions (relative frequencies) by first computing the frequencies using the summary() command
#frequency
freq = summary(gss$Gender)
#relative frequency is then the frequency divided by the sample size (here rounded to the nearest hundredth)
relative.freq = round(freq/sum(freq), 2)
#now we will repackage the variable Gender as a frequency table:
ft = cbind.data.frame(Gender = c('Female', 'Male'), Proportion = relative.freq)
print(ft)
## Gender Proportion
## 0 Female 0.29
## 1 Male 0.71
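# The same frequency table can also be built with table() and prop.table(), which is a common alternative to summary() for counting categories. This is only a sketch: the factor is rebuilt from the counts shown above (17 and 42), and the names gender_f and ft_alt are assumptions used for illustration.

```r
# Hypothetical stand-in for gss$Gender, labelled using the coding given above
gender_f <- factor(c(rep(0, 17), rep(1, 42)),
                   levels = c(0, 1),
                   labels = c("Female", "Male"))

# table() counts observations per level; prop.table() converts counts to proportions
counts <- table(gender_f)
props <- round(prop.table(counts), 2)

# Repackage as a data frame with the same columns as ft
ft_alt <- data.frame(Gender = names(props), Proportion = as.numeric(props))
print(ft_alt)
```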
# We will first demonstrate a pie chart and barplot using the basic plotting mechanics in R. You can use ?pie and ?barplot to read the documentation for these functions and learn more about the arguments
# first argument is a set of non-negative values such as counts or proportions
pie(ft$Proportion,
#supply labels
labels = ft$Gender,
#edit colors
col = c('red','orange'))
#first argument is the height of the bars
barplot(height = ft$Proportion,
#supply the legend labels
legend = ft$Gender,
#color the bars
col = c('red','orange'),
#additional arguments to change the position of the legend
args.legend=list(
x=1,
y=1,
bty = "n"
))
#Any ggplot is built up from adding together different plot components. The general recipe is a basic plot frame ggplot() plus geometry elements that correspond to the custom plot components you want. Examples include:
#geom_bar()
#geom_boxplot()
#geom_histogram()
#geom_point()
#geom_line()
#geom_dotplot()
#geom_step()
# and many many more...
#many of these components can be combined in different ways.
# every plot must include an aes() argument, which is called the aesthetic mapping. This tells ggplot which variables will be the x and y coordinates, as well as which variables control coloring schemes and the sizes of points or lines.
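# The recipe described above (plot frame + aesthetic mapping + geometry layers) can be sketched on a tiny made-up data frame. The data frame toy and its columns are hypothetical and are not part of the survey data.

```r
library(ggplot2)

# Hypothetical toy data, used only to show how layers are added
toy <- data.frame(group = c("A", "B", "C"), value = c(3, 5, 2))

# Step 1: the plot frame and the aesthetic mapping
p <- ggplot(toy, aes(x = group, y = value))

# Step 2: add a geometry layer (bars whose heights are the values)
p <- p + geom_col(fill = "lightblue", color = "black")

# Step 3: add a theme and a title
p <- p + theme_classic() + ggtitle("Toy example")

print(p)
```

# Each + adds one component, so a plot can be built up and inspected one step at a time.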
#Pie chart using the ggplot2 package
pie1 = ggplot(ft, aes(x = "", y = Proportion, fill = Gender)) +
# geom_col is another type of barplot offered by ggplot2
geom_col(color = "black") +
#label the proportions onto plot
geom_text(aes(label = Proportion),
position = position_stack(vjust = 0.6), size = 5) +
#convert bar plot into pie chart
coord_polar(theta = "y") +
# use reds palette
scale_fill_brewer(palette = 'Reds') +
#use void theme to remove all axes from plot area
theme_void()+
# change legend text size
theme(legend.text = element_text(size = 12))+
# Add a title
ggtitle('Gender')
plot(pie1)
# Similarly, ggplot can make barplots from the same frequency table we created above:
#bar chart
BP1 = ggplot(data = ft, aes(x = Gender, y = Proportion, fill = Gender))+
#apply bar chart
geom_bar(stat = 'identity', color = 'black', size = 0.8)+
#use a base plotting theme - HC theme
theme_hc()+
#change the tick points of the y axis
scale_y_continuous(breaks = seq(0, 1, 0.1))+
#use reds color palette
scale_fill_brewer(palette = 'Reds')+
#set plot legend to top of plot
theme(legend.position = 'top')+
#the theme command controls various aesthetics related to the plot area, margin, and text
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14))+
# Add a title to the plot
ggtitle("Gender")
plot(BP1)
#======================================================================
#----------------Description of a quantitative variable----------------
#======================================================================
#For a description of a quantitative variable we will analyze the variable Studytime, which is the self-reported amount of time (in hours) that Georgia college students spend studying each week.
#We will start by making a simple stem plot. R has a built-in function called stem() which makes this process simple.
print(gss$Studytime)
## [1] 7.0 2.0 4.0 3.5 4.5 3.0 3.0 8.0 5.5 5.5 5.0 7.0 9.0 4.0 10.0
## [16] 3.0 3.0 3.5 4.5 3.5 2.0 6.0 6.0 4.0 3.0 4.0 10.0 4.5 7.0 5.0
## [31] 3.0 8.0 5.0 8.0 4.5 2.0 3.0 6.0 10.0 2.0 12.0 5.0 4.0 3.0 4.0
## [46] 4.0 4.0 5.0 15.0 4.5 4.0 4.0 5.0 5.0 6.0 3.0 3.0 2.0 2.3
#Looking at the values of the data, we can see that time is mostly recorded to the nearest half hour (the last value, 2.3, is the one exception).
#Stem plot
stem(gss$Studytime)
##
## The decimal point is at the |
##
## 2 | 0000030000000000555
## 4 | 000000000055555000000055
## 6 | 0000000
## 8 | 0000
## 10 | 000
## 12 | 0
## 14 | 0
#by default stem() will try to choose the "prettiest" way to make the stem chart using an internal way of estimating the number of "bins"
#to control this you can set the "scale" argument, which controls the length of the plot. We will remake the stem plot using a scale value of 2, so that each whole hour gets its own stem instead of two hours sharing a stem as in the previous plot
stem(gss$Studytime, scale = 2)
##
## The decimal point is at the |
##
## 2 | 000003
## 3 | 0000000000555
## 4 | 000000000055555
## 5 | 000000055
## 6 | 0000
## 7 | 000
## 8 | 000
## 9 | 0
## 10 | 000
## 11 |
## 12 | 0
## 13 |
## 14 |
## 15 | 0
#R base graphics does not have a good equivalent function for making dotplots...
#Dot plot using ggplot
ggplot(data = gss, aes(x = Studytime))+
geom_dotplot(dotsize = 0.5)+
theme_classic()+
#remove y label
ylab("")+
#set x label
xlab("Study Time (hrs)")+
#give plot a title and subtitle
ggtitle('Dotplot of Student Study Time (Hrs)', subtitle = 'Georgia Student Survey')+
#rescale x axis to show more tick marks
scale_x_continuous(limits = c(min(gss$Studytime), max(gss$Studytime)),
breaks = seq(min(gss$Studytime), max(gss$Studytime), 1))+
theme(axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14),
#remove y axis and background
panel.background = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.line.y = element_blank())
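# If you want to stay in base R, stripchart() with method = "stack" gives a rough approximation of a dotplot. This is a sketch only: it is an alternative to the ggplot2 version above, and the vector studytime is a copy of the Studytime values printed earlier so that the example runs on its own.

```r
# Copy of the 59 Studytime values printed above
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

# method = "stack" piles repeated values on top of each other
stripchart(studytime,
           method = "stack",
           pch = 16,
           xlab = "Study Time (hrs)",
           main = "Stacked strip chart of student study time")
```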
#Next we will demonstrate how to construct histograms using base R and ggplot2. Note that both will automatically determine the number of bins unless you specify it. The hist() function in R uses the Sturges method by default, which is a reasonable choice for selecting the number of bins that we discussed in class. You can use ?hist to find out more about the input arguments for this function.
#Histogram of studytime using Base R graphics
hist(gss$Studytime,
xlab = 'Study Time (hrs)',
ylab = 'Frequency',
main = 'College Student Weekly Time Spent Studying')
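# As a worked example of the Sturges method: it suggests ceiling(log2(n) + 1) bins for a sample of size n. The sketch below applies it to n = 59, the sample size reported by str(gss). The variable names are assumptions used for illustration.

```r
# Sample size of the survey, as reported by str(gss)
n <- 59

# Sturges' rule computed by hand
k_sturges <- ceiling(log2(n) + 1)
k_sturges

# nclass.Sturges() is the helper that computes the same quantity from a vector;
# only the length of the vector matters, so a placeholder vector is enough
nclass.Sturges(numeric(n))
```

# Both give 7. Note that hist() treats this number as a suggestion and then picks "pretty" break points, so the number of bars actually drawn can differ.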
#Note that geom_histogram will try to use 30 bins as its default. If you want to choose a custom number of bins or binwidths you can use additional arguments. Use ?geom_histogram() to see all arguments
#Histogram using ggplot2:
ggplot(data = gss, aes(x = Studytime))+
geom_histogram(color = 'black', fill = 'lightblue')+
theme_classic2()+
ylab("Frequency")+
xlab("Study Time (hrs)")+
theme(axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14))
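# Instead of a number of bins, you can also give geom_histogram() a bin width. A minimal sketch, assuming one-hour bins are wanted; the data frame study is a stand-in for gss that holds a copy of the Studytime values so the example runs on its own.

```r
library(ggplot2)

# Stand-in for gss: a copy of the 59 Studytime values printed earlier
study <- data.frame(Studytime = c(
  7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
  3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
  3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
  4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3))

# binwidth = 1 makes each bar one hour wide; boundary = 0 puts bin edges on whole hours
p_bw <- ggplot(data = study, aes(x = Studytime)) +
  geom_histogram(binwidth = 1, boundary = 0, color = "black", fill = "lightblue") +
  ylab("Frequency") +
  xlab("Study Time (hrs)")

print(p_bw)
```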
# Histogram using ggplot2 with the simple square-root rule for the number of bins
#find the number of observations
n = length(gss$Studytime)
k = ceiling(sqrt(n)) # rounded up square root of the sample size
ggplot(data = gss, aes(x = Studytime))+
geom_histogram(color = 'black', fill = 'lightblue', bins = k)+
theme_classic2()+
ylab("Frequency")+
xlab("Study Time (hrs)")+
theme(axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14))
#Next we demonstrate the boxplot using Base R graphics and ggplot2
#Boxplot using base R
boxplot(gss$Studytime,
# make the boxplot horizontal instead of vertical
horizontal = TRUE,
# set axis label
xlab = 'Study Time (Hrs)',
# set plot title
main = 'College Student Weekly Study Time')
#if you want to label the boxplot you can use the text() function (this function is actually applicable to all base R plots)
#find min, Q1, median, Q3, and max
minss = min(gss$Studytime)
Q1 = quantile(gss$Studytime, 0.25, type = 2)
medss = median(gss$Studytime)
Q3 = quantile(gss$Studytime, 0.75, type = 2)
maxss = max(gss$Studytime)
#now group all estimates into a vector
estimates = c(minss, Q1, medss, Q3, maxss)
boxplot(gss$Studytime,
# make the boxplot horizontal instead of vertical
horizontal = TRUE,
# set axis label
xlab = 'Study Time (Hrs)',
# set plot title
main = 'College Student Weekly Study Time')
# the estimates will serve as the x-coords on the plot
text(x=estimates,
# create the labels using the paste function
labels =paste(c("Min", "Q1","Median", "Q3", "Max"), estimates, sep = "="),
# set the y-positions of the labels - labels will be offset so they don't overlap
y=c(1.3,1.25,1.3,1.25,1.3), cex = 0.6)
#also add the IQR
text(x=medss, labels =paste("IQR", Q3 - Q1, sep = " = "), y=1.5, cex = 0.7)
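# The quantile() calls above use type = 2, whereas R's default is type = 7. The sketch below compares the two. For the Studytime data both definitions happen to give the same quartiles, but on a small made-up vector (toy, an assumption for illustration) they differ.

```r
# Copy of the 59 Studytime values printed earlier
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

# Quartiles under both definitions: Q1 = 3 and Q3 = 6 either way, so IQR = 3
q_type2 <- quantile(studytime, c(0.25, 0.75), type = 2)
q_type7 <- quantile(studytime, c(0.25, 0.75))  # default is type = 7
iqr_type2 <- unname(q_type2[2] - q_type2[1])

# Hypothetical toy vector where the two definitions disagree
toy <- 1:8
toy_q1_type2 <- quantile(toy, 0.25, type = 2)  # averages the 2nd and 3rd values: 2.5
toy_q1_type7 <- quantile(toy, 0.25, type = 7)  # linear interpolation: 2.75

print(rbind(type2 = q_type2, type7 = q_type7))
print(c(type2 = unname(toy_q1_type2), type7 = unname(toy_q1_type7)))
```

# See ?quantile for the full list of types; which one to use is a matter of convention, so it helps to state the type when reporting quartiles.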
#Now using ggplot2 - Note that adding labels on ggplot2 is a bit more complicated but it can be done using the geom_label or geom_text functions
minss = min(gss$Studytime)
Q1 = quantile(gss$Studytime, 0.25, type = 2)
medss = median(gss$Studytime)
Q3 = quantile(gss$Studytime, 0.75, type = 2)
maxss = max(gss$Studytime)
ggplot()+
geom_boxplot(aes(x = gss$Studytime),
fill = 'lightgrey',
#choose shape of outlier points
outlier.shape = 23,
#choose fill color
outlier.fill = 'green',
# choose size
outlier.size = 5)+
# add min and max values as points
geom_point(aes(x = c(minss, maxss), y = c(0, 0)),
shape = 21,
fill = 'red',
size = 4)+
# label the min, Q1, median, Q3, and max
geom_label(aes(x = minss, y = 0.05), label = paste('min =', minss, collapse = ''),
label.size = 0.3)+
geom_label(aes(x = maxss, y = 0.05), label = paste('max =', maxss, collapse = ''),
label.size = 0.3)+
geom_label(aes(x = medss, y = 0.45), label = paste('median =', medss, collapse = ''),
label.size = 0.3)+
geom_label(aes(x = Q1, y = 0.405), label = paste('Q1 =', Q1, collapse = ''),
label.size = 0.3)+
geom_label(aes(x = Q3, y = 0.405), label = paste('Q3 =', Q3, collapse = ''),
label.size = 0.3)+
geom_label(aes(x = medss, y = 0.495), label = paste('IQR =', Q3-Q1, collapse = ''),
label.size = 0.3)+
scale_x_continuous(limits = c(minss-0.5,
maxss+0.5),
n.breaks = 10)+
theme_classic()+
#remove y axis and change x-axis text size
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(size = 12),
axis.title.y = element_blank(),
axis.line.y = element_blank())+
xlab('Study Time (Hrs)')+
ggtitle('College Student Weekly Study Time')
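# The points drawn as outliers in the boxplots above follow the 1.5 x IQR convention, which is the default in both boxplot() and geom_boxplot(). A minimal sketch of that rule applied by hand, using type = 2 quartiles as in the code above (the plotting functions compute quartiles slightly differently, although for this data the results agree). The variable names are assumptions used for illustration.

```r
# Copy of the 59 Studytime values printed earlier
studytime <- c(7.0, 2.0, 4.0, 3.5, 4.5, 3.0, 3.0, 8.0, 5.5, 5.5, 5.0, 7.0, 9.0, 4.0, 10.0,
               3.0, 3.0, 3.5, 4.5, 3.5, 2.0, 6.0, 6.0, 4.0, 3.0, 4.0, 10.0, 4.5, 7.0, 5.0,
               3.0, 8.0, 5.0, 8.0, 4.5, 2.0, 3.0, 6.0, 10.0, 2.0, 12.0, 5.0, 4.0, 3.0, 4.0,
               4.0, 4.0, 5.0, 15.0, 4.5, 4.0, 4.0, 5.0, 5.0, 6.0, 3.0, 3.0, 2.0, 2.3)

# Quartiles and IQR
q1 <- unname(quantile(studytime, 0.25, type = 2))
q3 <- unname(quantile(studytime, 0.75, type = 2))
iqr <- q3 - q1

# Fences: anything beyond 1.5 IQRs from the quartiles is flagged
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences
outliers <- studytime[studytime < lower_fence | studytime > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

# Here the fences are -1.5 and 10.5, so the students reporting 12 and 15 hours are flagged, while the three students reporting 10 hours are not.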