R Coding Exercise

Purpose

The following is an R coding excerise on loading, viewing, processing, and visualizing data in R. We complete all analysis on the gapminder dataset.

All code chunks are echoed for ease of viewing.

The following chunk loads all libraries which will be used during this exercise.

## install packages and load required libraries

# install.packages("dslabs") #lines to install required packages are commented
library(dslabs) #load dslabs library

# install.packages("tidyverse") #install and load tidyverse
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("here")
library(here)
here() starts at /Users/chirst/Desktop/Git/corahirst-MADA-portfolio
# install.packages("gtsummary")
library(gtsummary)
#StandWithUkraine
# install.packages("flextable")
library(flextable)

Attaching package: 'flextable'

The following objects are masked from 'package:gtsummary':

    as_flextable, continuous_summary

The following object is masked from 'package:purrr':

    compose

Loading and checking the data

Below, we familiarize ourselves with the most recent dataset from the package gapminder.

# pull information on the gapminder package
help(gapminder) #open help file for gapminder dataset

# structure of the gapminder dataset?
str(gapminder) # what is the class of the gapminder dataset? how many variables? class of each variable?
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
# summarize the gapminder data
summary(gapminder) #gives summary statistics for each numeric variable and count information for each categorical variable
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  
# although the str() function gives the dataset class, we can check with the function class()
class(gapminder) #gives the class of the r object gapminder
[1] "data.frame"

The gapminder dataset is an object of class “data.frame”. Each row of the dataframe contains health and income outcomes for some year for some country. The dataframe contains 9 columns, for country name, region, and continent; the year of the observation; and 5 different demographic, health, and income outcomes for each observation. These outcomes consist of infant_mortality, life_expectancy, fertility, population, and gdp.

There are 10545 observations in total, gathered from 184 countries between the years of 1960 and 2016.

Processing the data

Here, we are going to select and process data on african countries from the gapminder dataset.m

# subset rows from dataset containing factor level 'Africa' from column "continent"
africadata = gapminder %>% filter(continent == "Africa") #dplyr's filter() function accomplishes this

# create life expectancy subsets of africadata containing 1) infant mortality and life expectancy and 2) population and life expectancy 
africa.IMvsLE = africadata %>% select("infant_mortality", "life_expectancy") #dplyrs select() function subsets columns
africa.PvsLE = africadata %>% select("population", "life_expectancy") 

# structure of the life expectancy subsets
str(africa.IMvsLE) #2907 observations, 2 columns (variables)
'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
str(africa.PvsLE) #2907 observations, 2 columns (variables)
'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
# summary of variables in life expectancy subsets
summary(africa.IMvsLE) #summary stats of both columns containing numerical values
 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226                     
summary(africa.PvsLE) #summary stats of both columns contain numerical values
   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51                         

The code chunk above isolates the 2907 observations from african countries only within the gapminder dataset and stores this subset in the “data.frame” object, africandata. Two subsets of the africandata data frame are generated. The first, africa.IMvsLE, contains two columns showing the values of variables infant_mortality and life_expectancy for each observation. The second, africa.PvsLE, contains two columns showing the values of variables population and life_expectancy for each observation.

Plotting

The following code chunk generates two plots, visualizing the association of either infant_mortality or population with life_expectancy across Africa between 1960-2016.

#life_expectancy vs infant_mortality 

plot1 = ggplot() + geom_point(data = africa.IMvsLE, aes(x = infant_mortality, y = life_expectancy)) + 
  labs(x = "infant mortality", y = "life expectancy", title = "correlation between life expectancy and infant mortality in Africa, 1960 - 2016")

plot2 = ggplot() + geom_point(data = africa.PvsLE, aes(x = population, y = life_expectancy)) + 
  scale_x_continuous(trans = "log") +
  labs(x = "population", y = "life expectancy", title = "correlation between life expectancy and population size, 1960-2016")

figure_file = here("coding-exercise","results", "figures","life_exp_inf_mort_all.png")
ggsave(filename = figure_file, plot=plot1) 
Saving 7 x 5 in image
Warning: Removed 226 rows containing missing values (`geom_point()`).
figure_file = here("coding-exercise","results", "figures","life_exp_pop_all.png")
ggsave(filename = figure_file, plot=plot2) 
Saving 7 x 5 in image
Warning: Removed 51 rows containing missing values (`geom_point()`).

Figure 1: life expectancy by infant mortality, Africa, 1960-2016.

Figure 2: life expectancy by population size, Africa, 1960-2016.

Obviously, there appear to be clusters of data associated with the variables infant_mortality and population. These clusters may be the result of fixed effects from another associated variable, such as year.

The code chunk below regenerates these figures, with points colored by the year during which the observations were reported. Notice that I am now using the africadata subset to generate these figures, as this dataset also contains the year variable.

## life_expectancy vs infant_mortality stratified by year
plot3 = ggplot() + geom_point(data = africadata, aes(x = infant_mortality, y = life_expectancy, col = as.factor(year))) + 
  labs(x = "infant mortality", y = "life expectancy", title = "correlation between life expectancy and infant mortality in Africa, 1960 - 2016")

plot4 = ggplot() + geom_point(data = africadata, aes(x = population, y = life_expectancy, col = as.factor(year))) + 
  scale_x_continuous(trans = "log") +
  labs(x = "population", y = "life expectancy", title = "correlation between life expectancy and population size, 1960-2016")

figure_file = here("coding-exercise","results", "figures","life_exp_inf_mort_byyears.png")
ggsave(filename = figure_file, plot=plot3) 
Saving 7 x 5 in image
Warning: Removed 226 rows containing missing values (`geom_point()`).
figure_file = here("coding-exercise","results", "figures","life_exp_pop_byyears.png")
ggsave(filename = figure_file, plot=plot4) 
Saving 7 x 5 in image
Warning: Removed 51 rows containing missing values (`geom_point()`).

Figure 3: life expectancy by infant mortality, Africa, by year.

Figure 4: life expectancy by population size, Africa, by year.

Here, we see that “lines” in our data demonstrate how population size and infant mortality each correlate with year. However, this is visually difficult to disentangle. What we can see, particularly from our second plot, is that life expectancy appears somewhat constant across a range of population sizes within the same year.

Let’s recreate these plots, but only choose observations from a single year.

More processing

The infant_mortality variable within africandata contains many NAs. The code below determines which years contain missing (NA) data for infant mortality.

## determine which years contain missing infant mortality data
missing_IM_years = levels(factor(africadata[is.na(africadata$infant_mortality), "year"])) # which levels of factor "year" contain missing data for vairable "infant mortality"?

print(paste("Infant mortality data is missing from the years", paste(missing_IM_years, collapse = " "))) #print these years
[1] "Infant mortality data is missing from the years 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 2016"

Infant mortality data is missing from observations between 1960-1981, and again from 2016. Between 1981 and 2016, then, every observation from each year contains infant mortality data. So, we will plot life_expectancy against infant_mortality from observations taken during a year within that time frame, say, the year 2000.

More plotting

The variable population is not missing data from the year 2000, so we can also plot life expectancy against population for observations taken in 2000, as well.

## extract observations from the year 2000 only and save in a new dataframe
africadata_2000 = africadata %>% filter(year == 2000)

## check 
str(africadata_2000) #all variables from africadata included
'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata_2000) #and each variable is of the same class as corresponding africadata variables
         country        year      infant_mortality life_expectancy
 Algeria     : 1   Min.   :2000   Min.   : 12.30   Min.   :37.60  
 Angola      : 1   1st Qu.:2000   1st Qu.: 60.80   1st Qu.:51.75  
 Benin       : 1   Median :2000   Median : 80.30   Median :54.30  
 Botswana    : 1   Mean   :2000   Mean   : 78.93   Mean   :56.36  
 Burkina Faso: 1   3rd Qu.:2000   3rd Qu.:103.30   3rd Qu.:60.00  
 Burundi     : 1   Max.   :2000   Max.   :143.30   Max.   :75.00  
 (Other)     :45                                                  
   fertility       population             gdp               continent 
 Min.   :1.990   Min.   :    81154   Min.   :2.019e+08   Africa  :51  
 1st Qu.:4.150   1st Qu.:  2304687   1st Qu.:1.274e+09   Americas: 0  
 Median :5.550   Median :  8799165   Median :3.238e+09   Asia    : 0  
 Mean   :5.156   Mean   : 15659800   Mean   :1.155e+10   Europe  : 0  
 3rd Qu.:5.960   3rd Qu.: 17391242   3rd Qu.:8.654e+09   Oceania : 0  
 Max.   :7.730   Max.   :122876723   Max.   :1.329e+11                
                                                                      
                       region  
 Eastern Africa           :16  
 Western Africa           :16  
 Middle Africa            : 8  
 Northern Africa          : 6  
 Southern Africa          : 5  
 Australia and New Zealand: 0  
 (Other)                  : 0  
## plot
## life_expectancy vs infant_mortality in the year 2000
plot5 = ggplot() + geom_point(data = africadata_2000, aes(x = infant_mortality, y = life_expectancy)) + 
  geom_smooth(data = africadata_2000, aes(x = infant_mortality, y = life_expectancy), method = 'lm') +
  labs(x = "infant mortality", y = "life expectancy", title = "correlation between life expectancy and infant mortality in Africa, year 2000")

plot6 = ggplot() + geom_point(data = africadata_2000, aes(x = population, y = life_expectancy)) + 
  scale_x_continuous(trans = "log") +
  geom_smooth(data = africadata_2000, aes(x = population, y = life_expectancy), method = 'lm') +
  labs(x = "population", y = "life expectancy", title = "correlation between life expectancy and population size, year 2000")

figure_file = here("coding-exercise","results", "figures","life_exp_inf_mort_2000.png")
ggsave(filename = figure_file, plot=plot5) 
Saving 7 x 5 in image
`geom_smooth()` using formula = 'y ~ x'
figure_file = here("coding-exercise","results", "figures","life_exp_pop_2000.png")
ggsave(filename = figure_file, plot=plot6) 
Saving 7 x 5 in image
`geom_smooth()` using formula = 'y ~ x'

Figure 5: life expectancy by infant mortality, Africa 2000.

Figure 6: life expectancy by population size, Africa 2000.

Are these correlations (linear regressions) significant?

Simple model fits

The following code chunk utilizes the lm() function in R to determine the best-fit, simple linear regression model with life_expectancy in the year 2000 as the outcome variable and either infant_mortality or population as the predictor.

# linear regression, fixed effect, life expectancy ~ infant_mortality
fit1 = lm(data = africadata_2000, life_expectancy ~ infant_mortality) #results saved in lm object `fit1`

# linear regression, fixed effect, life expectancy ~ infant_mortality
fit2 = lm(data = africadata_2000, life_expectancy ~ population) #results saved in lm object `fit2`

# print summary of lm objects fit1 and fit2
summary(fit1)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08
summary(fit2)

Call:
lm(formula = life_expectancy ~ population, data = africadata_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159
table1 = fit1 %>% tbl_regression() #save summaries as tables using tbl_regression() function from flextable package
table2 = fit2 %>% tbl_regression() #save summaries as tables using tbl_regression() function from flextable package

# save summary tables
summarytable_file = here("coding-exercise","results", "tables-files", "lm_life_exp_inf_mort.rds")
saveRDS(table1, file = summarytable_file)

summarytable_file = here("coding-exercise","results", "tables-files", "lm_life_exp_pop.rds")
saveRDS(table2, file = summarytable_file)

# extract p-values from summaries to display on figures
p.infant_mortality = unname(summary(fit1)$coefficients[,"Pr(>|t|)"])[2]
p.population = unname(summary(fit2)$coefficients[,"Pr(>|t|)"])[2]
Table 1: Linear regression model of life expectancy by infant mortality.
Characteristic Beta 95% CI p-value
infant_mortality -0.19 -0.25, -0.13 <0.001
Table 2: Linear regression model of life expectancy by population size.
Characteristic Beta 95% CI p-value
population 0.00 0.00, 0.00 0.6
## replot with pvals 
## life_expectancy vs infant_mortality in the year 2000
plot7 = ggplot() + geom_point(data = africadata_2000, aes(x = infant_mortality, y = life_expectancy)) + 
  geom_smooth(data = africadata_2000, aes(x = infant_mortality, y = life_expectancy), method = 'lm') +
  annotate(geom = "text", x = 125, y = 70, label = paste("p-value =", signif(p.infant_mortality, digits = 3)), color = "forestgreen") +
  labs(x = "infant mortality", y = "life expectancy", title = "correlation between life expectancy and infant mortality in Africa, year 2000")

plot8 = ggplot() + geom_point(data = africadata_2000, aes(x = population, y = life_expectancy)) + 
  scale_x_continuous(trans = "log") +
  geom_smooth(data = africadata_2000, aes(x = population, y = life_expectancy), method = 'lm') +
  annotate(geom = "text", x = 163000, y = 40, label = paste("p-value =", signif(p.population, digits = 3)), color = "maroon") +
  labs(x = "population", y = "life expectancy", title = "correlation between life expectancy and population size, year 2000")

figure_file = here("coding-exercise","results", "figures","life_exp_inf_mort_reg.png")
ggsave(filename = figure_file, plot=plot7) 
Saving 7 x 5 in image
`geom_smooth()` using formula = 'y ~ x'
figure_file = here("coding-exercise","results", "figures","life_exp_pop_reg.png")
ggsave(filename = figure_file, plot=plot8) 
Saving 7 x 5 in image
`geom_smooth()` using formula = 'y ~ x'

Figure 7: linear regression of life expectancy by infant mortality, Africa 2000.

Figure 8: linear regression of life expectancy by population size, Africa 2000.

Life expectancy appears to decrease with infant mortality, while decreasing only slightly with population size (Figure 7, Figure 8). Using a definition of \(pvalue < 0.05\) as significance, it is evident that population size is a poor predictor of life expectancy (\(p = 0.6 > 0.05\)). However, infant mortality is in fact a better, significant predictor of life expectancy (\(p = 2 \times 10^{-8}\)) (Table 1, Table 2). Thus, we fail to reject the null hypothesis that life expectancy is not correlated with population size, but do reject the null hypothesis that life expectancy is not correlated with infant mortality rates.

This section is contributed by Taylor Glass.

Exploring the dataset

I chose to explore the ‘movie’ dataset in the dslabs package. I used the help() function to find the description of the dataset and an explanation of the 7 variables it includes. I used the str() function to determine there are 100,004 observations of 7 variables. There are 4 integer variables, 1 character variable, 1 numeric variable and one factor variable with 901 levels. I used the summary() function to examine the range of each numerical variable. The year the movie was released ranges from 1902 to 2016. There are 6 distinct genres along with an ‘other’ option. The mean rating for all 100,004 movies included in the dataset is 3.544.

help("movielens") ##learn about the dataset
str(movielens) ##determine dimensions of dataset
'data.frame':   100004 obs. of  7 variables:
 $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
 $ title    : chr  "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
 $ year     : int  1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
 $ genres   : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
 $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
 $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
summary(movielens) ##explore ranges for numerical variables
    movieId          title                year     
 Min.   :     1   Length:100004      Min.   :1902  
 1st Qu.:  1028   Class :character   1st Qu.:1987  
 Median :  2406   Mode  :character   Median :1995  
 Mean   : 12549                      Mean   :1992  
 3rd Qu.:  5418                      3rd Qu.:2001  
 Max.   :163949                      Max.   :2016  
                                     NA's   :7     
                  genres          userId        rating        timestamp        
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08  
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08  
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09  
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09  
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09  
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09  
 (Other)             :71588                                                    

Processing/cleaning data

I decided to create a subset of this data with movie ratings from just the year that I was born, 2001. I used the filter function to create a new object called ‘mymovies’ with 3442 observations of 7 variables. I decided to explore the ratings among different genres of movies during 2001. I created a boxplot using genres as the categorical variable, which was very messy and not helpful. I decided to filter the datset further by only including movies in the drama and comedy genres, which limited the dataset to 319 observations of 7 variables. I selected the title, genres, and rating columns because they are the only variables useful for this analysis. When I created the second boxplot with the final dataset called ‘mymovies3’, it became easy to view the spread in ratings for drama and comedy movies released in 2001. The average rating for comedy movies is a 3, while the average rating for drama movies is a 4.

mymovies <- movielens %>% 
              filter(year == 2001) ##limit dataset to movies released in 2001
str(mymovies)
'data.frame':   3442 obs. of  7 variables:
 $ movieId  : int  4306 4308 4447 4718 4963 4995 4641 4975 4448 4886 ...
 $ title    : chr  "Shrek" "Moulin Rouge" "Legally Blonde" "American Pie 2" ...
 $ year     : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
 $ genres   : Factor w/ 901 levels "(no genres listed)",..: 332 808 687 596 747 820 624 883 246 329 ...
 $ userId   : int  5 5 5 5 5 5 6 6 8 8 ...
 $ rating   : num  3.5 3.5 4.5 3.5 3 4.5 1.5 1.5 2.5 3.5 ...
 $ timestamp: int  1163374103 1163374354 1163375033 1163375131 1163373548 1163373616 1109258217 1109258226 1154400334 1154464808 ...
mymovies %>% ggplot(aes(x=genres,  y=rating)) + ##use genres as the categorical variable and rating as the numerical variable
              geom_boxplot()   ##create exploratory boxplot 

mymovies2 <- mymovies %>% 
              filter(year == 2001 & genres %in% c("Drama", "Comedy")) ##further filter dataset to include 2 genres
str(mymovies2)
'data.frame':   319 obs. of  7 variables:
 $ movieId  : int  4718 4718 4225 4381 4388 4718 4723 4728 4816 4888 ...
 $ title    : chr  "American Pie 2" "American Pie 2" "Dish, The" "Closet, The (Placard, Le)" ...
 $ year     : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
 $ genres   : Factor w/ 901 levels "(no genres listed)",..: 596 596 596 596 596 596 762 596 596 762 ...
 $ userId   : int  5 13 15 15 15 15 15 15 15 15 ...
 $ rating   : num  3.5 3.5 1 2 3 4 3.5 4 3 3 ...
 $ timestamp: int  1163375131 1331379530 997937762 997937346 1416119736 1044348492 1054449465 1044348492 1166586875 1054449355 ...
mymovies3 <- mymovies2 %>%
              select("title", "genres", "rating") ##select 3 columns of useful information 
str(mymovies3)
'data.frame':   319 obs. of  3 variables:
 $ title : chr  "American Pie 2" "American Pie 2" "Dish, The" "Closet, The (Placard, Le)" ...
 $ genres: Factor w/ 901 levels "(no genres listed)",..: 596 596 596 596 596 596 762 596 596 762 ...
 $ rating: num  3.5 3.5 1 2 3 4 3.5 4 3 3 ...

Create a few exploratory figures

I created the second boxplot with the final dataset called ‘mymovies3’, and it became easy to view the spread in ratings for drama and comedy movies released in 2001. I added a title for the boxplot and centered it to make it easier to read. The average rating for comedy movies is a 3, while the average rating for drama movies is a 4. I practiced saving this image to the folder previously created by Cora using the here() and ggsave() functions. To create another visualization of this data, I used the mean ratings from each category in a bar graph. The means of the ratings for each genre are displayed with better precision in this visualization. The mean rating for comedy movies is a little less than 3 at about 2.9 while the average rating for drama movies seems lower at about 3.5. This discrepancy could be explained by the 2 outliers shown in the drama category on the boxplot because these observations are skewing the mean. I saved this graph to the same folder with the boxplot using the here() and ggsave() functions. I also created a table of my final dataset to include the movie titles of both genres with each of their ratings from reviewers. It is a little busy since there are multiple reviews for the some of the same movies, so I organized it by title to allow for quick review of all the ratings provided for each movie. The first table has been commented out, so it will not print on this rendered website. It is included to show the process of creating the better table.

boxplot <- mymovies3 %>% ggplot(aes(x=genres,  y=rating)) +   ##visualize the ratings for each genre of movie 
                geom_boxplot() +    ##create clean boxplot to compare ratings across genres in 2001
                labs(title = "Ratings for Drama and Comedy Movies in 2001 ") +  ##add a title to the graph
                theme(plot.title = element_text(hjust = 0.5)) ##center the title 

figure_file <- here("coding-exercise", "results", "figures", "movies_boxplot.png") ##inform R on where to save the image
ggsave(filename = figure_file, plot = boxplot) ##save the image in the predefined location 
Saving 7 x 5 in image
## ensure the figure renders on the website
#| label: fig-inf-mort-all.  ##create a label for the final figure
#| fig-cap: "life expectancy by infant mortality, Africa, 1960-2016."  ##create a label for the final figure
#| echo: FALSE  ##ensure these lines are not reproduced in the rendered version
knitr::include_graphics(here("coding-exercise","results","figures","movies_boxplot.png")) ##tag the saved location

bargraph <- mymovies3 %>% ggplot(aes(x=genres, y=rating)) + ##visualize the ratings for each genre of movie
                geom_bar(stat = "summary", fun = "mean") +  ##use the mean of each rating for the bars
                labs(title = "Ratings for Drama and Comedy Movies in 2001") + ##create a title for the graph
                theme(plot.title = element_text(hjust = 0.5)) ##center the title

figure_file2 <- here("coding-exercise", "results", "figures", "movies_bargraph.png") ##inform R on where to save the image
ggsave(filename = figure_file2, plot = bargraph) ##save the image in the predefined location
Saving 7 x 5 in image
## ensure the figure renders on the website
#| label: fig-inf-mort-all  ##create a label for the final figure
#| fig-cap: "life expectancy by infant mortality, Africa, 1960-2016."  ##create a label for the final figure
#| echo: FALSE  ##ensure these lines are not reproduced in the rendered version
knitr::include_graphics(here("coding-exercise","results","figures","movies_bargraph.png")) ##tag the saved location

## knitr::kable(mymovies3) ##create a table to display movie titles with their ratings
## comment out the first table so it will not show on the rendered website
better_table <- arrange(mymovies3, title) ##organize the table by title 
knitr::kable(better_table) ##view new table with all ratings for each movie grouped together by title
title genres rating
61* Drama 3.0
61* Drama 4.0
Ali Drama 3.5
Ali Drama 3.0
Ali Drama 3.0
Ali Drama 4.0
Ali Drama 2.5
Ali Drama 2.5
All Over the Guy Comedy 4.0
All Over the Guy Comedy 4.0
All Over the Guy Comedy 5.0
American Pie 2 Comedy 3.5
American Pie 2 Comedy 3.5
American Pie 2 Comedy 4.0
American Pie 2 Comedy 4.5
American Pie 2 Comedy 4.0
American Pie 2 Comedy 3.0
American Pie 2 Comedy 4.5
American Pie 2 Comedy 4.0
American Pie 2 Comedy 4.0
American Pie 2 Comedy 1.5
American Pie 2 Comedy 4.0
American Pie 2 Comedy 2.5
American Pie 2 Comedy 2.5
American Pie 2 Comedy 2.5
American Pie 2 Comedy 2.0
American Pie 2 Comedy 2.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 2.0
American Pie 2 Comedy 3.0
American Pie 2 Comedy 3.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 3.0
American Pie 2 Comedy 0.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 1.0
American Pie 2 Comedy 1.5
American Pie 2 Comedy 3.5
American Pie 2 Comedy 2.0
American Pie 2 Comedy 1.0
American Pie 2 Comedy 3.5
American Pie 2 Comedy 4.0
American Pie 2 Comedy 2.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 2.5
American Pie 2 Comedy 3.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 0.5
American Pie 2 Comedy 3.0
American Pie 2 Comedy 4.0
American Rhapsody, An Drama 2.0
American Rhapsody, An Drama 3.5
Animal, The Comedy 4.0
Animal, The Comedy 2.0
Animal, The Comedy 3.0
Animal, The Comedy 2.5
Animal, The Comedy 2.0
Animal, The Comedy 1.5
Animal, The Comedy 3.0
Animal, The Comedy 2.0
Anniversary Party, The Drama 4.0
Anniversary Party, The Drama 2.0
Anniversary Party, The Drama 3.0
Anniversary Party, The Drama 2.0
Anniversary Party, The Drama 4.5
Anniversary Party, The Drama 4.0
Bad Guy (Nabbeun namja) Drama 2.5
Believer, The Drama 4.0
Believer, The Drama 4.5
Believer, The Drama 4.5
Believer, The Drama 4.0
Blow Dry (a.k.a. Never Better) Comedy 4.5
Bubble Boy Comedy 2.5
Bubble Boy Comedy 4.0
Bubble Boy Comedy 1.5
Chelsea Walls Drama 2.0
Chelsea Walls Drama 4.0
Closet, The (Placard, Le) Comedy 2.0
Closet, The (Placard, Le) Comedy 4.0
Curse of the Jade Scorpion, The Comedy 4.0
Curse of the Jade Scorpion, The Comedy 4.0
Deep End, The Drama 3.5
Deep End, The Drama 4.0
Deep End, The Drama 4.0
Deep End, The Drama 4.0
Deep End, The Drama 4.0
Diamond Men Drama 3.5
Dish, The Comedy 1.0
Dish, The Comedy 4.0
Dish, The Comedy 4.0
Dish, The Comedy 4.0
Dish, The Comedy 4.5
Dish, The Comedy 4.5
Dish, The Comedy 4.0
Dish, The Comedy 2.0
Dish, The Comedy 3.0
Dish, The Comedy 4.0
Dish, The Comedy 4.0
Dr. Dolittle 2 Comedy 0.5
Dr. Dolittle 2 Comedy 0.5
Dr. Dolittle 2 Comedy 3.0
Dr. Dolittle 2 Comedy 4.0
Emperor’s New Clothes, The Comedy 4.5
Freddy Got Fingered Comedy 0.5
Freddy Got Fingered Comedy 1.5
Freddy Got Fingered Comedy 3.0
Freddy Got Fingered Comedy 2.0
Freddy Got Fingered Comedy 1.0
Freddy Got Fingered Comedy 3.5
Freddy Got Fingered Comedy 0.5
Friends and Family Comedy 5.0
Grey Zone, The Drama 3.5
Grey Zone, The Drama 4.0
Hardball Drama 4.0
Hearts in Atlantis Drama 4.0
Hearts in Atlantis Drama 3.0
How High Comedy 3.0
How High Comedy 2.0
How I Killed My Father (a.k.a. My Father and I) (Comment j’ai tué mon Père) Drama 1.0
Hush! Drama 4.0
I Am Sam Drama 1.0
I Am Sam Drama 4.5
I Am Sam Drama 4.0
I Am Sam Drama 3.5
I Am Sam Drama 3.5
I Am Sam Drama 3.0
I Am Sam Drama 4.5
I Am Sam Drama 3.5
I Am Sam Drama 4.5
I Am Sam Drama 2.5
I Am Sam Drama 2.5
I Am Sam Drama 3.5
I Am Sam Drama 3.0
I Am Sam Drama 4.5
In the Bedroom Drama 5.0
In the Bedroom Drama 3.0
In the Bedroom Drama 4.0
In the Bedroom Drama 4.0
In the Bedroom Drama 3.0
In the Bedroom Drama 5.0
In the Bedroom Drama 4.0
In the Bedroom Drama 5.0
In the Bedroom Drama 4.0
In the Bedroom Drama 4.0
Invincible Drama 1.5
Iris Drama 3.0
Iris Drama 3.0
Josie and the Pussycats Comedy 1.0
Josie and the Pussycats Comedy 4.0
Kandahar (Safar e Ghandehar) Drama 3.0
Kandahar (Safar e Ghandehar) Drama 4.0
Kandahar (Safar e Ghandehar) Drama 3.0
L.I.E. Drama 3.0
L.I.E. Drama 4.0
L.I.E. Drama 5.0
L.I.E. Drama 2.0
Lammbock Comedy 3.5
Life as a House Drama 4.5
Life as a House Drama 2.0
Life as a House Drama 5.0
Life as a House Drama 4.0
Life as a House Drama 4.0
Life as a House Drama 3.5
Life as a House Drama 3.5
Life as a House Drama 4.0
Lost and Delirious Drama 4.0
Lost and Delirious Drama 3.0
Made Comedy 2.0
Made Comedy 2.0
Made Comedy 3.0
Made Comedy 2.0
Made Comedy 4.0
Manic Drama 2.0
Not Another Teen Movie Comedy 2.0
Not Another Teen Movie Comedy 3.0
Not Another Teen Movie Comedy 3.0
Not Another Teen Movie Comedy 5.0
Not Another Teen Movie Comedy 3.5
Not Another Teen Movie Comedy 3.0
Not Another Teen Movie Comedy 2.5
Not Another Teen Movie Comedy 4.0
Not Another Teen Movie Comedy 3.0
Not Another Teen Movie Comedy 2.5
Not Another Teen Movie Comedy 1.0
Not Another Teen Movie Comedy 2.5
Not Another Teen Movie Comedy 3.0
Nowhere in Africa (Nirgendwo in Afrika) Drama 4.5
Nowhere in Africa (Nirgendwo in Afrika) Drama 4.5
Nowhere in Africa (Nirgendwo in Afrika) Drama 1.0
Nowhere in Africa (Nirgendwo in Afrika) Drama 4.0
Nowhere in Africa (Nirgendwo in Afrika) Drama 4.5
O Drama 3.0
O Drama 1.0
O Drama 0.5
On the Edge Drama 5.0
One Night at McCool’s Comedy 4.0
One Night at McCool’s Comedy 3.0
One Night at McCool’s Comedy 3.0
Piano Teacher, The (La pianiste) Drama 4.5
Piano Teacher, The (La pianiste) Drama 1.0
Piano Teacher, The (La pianiste) Drama 4.5
Piano Teacher, The (La pianiste) Drama 4.0
Piano Teacher, The (La pianiste) Drama 3.0
Piano Teacher, The (La pianiste) Drama 5.0
Piano Teacher, The (La pianiste) Drama 5.0
Piñero Drama 1.5
Piñero Drama 4.0
Rat Race Comedy 4.0
Rat Race Comedy 3.0
Rat Race Comedy 3.0
Rat Race Comedy 1.5
Rat Race Comedy 5.0
Rat Race Comedy 1.5
Rat Race Comedy 3.0
Rat Race Comedy 2.5
Scary Movie 2 Comedy 3.0
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 5.0
Scary Movie 2 Comedy 4.0
Scary Movie 2 Comedy 1.0
Scary Movie 2 Comedy 1.5
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 3.0
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 3.5
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 1.0
Scary Movie 2 Comedy 2.5
Scary Movie 2 Comedy 2.5
Scary Movie 2 Comedy 0.5
Scary Movie 2 Comedy 2.0
Scary Movie 2 Comedy 2.0
See Spot Run Comedy 1.0
Shipping News, The Drama 1.0
Shipping News, The Drama 3.0
Shipping News, The Drama 3.0
Shipping News, The Drama 2.0
Son’s Room, The (Stanza del figlio, La) Drama 3.0
Son’s Room, The (Stanza del figlio, La) Drama 4.0
Sugar & Spice Comedy 3.5
Sugar & Spice Comedy 3.0
Taking Sides Drama 2.0
Tape Drama 3.0
Tape Drama 4.0
Tape Drama 3.0
Tape Drama 4.5
The Last Brickmaker in America Drama 5.0
Things Behind the Sun Drama 4.0
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 2.5
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 4.0
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 4.5
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 5.0
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 4.5
Thirteen Conversations About One Thing (a.k.a. 13 Conversations) Drama 3.5
To the Left of the Father (Lavoura Arcaica) Drama 5.0
Tomcats Comedy 2.5
Town & Country Comedy 2.0
Wash, The Comedy 2.5
Waterboys Comedy 4.0
Wet Hot American Summer Comedy 4.0
Wet Hot American Summer Comedy 4.0
Wet Hot American Summer Comedy 2.5
Wet Hot American Summer Comedy 2.0
Wet Hot American Summer Comedy 3.0
Wet Hot American Summer Comedy 1.0
What Time Is It There? (Ni neibian jidian) Drama 5.0
What Time Is It There? (Ni neibian jidian) Drama 4.0
White Sound, The (Das weiße Rauschen) Drama 5.0
Wit Drama 3.0
Wit Drama 4.0
Wit Drama 3.5
Zoolander Comedy 3.0
Zoolander Comedy 3.0
Zoolander Comedy 2.0
Zoolander Comedy 2.0
Zoolander Comedy 4.0
Zoolander Comedy 0.5
Zoolander Comedy 4.0
Zoolander Comedy 4.0
Zoolander Comedy 4.0
Zoolander Comedy 4.5
Zoolander Comedy 0.5
Zoolander Comedy 2.0
Zoolander Comedy 2.5
Zoolander Comedy 4.0
Zoolander Comedy 4.0
Zoolander Comedy 3.5
Zoolander Comedy 4.0
Zoolander Comedy 3.5
Zoolander Comedy 3.0
Zoolander Comedy 2.0
Zoolander Comedy 1.0
Zoolander Comedy 5.0
Zoolander Comedy 3.0
Zoolander Comedy 3.5
Zoolander Comedy 3.0
Zoolander Comedy 2.5
Zoolander Comedy 4.0
Zoolander Comedy 2.0
Zoolander Comedy 2.0
Zoolander Comedy 5.0
Zoolander Comedy 4.0
Zoolander Comedy 3.0
Zoolander Comedy 4.5
Zoolander Comedy 5.0
Zoolander Comedy 4.0
Zoolander Comedy 3.5
Zoolander Comedy 4.5
Zoolander Comedy 2.0
Zoolander Comedy 4.0
Zoolander Comedy 3.5
Zoolander Comedy 4.5
Zoolander Comedy 3.0
Zoolander Comedy 1.5
Zoolander Comedy 3.5
Zoolander Comedy 3.0
Zoolander Comedy 4.0

Simple statistical model

I decided to conduct a simple t test to determine if the mean ratings among movies in each genre are statistically different from one another. I created an object called ‘drama’ to pull the ratings for movies in the drama genre from the dataframe. I did the same thing for movies in the comedy genre and saved that object as ‘comedy’. I used the base R t.test() function to compare the average ratings between the two groups and printed the results.

drama <- mymovies3$rating[mymovies3$genres == "Drama"] ##create an object with ratings only for movies in drama genre
comedy <- mymovies3$rating[mymovies3$genres == "Comedy"] ##create an object with ratings only for movies in comedy genre
t.test1 <- t.test(drama, comedy) ##run a two-sample t-test between the two numeric vectors
print(t.test1) ##show the results of the t-test

    Welch Two Sample t-test

data:  drama and comedy
t = 4.6912, df = 279.12, p-value = 4.257e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3433740 0.8398837
sample estimates:
mean of x mean of y 
 3.504000  2.912371 

Results of simple statistical model

After running the two-sample t-test, I get a test statistic of 4.6912 with a tiny p-value of 4.257e-06. Since the p-value is much less than 0.05, we can reject the null hypothesis to conclude that there is a statistically significant difference between the mean ratings of drama and comedy movies in 2001. It is clear that movies in the drama genre received higher ratings from viewers compared to movies in the comedy genre in 2001.