Chapter 3 R Programming

3.1 What is R?

The R statistical programming language is a free, open source package based on the S language developed by John Chambers.

3.1.1 Some history of R and S

S was further developed into R by Robert Gentlemen (Canada) and Ross Ihaka (New Zealand)

R Inventors

Figure 3.1: R Inventors

Source: Nick Thieme. 2018. R Generation: 25 years of R

3.1.2 It is:

  • Large, probably one of the largest based on the user-written add-ons/procedures
  • Object-oriented
  • Interactive
  • Multiplatform: Windows, Mac, Linux

According to John Chambers (2009), six facets of R:

  1. an interface to computational procedures of many kinds;
  2. interactive, hands-on in real time;
  3. functional in its model of programming;
  4. object-oriented, “everything is an object”;
  5. modular, built from standardized pieces; and,
  6. collaborative, a world-wide, open-source effort.
Prominent R Developers

Figure 3.2: Prominent R Developers

Source: Nick Thieme. 2018. R Generation: 25 years of R

3.2 Why R?

  • A programming platform environment
  • Allow development of software/packages by users
  • Currently, the CRAN package repository features over 14,000 available packages (as of May, 2019).
  • Graphics!!!
  • Scaleble and Portable
  • Interface with other platform/langauges (e.g. C++, Python, JavaScript, Stan, SQL)
  • Comparing R with other software?
R Compared with other statistical programs/platforms

Figure 3.3: R Compared with other statistical programs/platforms

Source: Oscar Torres-Reyna. 2010. Getting Started in R~Stata Notes on Exploring Data

3.3 RStudio

RStudio is a user interface for the statistical programming software R.

  • Object-based environment
  • Window system
  • Point and click operations
  • Coding recommended
  • Expansions and development
  • a multi-functional Integrated Development Environment (IDE)
RStudio screenshot

Figure 3.4: RStudio screenshot

3.4 Basic operations and object assignment

Arithmetic Operations:

+, -, *, /, ^ are the standard arithmetic operators.

Assignment

To assign a value to a variable use “<-” or “=”:

## Introduction to R sample program 
## file: introR02.R
## Adapted from Venables, W.N., Smith, D.M. and Team, R.C., 2018. An Introduction to R, Version 3.5.1 (2018-07-02)


# Clear any existing objects 
rm(list = ls())

# Generate x, y and w to demontrate linear models and plots.
# Make x = (1,2,...,20).

x <- 1:20

# Create A ‘weight’ vector of standard deviations.

w <- 1 + sqrt(x)/2

# Create a data frame of two columns, x and y.

dummy <- data.frame(x=x, y= x + rnorm(x)*w)

# Fit a simple linear regression 
# With y to the left of the tilde then x, meaning y being dependent on x.
# Unlike other statistical packages, R does not display all output.  It is recommended
# to create an object to store the estimates.

fm <- lm(y ~ x, data=dummy) 

# Display the summary of the output of model fm.

summary(fm)
## 
## Call:
## lm(formula = y ~ x, data = dummy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4741 -2.0033  0.5005  1.9943  5.5000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5724     1.4591  -1.078    0.295    
## x             1.0912     0.1218   8.959 4.72e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.141 on 18 degrees of freedom
## Multiple R-squared:  0.8168, Adjusted R-squared:  0.8066 
## F-statistic: 80.26 on 1 and 18 DF,  p-value: 4.716e-08
# Use w for a weighted regression.

fm1 <- lm(y ~ x, data=dummy, weight=1/w^2) 

# Display the summary of the output of model fm1.

summary(fm1)
## 
## Call:
## lm(formula = y ~ x, data = dummy, weights = 1/w^2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7866 -0.8068  0.2128  0.8996  2.0073 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.677      1.138  -1.474    0.158    
## x              1.101      0.116   9.493 1.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.257 on 18 degrees of freedom
## Multiple R-squared:  0.8335, Adjusted R-squared:  0.8243 
## F-statistic: 90.12 on 1 and 18 DF,  p-value: 1.977e-08
# Make the columns in the data frame visible as variables.

attach(dummy)

# Make a nonparametric local regression function. 

lrf <- lowess(x, y)

# Standard point plot, with plotting character (pch) as bullet.

plot(x, y,pch=20) 

# Add in the local regression.

lines(x, lrf$y)

# The true regression line: (intercept 0, slope 1, with dotted line type )

abline(0, 1, lty=3)

# Unweighted regression line.

abline(coef(fm))

# Weighted regression line.

abline(coef(fm1), col = "red")

# A standard regression diagnostic plot to check for heteroscedasticity. Can you see it?

plot(fitted(fm), pch=20, resid(fm), xlab="Fitted values", ylab="Residuals", main="Residuals vs Fitted")

# How about now?

abline(0,0, col="red")  

# A normal scores plot to check for skewness, kurtosis and outliers.

qqnorm(resid(fm), main="Residuals Rankit Plot", pch=17)

# Cleaning up

rm(list = ls())

3.5 Illustration

In this section, we demonstrate exploring data about Taiwan elections in 2016. The Taiwan Election and Democratization Study (TEDS) is one of the longest and most comprehensive elections studies starting in 2001. TEDS collects data through different modes of surveys including face-to-face interviews, telephone interviews and internet surveys. More detail of TEDS can be found at the National Chengchi University Election Study Center website at https://esc.nccu.edu.tw/main.php.

Taiwan Election and Democratization Study 2016 data

# Import the TEDS 2016 data in Stata format using the haven package
## install.packages("haven")

library(haven)
TEDS_2016 <- read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")

# Prepare the analyze the Party ID variable 
# Assign label to the values (1=KMT, 2=DPP, 3=NP, 4=PFP, 5=TSU, 6=NPP, 7="NA")

TEDS_2016$PartyID <- factor(TEDS_2016$PartyID, labels=c("KMT","DPP","NP","PFP", "TSU", "NPP","NA"))

Take a look at the variable:

# Check the variable
attach(TEDS_2016)
head(PartyID)
## [1] NA  NA  KMT NA  NA  DPP
## Levels: KMT DPP NP PFP TSU NPP NA
tail(PartyID)
## [1] NA  NA  DPP NA  NA  NA 
## Levels: KMT DPP NP PFP TSU NPP NA

Frequency table:

# Run a frequency table of the Party ID variable using the descr package
## install.packages("descr")
library(descr)
freq(TEDS_2016$PartyID)

## TEDS_2016$PartyID 
##       Frequency  Percent
## KMT         388  22.9586
## DPP         591  34.9704
## NP            3   0.1775
## PFP          32   1.8935
## TSU           5   0.2959
## NPP          43   2.5444
## NA          628  37.1598
## Total      1690 100.0000

Get a better chart of the Party ID variable:

# Plot the Party ID variable
ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar()

We can attend to more detail of the chart, such as adding labels to x and y axes, and calculating the percentage instead of counts.

ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties")

Adding colors, with another theme:

ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw()

Hold on, colors are not right!

ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values=c("steel blue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))

To make the chart more meaningful, we can use a package called tidyverse to manage the data.

##install.packages("tidyverse")
library(tidyverse)
TEDS_2016 %>% 
  count(PartyID) %>% 
  mutate(perc = n / nrow(TEDS_2016)) -> T2
ggplot(T2, aes(x = reorder(PartyID, -perc),y = perc,fill=PartyID)) + 
  geom_bar(stat = "identity") +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values=c("steel blue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))

3.6 Exercise

Analyze the Tondu (統獨)variable using the following procedures:

  1. Assign label to each category
  2. Run a frequency table using descr
  3. Plot the variable using ggplot2

Hint:

  • Prepare the analyze the Tondu variable using these labesl: (“Unification now”,“Status quo, unif. in future”,“Status quo, decide later”,“Status quo forever”, “Status quo, indep. in future”, “Independence now”,“No response”)
  • Sample codes:
TEDS_2016$Tondu<-factor(TEDS_2016$Tondu,labels=c("Unification now","Status quo, unif. in >future","Status quo, decide later","Status quo forever", >"Status quo, indep. in future", "Independence now","No >response"))