# Chapter 3 R Programming

## 3.1 What is R?

The R statistical programming language is a free, open source package based on the S language developed by John Chambers.

### 3.1.1 Some history of R and S

S was further developed into R by Robert Gentlemen (Canada) and Ross Ihaka (New Zealand)

### 3.1.2 It is:

- Large, probably one of the largest based on the user-written add-ons/procedures
- Object-oriented
- Interactive
- Multiplatform: Windows, Mac, Linux

According to John Chambers (2009), six facets of R:

- an interface to computational procedures of many kinds;
- interactive, hands-on in real time;
- functional in its model of programming;
- object-oriented, “everything is an object”;
- modular, built from standardized pieces; and,
- collaborative, a world-wide, open-source effort.

## 3.2 Why R?

- A programming platform environment
- Allow development of software/packages by users
- Currently, the CRAN package repository features over 14,000 available packages (as of May, 2019).
- Graphics!!!
- Scaleble and Portable
- Interface with other platform/langauges (e.g. C++, Python, JavaScript, Stan, SQL)
- Comparing R with other software?

Source: Oscar Torres-Reyna. 2010. Getting Started in R~Stata Notes on Exploring Data

## 3.3 RStudio

RStudio is a user interface for the statistical programming software R.

- Object-based environment
- Window system
- Point and click operations
- Coding recommended

- Expansions and development
- a multi-functional Integrated Development Environment (IDE)

## 3.4 Basic operations and object assignment

Arithmetic Operations:

+, -, *, /, ^ are the standard arithmetic operators.

Assignment

To assign a value to a variable use “<-” or “=”:

```
## Introduction to R sample program
## file: introR02.R
## Adapted from Venables, W.N., Smith, D.M. and Team, R.C., 2018. An Introduction to R, Version 3.5.1 (2018-07-02)
# Clear any existing objects
rm(list = ls())
# Generate x, y and w to demonstrate linear models and plots.
# Make x = (1,2,...,20).
<- 1:20
x
# Create A ‘weight’ vector of standard deviations.
<- 1 + sqrt(x)/2
w
# Create a data frame of two columns, x and y.
<- data.frame(x=x, y= x + rnorm(x)*w)
dummy
# Fit a simple linear regression
# With y to the left of the tilde then x, meaning y being dependent on x.
# Unlike other statistical packages, R does not display all output. It is recommended
# to create an object to store the estimates.
<- lm(y ~ x, data=dummy)
fm
# Display the summary of the output of model fm.
summary(fm)
```

```
##
## Call:
## lm(formula = y ~ x, data = dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0756 -1.3643 -0.1197 0.6690 5.6433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.50105 0.93931 -1.598 0.127
## x 1.09498 0.07841 13.964 4.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.022 on 18 degrees of freedom
## Multiple R-squared: 0.9155, Adjusted R-squared: 0.9108
## F-statistic: 195 on 1 and 18 DF, p-value: 4.24e-11
```

```
# Use w for a weighted regression.
<- lm(y ~ x, data=dummy, weight=1/w^2)
fm1
# Display the summary of the output of model fm1.
summary(fm1)
```

```
##
## Call:
## lm(formula = y ~ x, data = dummy, weights = 1/w^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.15459 -0.45691 0.00188 0.27934 1.88730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.12842 0.67397 -1.674 0.111
## x 1.06052 0.06872 15.432 8.01e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7448 on 18 degrees of freedom
## Multiple R-squared: 0.9297, Adjusted R-squared: 0.9258
## F-statistic: 238.1 on 1 and 18 DF, p-value: 8.007e-12
```

```
# Make the columns in the data frame visible as variables.
attach(dummy)
# Make a nonparametric local regression function.
<- lowess(x, y)
lrf
# Standard point plot, with plotting character (pch) as bullet.
plot(x, y,pch=20)
# Add in the local regression.
lines(x, lrf$y)
# The true regression line: (intercept 0, slope 1, with dotted line type )
abline(0, 1, lty=3)
# Unweighted regression line.
abline(coef(fm))
# Weighted regression line.
abline(coef(fm1), col = "red")
```

```
# A standard regression diagnostic plot to check for heteroscedasticity. Can you see it?
plot(fitted(fm), pch=20, resid(fm), xlab="Fitted values", ylab="Residuals", main="Residuals vs Fitted")
# How about now?
abline(0,0, col="red")
```

```
# A normal scores plot to check for skewness, kurtosis and outliers.
qqnorm(resid(fm), main="Residuals Rankit Plot", pch=17)
```

```
# Cleaning up
rm(list = ls())
```

## 3.5 Illustration

In this section, we demonstrate exploring data about Taiwan elections in 2016. The Taiwan Election and Democratization Study (TEDS) is one of the longest and most comprehensive elections studies starting in 2001. TEDS collects data through different modes of surveys including face-to-face interviews, telephone interviews and internet surveys. More detail of TEDS can be found at the National Chengchi University Election Study Center website at http://teds.nccu.edu.tw/main.php.

*Taiwan Election and Democratization Study 2016 data*

```
# Import the TEDS 2016 data in Stata format using the haven package
## install.packages("haven")
library(haven)
<- read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")
TEDS_2016
# Prepare the analyze the Party ID variable
# Assign label to the values (1=KMT, 2=DPP, 3=NP, 4=PFP, 5=TSU, 6=NPP, 7="NA")
$PartyID <- factor(TEDS_2016$PartyID, labels=c("KMT","DPP","NP","PFP", "TSU", "NPP","NA")) TEDS_2016
```

Take a look at the variable:

```
# Check the variable
attach(TEDS_2016)
head(PartyID)
```

```
## [1] NA NA KMT NA NA DPP
## Levels: KMT DPP NP PFP TSU NPP NA
```

`tail(PartyID)`

```
## [1] NA NA DPP NA NA NA
## Levels: KMT DPP NP PFP TSU NPP NA
```

Frequency table:

```
# Run a frequency table of the Party ID variable using the descr package
## install.packages("descr")
library(descr)
freq(TEDS_2016$PartyID)
```

```
## TEDS_2016$PartyID
## Frequency Percent
## KMT 388 22.9586
## DPP 591 34.9704
## NP 3 0.1775
## PFP 32 1.8935
## TSU 5 0.2959
## NPP 43 2.5444
## NA 628 37.1598
## Total 1690 100.0000
```

Get a better chart of the Party ID variable:

```
# Plot the Party ID variable
library(ggplot2)
ggplot(TEDS_2016, aes(PartyID)) +
geom_bar()
```

We can attend to more detail of the chart, such as adding labels to x and y axes, and calculating the percentage instead of counts.

```
ggplot(TEDS_2016, aes(PartyID)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
ylab("Party Support (%)") +
xlab("Taiwan Political Parties")
```

Adding colors, with another theme:

```
ggplot(TEDS_2016, aes(PartyID)) +
geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) +
scale_y_continuous(labels=scales::percent) +
ylab("Party Support (%)") +
xlab("Taiwan Political Parties") +
theme_bw()
```

Hold on, colors are not right!

```
ggplot(TEDS_2016, aes(PartyID)) +
geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) +
scale_y_continuous(labels=scales::percent) +
ylab("Party Support (%)") +
xlab("Taiwan Political Parties") +
theme_bw() +
scale_fill_manual(values=c("steel blue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))
```

To make the chart more meaningful, we can use a package called tidyverse to manage the data.

```
##install.packages("tidyverse")
library(tidyverse)
%>%
TEDS_2016 count(PartyID) %>%
mutate(perc = n / nrow(TEDS_2016)) -> T2
ggplot(T2, aes(x = reorder(PartyID, -perc),y = perc,fill=PartyID)) +
geom_bar(stat = "identity") +
ylab("Party Support (%)") +
xlab("Taiwan Political Parties") +
theme_bw() +
scale_fill_manual(values=c("steel blue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))
```

## 3.6 Exercise

Analyze the Tondu (統獨）variable using the following procedures:

- Assign label to each category
- Run a frequency table using descr
- Plot the variable using ggplot2

Hint:

- Prepare the analyze the Tondu variable using these labesl: (“Unification now,”“Status quo, unif. in future,”“Status quo, decide later,”“Status quo forever,” “Status quo, indep. in future,” “Independence now,”“No response”)
- Sample codes:

`TEDS_2016$Tondu<-factor(TEDS_2016$Tondu,labels=c("Unification now","Status quo, unif. in >future","Status quo, decide later","Status quo forever", >"Status quo, indep. in future", "Independence now","No >response"))`