6  Data Management with R

This chapter focuses on the “Data” part of data programming: methods for managing data not only locally but also in the cloud. It covers the use of sparklyr to interact with Spark, and it introduces basic concepts of relational databases, database management systems, and interfacing with database servers.

Spark with R

Apache Spark is a distributed computing platform designed to run analytics at large scale on big data (Luraschi, Kuo, and Ruiz 2020).

Distributed computing in Spark builds on Google’s MapReduce, which consists of two operations: map and reduce. The map operation provides an arbitrary way to transform each file into a new file, whereas the reduce operation combines two files. Both operations require custom computer code, but the MapReduce framework takes care of automatically executing them across many computers at once. These two operations are sufficient to process all the data available on the web, while also providing enough flexibility to extract meaningful information from it (Luraschi, Kuo, and Ruiz 2020, Chapter 1).
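The map and reduce steps can be illustrated on a single machine with base R's `Map()` and `Reduce()`. The following is only a minimal sketch of the idea, not how Spark or Hadoop are implemented: the three "files" are made-up toy data, and the helper names `count_words` and `merge_counts` are hypothetical.

```r
# Three small "files", each a character vector of lines (toy data for illustration)
files <- list(
  a = c("spark runs in memory", "spark scales out"),
  b = c("hadoop writes to disk"),
  c = c("spark and hadoop both build on mapreduce")
)

# Map step: transform each file into a new "file", here a named vector of word counts
count_words <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  counts <- table(words[words != ""])
  setNames(as.vector(counts), names(counts))
}
mapped <- Map(count_words, files)

# Reduce step: combine two word-count vectors into one
merge_counts <- function(x, y) {
  all_words <- union(names(x), names(y))
  totals <- setNames(numeric(length(all_words)), all_words)
  totals[names(x)] <- totals[names(x)] + x
  totals[names(y)] <- totals[names(y)] + y
  totals
}
word_counts <- Reduce(merge_counts, mapped)

word_counts[["spark"]]   # 3
word_counts[["hadoop"]]  # 2
```

In a real MapReduce framework, the map step would run on many machines at once, each holding a subset of the files, and the reduce step would merge their partial results pair by pair.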

Spark provides a richer set of verbs beyond MapReduce, which makes it easier to optimize code running on multiple machines. Spark also loads data in memory, making operations much faster than Hadoop’s on-disk storage.

This chapter introduces Spark with R, a combination that is particularly well suited to computationally intensive modeling with big data.

To get started, install Spark on your local machine (RStudio Cloud does not support Spark yet).

The following code installs Spark locally and runs a simple modeling procedure on Taiwan election data.


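# Load sparklyr (the R interface to Spark) and dplyr (data manipulation verbs)
library(sparklyr)
library(dplyr)

# Install Spark on your local computer, treating it like a cluster
# (internet connection required; only needs to be run once)
spark_install()
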

# Connect to cluster
sc <- spark_connect(master = "local")

# Check Spark connections and environment
spark_connection_is_open(sc)

# Copy data to Spark session's memory
# (TEDS2016 is assumed to be a data frame already loaded in the R session)
tbl_teds16 <- copy_to(sc, TEDS2016, "spark_teds2016")

# Alternative method to load local csv to Spark
spark_read_csv(sc, name = "teds16",  path = "/path/TEDS2016.csv")

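# Disconnect (run this only when you are finished; the steps below need an open connection)
# spark_disconnect(sc)

# Summary statistics for the outcome variable in the Spark table
sdf_describe(tbl_teds16, cols = "votetsai")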

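# Keep the outcome and predictors, then split the rows 50/50 into training and test sets
partitions <- tbl_teds16 %>%
  select(votetsai, dpp, kmt, unify, statusquo, female) %>%
  sdf_random_split(training = 0.5, test = 0.5, seed = 1099)

# Fit a logistic regression of votetsai on all other selected variables
fit <- partitions$training %>%
  ml_logistic_regression(votetsai ~ .)

# Predict on the held-out test set
pred <- ml_predict(fit, partitions$test)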

## Use Spark and H2O
### https://docs.h2o.ai/sparkling-water/3.3/latest-stable/doc/rsparkling.html

if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

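# Install packages H2O depends on
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Install the H2O package for R
# (the H2O version must match the Sparkling Water release; see the documentation linked above)
install.packages("h2o")

# Download, install, and initialize the RSparkling
# (the repos path may need the specific release directory given in the documentation linked above)
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.3/")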


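# Load the packages used below
library(sparklyr)
library(rsparkling)
library(h2o)

# Connect to a local Spark cluster with a Spark version that matches RSparkling
# (if tbl_teds16 was created on a different connection, copy the data again with copy_to())
sc <- spark_connect(master = "local", version = "3.3.0")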

h2oConf <- H2OConf()
hc <- H2OContext.getOrCreate(h2oConf)
vote_h2o <- hc$asH2OFrame(tbl_teds16)

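# votetsai is a binary outcome, so a binomial family matches the logistic regression above
vote_glm <- h2o.glm(x = c("dpp", "female"),
                    y = "votetsai",
                    training_frame = vote_h2o,
                    family = "binomial",
                    lambda_search = TRUE)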