2  Introduction

This chapter introduces the general principles of data programming, that is, coding that involves data. Data programming is a practice that works and evolves with data. Unlike the point-and-click approach, programming gives the user closer control over the data and allows it to be processed more effectively. Programs are designed to be replicable, both by the user and by collaborators. A data program can be developed and updated iteratively and incrementally; in other words, it builds on accumulated work without repeating earlier steps. It also takes debugging, which is the process of identifying problems (bugs) and, in practice, of updating the program so that it works in different situations or with different inputs, including when the programmer returns to it in the future.

2.1 Principles of Data Programming

Social scientists Gentzkow and Shapiro (2014) list several principles for data programming.

  1. Automation
    • For replicability (future-proof, for the future you)
  2. Version Control
  3. Directories/Modularity
    • Organize by functions and data chunks
  4. Keys
    • Index variable (relational)
  5. Abstraction
    • KISS (Keep It Short and Simple)
  6. Documentation
    • Comments for communicating to later users
  7. Management
    • Collaboration-ready
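
Two of these principles, keys and abstraction, can be illustrated in a few lines of base R. The sketch below is only an illustration and is not taken from Gentzkow and Shapiro: the data frames `countries` and `gdp`, the key column `iso3`, the made-up values, and the helper `standardize()` are all hypothetical names invented for the example.

```r
# Hypothetical example data: two tables that share a key column, iso3
countries <- data.frame(
  iso3 = c("USA", "CAN", "MEX"),
  name = c("United States", "Canada", "Mexico")
)
gdp <- data.frame(
  iso3 = c("USA", "CAN", "MEX"),
  gdp_per_capita = c(60, 45, 10)  # made-up values, for illustration only
)

# Keys: each row should be identified by exactly one value of the index variable
stopifnot(anyDuplicated(countries$iso3) == 0, anyDuplicated(gdp$iso3) == 0)

# Relational join on the key (merge() sorts the result by the key)
merged <- merge(countries, gdp, by = "iso3")

# Abstraction: write the transformation once as a function instead of
# copying and pasting the same formula for every variable
standardize <- function(v) {
  (v - mean(v)) / sd(v)
}
merged$gdp_z <- standardize(merged$gdp_per_capita)

print(merged)
```

Saving such a script in a file and rerunning it from top to bottom, rather than repeating the steps by hand, is one simple way to work toward the automation principle as well.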

2.2 Functionalities of Data Programs

A data program can provide or perform:

  1. Data source
  2. Documentation of data
  3. Importing and exporting data
  4. Management of data
  5. Visualization of data
  6. Data models
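
To make the list concrete, the following sketch walks through the six functions in order using `mtcars`, a data set that ships with R. It is one possible arrangement under simple assumptions, not a prescribed workflow: the temporary file path, the derived variable `wt_kg`, the 1500 kg cutoff, and the choice of a linear model are all assumptions made for illustration.

```r
# 1. Data source: mtcars is a data set bundled with base R
cars <- mtcars

# 2. Documentation: inspect the structure (the help page is ?mtcars)
str(cars)

# 3. Exporting and importing: round-trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(cars, path, row.names = TRUE)
cars2 <- read.csv(path, row.names = 1)

# 4. Management: derive a new variable and subset the rows
cars2$wt_kg <- cars2$wt * 1000 * 0.4536  # wt is recorded in units of 1000 lbs
heavy <- subset(cars2, wt_kg > 1500)

# 5. Visualization: scatterplot of weight against fuel economy
plot(cars2$wt_kg, cars2$mpg, xlab = "Weight (kg)", ylab = "Miles per gallon")

# 6. Data model: simple linear regression of mpg on weight
fit <- lm(mpg ~ wt_kg, data = cars2)
print(summary(fit))
```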

Sample R Programs:

R basics

# Create variables composed of random numbers
x <- rnorm(50)
y <- rnorm(length(x))

# Plot the points in the plane 
plot(x, y)

Using R packages

# A better plot, using the ggplot2 package
## Prerequisite: install and load the ggplot2 package
## install.packages("ggplot2")
library(ggplot2)
## Note: qplot() is deprecated as of ggplot2 3.4.0; it still runs but warns
qplot(x, y)

More R Data Visualization

# An even better plot with ggplot2
library(ggplot2)
ggplot(data.frame(x, y), aes(x, y)) + theme_bw() + geom_point(colour = "blue")