Chapter 2 Introduction

Big Data in STEM is a survey course in Data Science. It features hands-on examples and exercises covering the concepts, theories and technical skills needed to master big data in STEM and related disciplines. The training focuses on data theory, data collection methods, data visualization and data modeling. Practical programming exercises give students the chance to program with and process real data. Students are encouraged to apply the newly learnt theories and concepts to solve real-world problems with creative solutions.

The first chapter introduces the general theory of data, followed by methods of data collection and management.

2.1 What is data?

  1. Kinds of Data
    1. Quantitative vs. Qualitative
    2. Structured vs. Semi-/unstructured
  2. Measurement
    1. Nominal
    2. Ordinal
    3. Interval
    4. Ratio
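
The four levels of measurement listed above differ in which operations give meaningful answers. The following is a minimal sketch in Python using only the standard library. All variable names and values are hypothetical toy data invented for illustration, not course data.

```python
from statistics import mean, mode

# Hypothetical toy records, invented for illustration only.
blood_type = ["A", "O", "B", "O", "AB"]                       # nominal
satisfaction = ["low", "high", "medium", "high", "medium"]    # ordinal
temp_celsius = [20.0, 25.0, 30.0, 35.0, 40.0]                 # interval
mass_kg = [2.0, 4.0, 8.0, 4.0, 2.0]                           # ratio

# Nominal: categories have no order, so only counting (e.g. the mode) applies.
nominal_mode = mode(blood_type)

# Ordinal: categories can be ranked, so a median is meaningful
# once the order is declared explicitly.
order = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(satisfaction, key=order.get)
ordinal_median = ranked[len(ranked) // 2]

# Interval: differences and means are meaningful, but ratios are not,
# because the zero point is arbitrary. Changing the unit changes the ratio.
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

interval_mean = mean(temp_celsius)
ratio_c = temp_celsius[-1] / temp_celsius[0]
ratio_f = c_to_f(temp_celsius[-1]) / c_to_f(temp_celsius[0])

# Ratio: there is a true zero, so "four times as heavy" holds in any unit.
ratio_kg = max(mass_kg) / min(mass_kg)
ratio_g = (max(mass_kg) * 1000) / (min(mass_kg) * 1000)

print("mode of blood type:", nominal_mode)
print("median satisfaction:", ordinal_median)
print("mean temperature (C):", interval_mean)
print("temperature ratio in C vs F:", ratio_c, round(ratio_f, 3))
print("mass ratio in kg vs g:", ratio_kg, ratio_g)
```

The temperature ratio changes when the unit changes, while the mass ratio does not. That is the practical difference between the interval and ratio levels.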

2.2 What is big data?

Big data refers to data whose volume is so large that it cannot fit on a single computer, that comes in a wide variety of types, locations, formats and forms, and that is created very quickly (velocity) (Doug Laney 2001).

According to Burt Monroe (2012), the new 5Vs of big data include:

  • Volume
  • Variety
  • Velocity
  • Vinculation
  • Validity

Vinculation means “binding together”; it emphasizes the interdependent nature, or “networkedness”, of social data and big data.

Validity, or veracity, refers to the quality of the data and how relevant big data is to the question of interest, in particular how one can draw inferences from big data.

Some researchers also add “value” to emphasize whether the data provide insight into the problem(s) rather than confusion.

2.3 Illustration: The story of Google Flu Trends

Using big data from search queries, Google Flu Trends (GFT) predicted the rate of flu-like illness in a population.

The findings were published in Nature, one of the top science journals, in 2008. Not long afterwards, however, GFT failed: it missed the peak of the 2013 flu season by 140 percent.

Lazer, Kennedy, King and Vespignani (2014) scrutinized the failure and identified the problem. They noted that “Traditional ‘small data’ often offer information that is not contained (or containable) in big data”, and that “by combining GFT and lagged [traditional] CDC data, as well as dynamically recalibrating GFT… one can substantially improve on the performance of GFT or the CDC alone” (Lazer et al. 2014, Science).
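
To make the idea of combining a big-data signal with lagged traditional data concrete, here is a minimal sketch in Python. It is not the procedure used by Lazer et al. and it uses no GFT or CDC data. The series `truth` and `search` are synthetic, the two-week `LAG` is an assumption, and the helpers `solve`, `ols` and `sse` are hypothetical names written for this example. The fit is in-sample only and does not cover the dynamic recalibration mentioned in the quote.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def sse(pred, y):
    """Sum of squared errors between predictions and observed values."""
    return sum((p - t) ** 2 for p, t in zip(pred, y))

# Synthetic weekly series (assumption): a seasonal "true" illness rate and
# a search-based signal that overshoots it and carries extra distortion.
LAG = 2  # assumed reporting delay of the traditional data, in weeks
N = 60
truth = [2.0 + math.sin(2 * math.pi * t / 26) for t in range(N)]
search = [1.5 * truth[t] + 0.3 * math.cos(1.7 * t) for t in range(N)]

weeks = range(LAG, N)
y = [truth[t] for t in weeks]

# Two single-source estimates: the raw search signal, and the most recent
# traditional report, which is LAG weeks old.
search_only = [search[t] for t in weeks]
lagged_only = [truth[t - LAG] for t in weeks]

# Combined estimate: regress the current rate on an intercept, the search
# signal and the lagged traditional report.
rows = [[1.0, search[t], truth[t - LAG]] for t in weeks]
beta = ols(rows, y)
combined = [sum(b * x for b, x in zip(beta, r)) for r in rows]

print("SSE, search signal only:", round(sse(search_only, y), 3))
print("SSE, lagged report only:", round(sse(lagged_only, y), 3))
print("SSE, combined regression:", round(sse(combined, y), 3))
```

Because each single-source estimate is a special case of the combined regression, the combined in-sample error cannot exceed either of them. Whether the gain carries over to new data depends on the data, and that is the case for the recalibration the authors describe.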

Lesson learnt: Google arguably has the greatest access to data of any organization, yet it still failed. Does size still matter? Yes, but it does not come first.