Chapter 3 Data methods

3.1 Data Production and Collection

Data can be classified by how they are generated into two types: made data and found data. The generation methods include:

  1. Survey
  2. Experiments
  3. Qualitative Data
  4. Text Data
  5. Web Data
  6. Machine Data
  7. Complex Data
  • Network Data
  • Multiple-source linked Data

The first three methods yield made (or produced) data: they go through a well-thought-out design process, and the data are generated by an instrument or by human interviews. These methods usually involve a sampling or pre-selection mechanism so that the data represent the population to a certain extent. The remaining methods yield found (or collected) data, which are primarily extracted from other sources without an instrument or a design process. The sampling mechanism is not pre-designed and usually depends on the properties of the extraction algorithm.
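
The effect of a designed versus an undesigned sampling mechanism can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the population of "activity scores", the sample size of 500, and the rule that a unit is captured with probability proportional to its score are all hypothetical choices made for illustration, not a description of any real extraction algorithm.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: one activity score per person, 0..9999.
population = list(range(10_000))
population_mean = statistics.mean(population)

# "Made" data: a designed simple random sample of 500 units.
made_sample = random.sample(population, k=500)

# "Found" data: no sampling design. Assumption for illustration only:
# a unit is captured with probability proportional to its score, a
# stand-in for an extraction process that mostly sees active users.
max_score = max(population)
found_sample = [x for x in population if random.random() < x / max_score]

# Absolute error of each sample mean relative to the population mean.
made_error = abs(statistics.mean(made_sample) - population_mean)
found_error = abs(statistics.mean(found_sample) - population_mean)

print(f"population mean:   {population_mean:.1f}")
print(f"made sample error:  {made_error:.1f} (n = {len(made_sample)})")
print(f"found sample error: {found_error:.1f} (n = {len(found_sample)})")
```

Under these assumptions the found sample is much larger than the made sample, yet its mean lies further from the population mean, because the selection mechanism favors high scores. A larger volume of data does not by itself compensate for an undesigned selection mechanism.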

Some researchers call the former type (made data) small data and the latter big data. Statistician Leo Breiman (2001) describes these two types of data as two cultures of statistical modeling.

3.2 Data Collection Showcase: Web Data

  1. APIs (application programming interfaces)
  • An API provides channels for interacting with, and retrieving, structured data.
  • Many data companies such as Facebook and Twitter, as well as public and private organizations, provide APIs that let developers or users access data directly.
  • API methods, however, restrict users to the data that the company’s restrictions allow.
  2. Web scraping
  • This method generally refers to using an algorithm or a simple program to obtain information directly from web pages.
  • Data collected using this method are generally raw and unstructured, meaning that more data curation and clean-up are needed before the data can be used.
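
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a real client: the JSON string stands in for an API response and the HTML string stands in for a web page, both invented here, and the class name CellCollector is a hypothetical helper. No network request is made; in practice the text would first be downloaded from the provider's documented endpoint or from the page's URL.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the data arrive already structured as JSON.
api_response = (
    '{"users": [{"name": "Ada", "followers": 120},'
    ' {"name": "Lin", "followers": 45}]}'
)
api_records = json.loads(api_response)["users"]

# Hypothetical web page carrying the same information as raw HTML.
html_page = """
<html><body>
  <h1>Members</h1>
  <table>
    <tr><th>Name</th><th>Followers</th></tr>
    <tr><td> Ada </td><td>120 followers</td></tr>
    <tr><td>Lin</td><td> 45 followers</td></tr>
  </table>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")   # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # text may arrive in several chunks


parser = CellCollector()
parser.feed(html_page)

# Clean-up step: skip the header row (it has no <td> cells), strip
# stray whitespace, and pull the number out of "120 followers".
scraped_records = [
    {"name": name.strip(), "followers": int(count.split()[0])}
    for name, count in (row for row in parser.rows if row)
]

print(api_records)
print(scraped_records)
```

The API path needs a single parsing call, while the scraping path needs a custom parser plus a clean-up step before the two results match. The scraping code also depends on the page layout, so it would need to be revised whenever that layout changes.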

The following chart from Munzert et al. (2015) illustrates the technologies for dealing with web data:

Source: Munzert et al. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
