Secondary Data Sets

Alignment with Your Research Question

This is the most important factor. The data set must be capable of answering your research question.

  • Variable Availability: Do the variables you need exist in the data set? Are they measured in a way that is consistent with your research needs? For example, if your research question involves a specific construct like "job satisfaction," you need to ensure the data set has a valid measure for it.
  • Population and Scope: Does the data set's population match the population you want to study? If you are interested in a specific age group, industry, or geographic location, the data set must contain a sufficient and representative sample of that population.
  • Timeframe: Is the data current enough for your research? If you're studying a recent phenomenon, an older data set might be irrelevant. Conversely, if you're looking at historical trends, you'll need data that spans a significant period. 
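As a rough illustration, the three checks above can be run as a quick screening pass before committing to a data set. The sketch below is only an assumption-laden example: the records, the variable names (job_sat, tenure, industry, year), and the screen_dataset helper are all hypothetical and do not come from any particular data set.

```python
# Hypothetical records standing in for a secondary data set.
rows = [
    {"age": 34, "industry": "health", "year": 2021, "job_sat": 4},
    {"age": 51, "industry": "retail", "year": 2019, "job_sat": 2},
    {"age": 29, "industry": "health", "year": 2022, "job_sat": 5},
]


def screen_dataset(rows, required_vars, in_population, min_year):
    """Report on variable availability, population match, and timeframe."""
    # Variable availability: which required variables never appear?
    available = set()
    for row in rows:
        available.update(row.keys())
    missing = sorted(set(required_vars) - available)

    # Population and scope: how many records belong to the target group?
    matching = [row for row in rows if in_population(row)]

    # Timeframe: how many of those records are recent enough?
    recent = [row for row in matching if row.get("year", 0) >= min_year]

    return {
        "missing_variables": missing,
        "population_matches": len(matching),
        "recent_matches": len(recent),
    }


report = screen_dataset(
    rows,
    required_vars=["job_sat", "tenure"],
    in_population=lambda row: row["industry"] == "health",
    min_year=2020,
)
print(report)
```

A non-empty list of missing variables, or a very small count of matching records, suggests the data set may not be able to answer the research question. Counts alone do not show that the sample is representative, so the data set's sampling documentation still needs to be read.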

What is a Codebook? (Quantitative Data Sets)

A comprehensive codebook serves as a vital guide for a data set, ensuring that anyone using the data can understand its structure and content without ambiguity. It's especially crucial for reproducible research and data sharing. Here's a breakdown of its key components:

Variable Information: This is the core of the codebook. It provides a list of all variables in the data set, each with a unique name (e.g., age, gender, income). It also includes a variable label, which is a more descriptive name or a brief definition (e.g., "Age of respondent in years").

Data Types: The codebook specifies the type of data for each variable. This is a critical piece of information that determines how the data can be analyzed. This includes:

Categorical Data:

  • Nominal: Data with no inherent order (e.g., gender: male, female; eye color: blue, brown, green).

  • Ordinal: Data with a clear order or ranking (e.g., education level: high school, bachelor's, master's; satisfaction rating: very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).

Quantitative Data:

  • Discrete: Data that can only take on specific, countable values (e.g., number of children, number of items purchased).

  • Continuous: Data that can be measured along a continuous scale (e.g., height, weight, temperature).

  • Storage types: Codebooks also typically record how each variable is stored, such as numeric (for calculations), character/string (for text), or date.
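To make concrete how the data type determines the analysis, here is a minimal sketch in which each codebook entry carries a name, a label, and a measurement level, and the level selects a summary statistic. The Variable class, the summarize helper, and the sample values are hypothetical. The mapping used (mode for nominal, median for ordinal, mean for discrete and continuous) is a common rule of thumb, not the only defensible choice.

```python
from dataclasses import dataclass
import statistics


@dataclass
class Variable:
    name: str   # unique variable name
    label: str  # descriptive variable label
    level: str  # "nominal", "ordinal", "discrete", or "continuous"


def summarize(variable, values):
    """Pick a summary statistic that suits the variable's measurement level."""
    if variable.level == "nominal":
        # No inherent order: only the most frequent category is meaningful.
        return ("mode", statistics.mode(values))
    if variable.level == "ordinal":
        # Ordered categories: a median that is an observed value makes sense.
        return ("median", statistics.median_low(values))
    # Discrete or continuous: arithmetic on the values is meaningful.
    return ("mean", statistics.mean(values))


# Hypothetical codebook entries and data, for illustration only.
codebook = [
    Variable("eye_color", "Eye color of respondent", "nominal"),
    Variable("satisfaction", "Satisfaction rating, 1 (lowest) to 5 (highest)", "ordinal"),
    Variable("children", "Number of children", "discrete"),
]
data = {
    "eye_color": ["blue", "brown", "brown", "green"],
    "satisfaction": [2, 4, 4, 5],
    "children": [0, 2, 1, 1],
}

summaries = {v.name: summarize(v, data[v.name]) for v in codebook}
for name, (statistic, value) in summaries.items():
    print(f"{name}: {statistic} = {value}")
```

Storing the level next to each variable means an analysis script can refuse to, say, average eye-color codes, which is the kind of error a codebook is meant to prevent.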

Value Labels: For categorical variables, the codebook explains the codes used in the data set. For instance, if the variable "Gender" is coded as 1 and 2, the codebook would specify that 1 = Male and 2 = Female. This prevents misinterpretation.
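In practice, value labels are often applied by translating the stored codes back into their labels before analysis or reporting. The sketch below uses the 1 = Male, 2 = Female example from the text. The decode helper, the sample records, and the code 9 (included to show a code the codebook does not document) are hypothetical.

```python
# Value labels as they might be transcribed from a codebook.
value_labels = {
    "gender": {1: "Male", 2: "Female"},
}

# Hypothetical raw records; 9 is deliberately absent from the codebook.
records = [{"gender": 1}, {"gender": 2}, {"gender": 9}]


def decode(record, value_labels):
    """Replace coded values with their labels, flagging undocumented codes."""
    decoded = {}
    for variable, code in record.items():
        labels = value_labels.get(variable)
        if labels is None:
            # Variable has no value labels, so keep the raw value.
            decoded[variable] = code
        else:
            decoded[variable] = labels.get(code, f"Undocumented code: {code}")
    return decoded


decoded_records = [decode(record, value_labels) for record in records]
print(decoded_records)
```

Flagging undocumented codes, rather than silently dropping them, makes gaps between the data and the codebook visible early. Such codes often turn out to be missing-value markers that the codebook describes elsewhere.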