SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.9 Duplicate observations

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

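    library(tidyverse)

    psid_path <- file.path("..", "datasets", "PSID.csv")
    psid_in <- read_csv(psid_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    # Give the columns descriptive names.
    psid_in <-
      rename(
        psid_in,
        obs_num = X1,
        intvw_num = intnum,
        person_id = persnum,
        marital_status = married
      )
    # Drop the row-number column; the remaining 8 variables are shown below.
    psid <-
      psid_in %>%
      select(-obs_num)
    glimpse(psid)
    Observations: 4,856
    Variables: 8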
    Variables: 8
    $ intvw_num      <dbl> 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 10, 10, 10, 11, 1...
    $ person_id      <dbl> 4, 6, 7, 173, 2, 4, 172, 4, 170, 171, 3, 171, 1...
    $ age            <dbl> 39, 35, 33, 39, 47, 44, 38, 38, 39, 37, 48, 47,...
    $ educatn        <dbl> 12, 12, 12, 10, 9, 12, 16, 9, 12, 11, 13, 12, 1...
    $ earnings       <dbl> 77250, 12000, 8000, 15000, 6500, 6500, 7000, 50...
    $ hours          <dbl> 2940, 2040, 693, 1904, 1683, 2024, 1144, 2080, ...
    $ kids           <dbl> 2, 2, 1, 2, 5, 2, 3, 4, 3, 5, 98, 3, 0, 0, 2, 0...
    $ marital_status <chr> "married", "divorced", "married", "married", "m...

  2. What variables define an observation in this data set?

    Three variables together define an observation: the interviewer number (intvw_num), the number identifying a person (person_id), and that person's age (age).

  3. Are there any duplicate observations?

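    # Flag every row whose identifying values appear more than once,
    # including the first copy (hence the second call with fromLast = TRUE).
    dups <-
      duplicated(select(psid, intvw_num, person_id, age)) |
      duplicated(select(psid, intvw_num, person_id, age), fromLast = TRUE)
    dup_person_age <-
      psid %>%
      filter(dups) %>%
      select(person_id, age, educatn, earnings) %>%
      arrange(person_id, age)
    dup_person_age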
    # A tibble: 0 x 4
    # ... with 4 variables: person_id <dbl>, age <dbl>, educatn <dbl>,
    #   earnings <dbl>

    There are no duplicate observations.
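
Because the PSID check returns an empty tibble, it may help to see what the same pattern does when duplicates are present. The following is a minimal sketch on made-up data: the `toy` tibble and its values are hypothetical and are not taken from PSID.csv. It shows why `duplicated()` is called twice (once with `fromLast = TRUE`), and it includes a `group_by()` / `filter(n() > 1)` alternative, which is one other common way to do this rather than the method used in the exercise.

```r
library(dplyr)

# Hypothetical toy data (not the PSID file); rows 1 and 3 share the same
# intvw_num, person_id, and age.
toy <- tibble(
  intvw_num = c(4, 4, 4, 5),
  person_id = c(4, 6, 4, 2),
  age       = c(39, 35, 39, 47),
  earnings  = c(77250, 12000, 80000, 6500)
)

keys <- select(toy, intvw_num, person_id, age)

# duplicated() alone flags only the later copy of each repeated key.
later_copies <- duplicated(keys)

# Combining it with fromLast = TRUE flags every row that shares a key.
all_copies <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# A dplyr-only alternative: keep the key groups that have more than one row.
toy_dups <-
  toy %>%
  group_by(intvw_num, person_id, age) %>%
  filter(n() > 1) %>%
  ungroup()

sum(later_copies)  # number of rows flagged by duplicated() alone
sum(all_copies)    # number of rows flagged by the combined condition
toy_dups
```

In this toy data the combined condition and the grouped filter both select the two rows that share a key, while `duplicated()` on its own would miss the first of them.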