Exploring survey data with the pewmethods R package

(Related posts: Introducing pewmethods: An R package for working with survey dataWeighting survey data with the pewmethods R package and Analyzing international survey data with the pewmethods R package)

The methods team at Pew Research Center regularly works with survey data in R, and we’ve written many functions to simplify daily tasks like cleaning, weighting and analyzing data. The pewmethods R package, available to the public on the Center’s GitHub page, evolved as a way to reuse and maintain this sort of code and share it with other researchers around the Center. Since many of the problems that these functions were designed to solve are not unique to our own projects, we’ve made the pewmethods package publicly available for other researchers who might find the functions useful, too.

This post goes through the process of exploring survey data in R using pewmethods, including recoding and collapsing data and displaying weighted estimates of categorical variables. Throughout these examples, I’ll make extensive use of the tidyverse set of R packages, which is a great tool for data manipulation that we highly recommend using along with pewmethods. You can learn more about using tidyverse in this blog post.

The example dataset

The package includes a survey dataset called dec13_excerpt, which contains selected variables from a survey conducted by Pew Research Center in December 2013. The data contains demographic and some outcome variables, as well as survey weights. You can learn more about the details by calling ?dec13_excerpt.

> dec13_excerpt
# A tibble: 2,001 x 14
   psraid cregion q1    q2    q45   sex   recage receduc racethn2
    <dbl> <fct>   <fct> <fct> <fct> <fct> <fct>  <fct>   <fct>   
 1 100005 Northe… Disa… Very… Disa… Male  55-64  HS gra… White~H…
 2 100007 South   Appr… Very… Appr… Fema… 55-64  Coll+   White~H…
 3 100019 South   Appr… Very… Appr… Fema… 55-64  Some c… Black~H…
 4 100020 Midwest Appr… Not … Appr… Fema… 65+    Some c… White~H…
 5 100021 Northe… Appr… Very… Appr… Fema… 65+    HS gra… Black~H…
 6 100023 Midwest Don'… NA    Disa… Male  45-54  HS gra… White~H…
 7 100027 Northe… Appr… Very… Appr… Fema… 65+    Some c… White~H…
 8 100031 South   Disa… Very… Disa… Fema… 55-64  Coll+   White~H…
 9 100034 Midwest Disa… Very… Disa… Fema… 55-64  Some c… White~H…
10 100037 South   Disa… Very… Disa… Male  35-44  Coll+   White~H…
# … with 1,991 more rows, and 5 more variables: party <fct>,
#   partyln <fct>, weight <dbl>, llweight <dbl>, cellweight <dbl>

Most Pew Research Center survey datasets, as well as those from other organizations, will have one or more variables for the survey weight. This weight is crucial for obtaining correct numbers from the survey data, since it allows the sample to resemble the overall U.S. adult population more closely. In dec13_excerpt, the weight variable is simply called weight, and we’ll be using it to look at weighted cross-tabulations of other variables in the dataset.

Cleaning and editing survey data

Let’s look at some outcome variables:

> names(dec13_excerpt)
 [1] "psraid"     "cregion"    "q1"         "q2"        
 [5] "q45"        "sex"        "recage"     "receduc"   
 [9] "racethn2"   "party"      "partyln"    "weight"    
[13] "llweight"   "cellweight"

We see three variables that look like survey outcomes: q1, q2 and q45. Let’s take a look at q1. As dec13_excerpt was originally stored as an IBM SPSS file, we can use the get_spss_label() function to view the label associated with q1. For Pew Research Center survey data, this will either be the question wording or a brief description:

> get_spss_label(dec13_excerpt, "q1")
[1] "Q.1 Do you approve or disapprove of the way Barack Obama is handling his job as President?"

q1 is an Obama approval question, so let’s run a quick table:

> tablena(dec13_excerpt$q1)
dec13_excerpt$q1, a factor
                  Approve                Disapprove 
                      839                      1042 
Don't know/Refused (VOL.) 
                      120

The tablena() function in pewmethods works the same way as base R’s table() function, except that it tells you the specific object you just ran a table on (along with its class), and will always display any NA values rather than hiding them by default.

Now let’s look at q2:

> get_spss_label(dec13_excerpt, "q2")
[1] "Q.2 Do you [approve/disapprove] very strongly, or not so strongly?"

q2 is a direct follow-up to q1. After asking respondents whether they approved or disapproved of Obama, we asked them whether they did so very strongly or not so strongly. So q1 and q2 are best analyzed together.

We can do this by creating a new variable, which we’ll call obama_approval_scale, that combines the two into a single variable with the categories Approve very strongly, Approve not so strongly, Disapprove not so strongly and Disapprove very strongly, as well as Don’t know/Refused (VOL.). The fct_case_when() function is a straightforward and readable way to create that combined variable. fct_case_when() is a wrapper around the case_when() function from dplyr that coerces its output into a factor whose levels are in the order that they were passed into the function.

We’ll create our new obama_approval_scale by calling the mutate function that can create new variables. (For more about how mutate() and case_when() work, read our earlier post about using tidyverse tools with Pew Research Center survey data in R.)

dec13_excerpt <- dec13_excerpt %>%
  mutate(
    obama_approval_scale = fct_case_when(
      q1 == “Approve” & q2 == “Very strongly” ~ “Approve very strongly”,
      q1 == “Approve” & q2 == “Not so strongly” ~ “Approve not so strongly”, 
      q1 == “Disapprove” & q2 == “Not so strongly” ~ “Disapprove not so strongly”,
      q1 == “Disapprove” & q2 == “Very strongly” ~ “Disapprove very strongly”,
      TRUE ~ “Don’t know/Refused (VOL.)”
    )
  )

The TRUE in the last line means that every case that doesn’t fall into one of the above conditions is to be recoded as Don’t know/Refused (VOL.).

Let’s view yet another quick table of our new obama_approval_scale variable to confirm that all responses were coded correctly:

> tablena(dec13_excerpt$obama_approval_scale)
dec13_excerpt$obama_approval_scale, a factor
     Approve very strongly    Approve not so strongly 
                       492                        293 
Disapprove not so strongly   Disapprove very strongly 
                       163                        863 
 Don't know/Refused (VOL.) 
                       190

Getting weighted estimates with get_totals

The all-purpose workhorse function for getting weighted estimates is get_totals(), which takes a categorical variable (either a character vector or a factor will do) and provides weighted or unweighted percentages or totals for each category. Rather than a table, get_totals() will instead return a data.frame, which allows for more flexible manipulation.

Let’s see what the percentages for each category in obama_approval_scale look like, first without any weights:

> get_totals("obama_approval_scale", dec13_excerpt, digits = 1)
        obama_approval_scale unweighted
1      Approve very strongly       24.6
2    Approve not so strongly       14.6
3 Disapprove not so strongly        8.1
4   Disapprove very strongly       43.1
5  Don't know/Refused (VOL.)        9.5

Now let’s look at Obama approval using the survey weight in the dataset:

> get_totals("obama_approval_scale", dec13_excerpt, wt = "weight", digits = 1)
        obama_approval_scale weight
1      Approve very strongly   25.7
2    Approve not so strongly   16.3
3 Disapprove not so strongly    9.2
4   Disapprove very strongly   38.9
5  Don't know/Refused (VOL.)    9.9

The weighted estimate of Obama approval looks quite different from the unweighted estimate. The survey weight in this case adjusts the sample to resemble the U.S. population more closely along a range of demographic variables including sex, education, race, Hispanic origin, census region, population density and whether respondents have only a landline phone, only a cell phone or both. We can put both the unweighted and weighted estimates side by side to make the comparison clearer via the include_unw argument:

> get_totals("obama_approval_scale", dec13_excerpt, wt = "weight", digits = 1, include_unw = TRUE)
        obama_approval_scale weight unweighted
1      Approve very strongly   25.7       24.6
2    Approve not so strongly   16.3       14.6
3 Disapprove not so strongly    9.2        8.1
4   Disapprove very strongly   38.9       43.1
5  Don't know/Refused (VOL.)    9.9        9.5

Some datasets may come with multiple weights. For example, there may be a weight that accounts only for the initial selection probabilities of each respondent and a weight that also accounts for nonresponse. We also occasionally construct different weights as part of the Center’s methodological research. get_totals can look at weighted estimates using multiple weights in unison: Just pass a character vector containing the names of all the weights you want to look at to the wt argument.

dec13_excerpt, for example, also comes with separate weights for landline and cellphone respondents. This is intended primarily for methodologists to assess the differences between the landline and cellphone samples. We can look at obama_approval_scale using each weight side by side:

> get_totals("obama_approval_scale", dec13_excerpt, wt = c("weight", "llweight", "cellweight"), digits = 1)
        obama_approval_scale weight llweight cellweight
1      Approve very strongly   25.7     24.4       26.0
2    Approve not so strongly   16.3     14.2       17.5
3 Disapprove not so strongly    9.2      9.2        9.5
4   Disapprove very strongly   38.9     41.5       38.2
5  Don't know/Refused (VOL.)    9.9     10.7        8.8

Weighted crosstabs

The by argument allows us to make cross-tabulations. For example, let’s look at Obama approval by education:

> get_totals("obama_approval_scale", dec13_excerpt, wt = "weight", 
             by = "receduc", digits = 1)
        obama_approval_scale weight_name HS grad or less
1      Approve very strongly      weight            28.1
2    Approve not so strongly      weight            16.3
3 Disapprove not so strongly      weight             8.8
4   Disapprove very strongly      weight            36.4
5  Don't know/Refused (VOL.)      weight            10.4
  Some coll/Assoc degree Coll+ DK/Ref
1                   21.5  26.8    0.0
2                   12.5  20.6   14.1
3                   11.1   7.7    0.0
4                   43.7  37.2   43.2
5                   11.3   7.7   42.7

In this table, the columns add to 100%, allowing you to make a statement like this one: “About a quarter (27%) of those with bachelor’s degrees or higher and 28% of those with a high school education or less approved of Obama very strongly, compared to 21% of those with some college or an associate’s degree.” But keep in mind that a simple crosstab like this can’t tell you whether these differences are statistically meaningful or not. You can do statistical testing of survey estimates — with uncertainty intervals — using the survey package.

We can clean the above table up a bit via including the breakdown of Obama approval among the full sample via the by_total argument, and by using the select function to remove the column that looks at Obama approval among people who didn’t answer the education question (which we don’t usually need to see). Perhaps we also want to round our numbers to integers:

> get_totals("obama_approval_scale", dec13_excerpt, wt = "weight", 
             by = "receduc", by_total = TRUE, digits = 0) %>%
    select(-`DK/Ref`)
        obama_approval_scale weight_name HS grad or less
1      Approve very strongly      weight              28
2    Approve not so strongly      weight              16
3 Disapprove not so strongly      weight               9
4   Disapprove very strongly      weight              36
5  Don't know/Refused (VOL.)      weight              10
  Some coll/Assoc degree Coll+ Total
1                     21    27    26
2                     12    21    16
3                     11     8     9
4                     44    37    39
5                     11     8    10

The above table looks at Obama approval among people with different levels of education. But what if we instead wanted to flip the premise and look at education among people with different levels of Obama approval? For output consistency reasons, get_totals doesn’t allow rows to sum to 100. Instead, you simply flip the var argument (the first one) and by.

> get_totals("receduc", dec13_excerpt, wt = "weight", 
             by = "obama_approval_scale",
             by_total = TRUE, digits = 0)
                 receduc weight_name Approve very strongly
1        HS grad or less      weight                    46
2 Some coll/Assoc degree      weight                    26
3                  Coll+      weight                    28
4                 DK/Ref      weight                     0
  Approve not so strongly Disapprove not so strongly
1                      42                         40
2                      24                         37
3                      34                         23
4                       0                          0
  Disapprove very strongly Don't know/Refused (VOL.) Total
1                       39                        44    42
2                       34                        35    31
3                       26                        21    27
4                        0                         0     0

If you pass an argument to by, the column names will be the names of the categories of the grouping variable, while the weight will receive its own column called weight_name, which is useful when you’re looking at multiple weights in unison. If no argument is passed to by, then the column names will be the names of the weights used.

Producing multiple crosstabs at once

Surveys often contain a lot of questions, and survey researchers often want to create a lot of crosstabs all at once. While each call to get_totals only allows one variable to be passed to var at a time, you can easily create multiple crosstabs in the same call via the map function from the purrr package, which is also part of the tidyverse. For example, if we want crosstabs for both obama_approval_scale and q45 (Obamacare approval), we can use map as follows:

> outcome_variables <- c("obama_approval_scale", "q45")
> my_xtabs <- map(outcome_variables, 
                ~get_totals(.x, dec13_excerpt, wt = "weight",
                            by = "receduc", by_total = TRUE, 
                            digits = 1) %>%
  select(-`DK/Ref`))
> my_xtabs
[[1]]
        obama_approval_scale weight_name HS grad or less
1      Approve very strongly      weight            28.1
2    Approve not so strongly      weight            16.3
3 Disapprove not so strongly      weight             8.8
4   Disapprove very strongly      weight            36.4
5  Don't know/Refused (VOL.)      weight            10.4
  Some coll/Assoc degree Coll+ Total
1                   21.5  26.8  25.7
2                   12.5  20.6  16.3
3                   11.1   7.7   9.2
4                   43.7  37.2  38.9
5                   11.3   7.7   9.9

[[2]]
                        q45 weight_name HS grad or less
1                   Approve      weight            41.5
2                Disapprove      weight            53.5
3 Don't know/Refused (VOL.)      weight             5.0
  Some coll/Assoc degree Coll+ Total
1                   34.2  48.2  41.1
2                   60.9  47.8  54.2
3                    4.9   4.0   4.7

We’ve just created a list of crosstabs called my_xtabs. To save these in a more permanent format, we can write this list of tables to a Microsoft Excel spreadsheet with the df_list_to_xlsx() function with the following line of code (commented out so that it won’t write any files to your system unless you remove the pound sign):

# df_list_to_xlsx(my_xtabs, sheet_name = "pew_xtabs", 
                  outfile = "pewmethods_demo_xtabs.xlsx")

df_list_to_xlsx() takes a list of data.frames like the crosstabs we just created, as well as arguments for what we want to name the Excel sheet and the file output path. There are optional arguments for adding labels to each crosstab and a title to the top row of the spreadsheet. df_list_to_xlsx() is also capable of writing an Excel file with multiple sheets. Details can be viewed by calling ?df_list_to_xlsx.

You can do a wide array of useful data munging tasks with pewmethods in combination with other functions from other packages as well as your own code. These are just a few examples. We hope you find the package useful!

More from Decoded

About Decoded