Weighting survey data with the pewmethods R package

This post is a companion piece to our basic introduction for researchers interested in using the new pewmethods R package, which includes tips on how to recode and collapse data and display weighted estimates of categorical variables. Here, I’ll go through the process of weighting and analyzing a survey dataset. Along the way, I’ll show you how to use pewmethods to clean and recode the variables we’ll use for weighting, create weighting parameters from external data sources, and rake and trim survey weights.

These examples make extensive use of the tidyverse set of R packages. You can learn more about using the tidyverse with this post.

The example dataset

The package includes a survey dataset called dec13_excerpt, which contains selected variables from a survey conducted by Pew Research Center in December 2013. The data contains demographics and some outcome variables, as well as survey weights. You can learn more about the details by calling ?dec13_excerpt.

> dec13_excerpt
# A tibble: 2,001 x 14
   psraid cregion q1    q2    q45   sex   recage receduc
    <dbl> <fct>   <fct> <fct> <fct> <fct> <fct>  <fct>  
 1 100005 Northe… Disa… Very… Disa… Male  55-64  HS gra…
 2 100007 South   Appr… Very… Appr… Fema… 55-64  Coll+  
 3 100019 South   Appr… Very… Appr… Fema… 55-64  Some c…
 4 100020 Midwest Appr… Not … Appr… Fema… 65+    Some c…
 5 100021 Northe… Appr… Very… Appr… Fema… 65+    HS gra…
 6 100023 Midwest Don'… NA    Disa… Male  45-54  HS gra…
 7 100027 Northe… Appr… Very… Appr… Fema… 65+    Some c…
 8 100031 South   Disa… Very… Disa… Fema… 55-64  Coll+  
 9 100034 Midwest Disa… Very… Disa… Fema… 55-64  Some c…
10 100037 South   Disa… Very… Disa… Male  35-44  Coll+  
# … with 1,991 more rows, and 6 more variables:
#   racethn2 <fct>, party <fct>, partyln <fct>,
#   weight <dbl>, llweight <dbl>, cellweight <dbl>

For simplicity, let’s assume we want to weight our survey by the marginal distribution of age and the cross-classification of sex and education. (In practice, we use a number of additional variables and cross-classifications beyond these.) Let’s run some basic tables on these variables in the dec13_excerpt dataset:

> table(dec13_excerpt$sex)

  Male Female 
   968   1033
> table(dec13_excerpt$recage)

 18-24  25-34  35-44  45-54  55-64    65+ DK/Ref 
   172    216    255    362    406    550     40
> table(dec13_excerpt$receduc)

       HS grad or less Some coll/Assoc degree 
                   646                    571 
                 Coll+                 DK/Ref 
                   779                      5

Creating weighting targets from population data

Before doing anything with the survey data itself, we need to determine our weighting target parameters — that is, we need to know what the marginal distributions of age and the cross-classification of sex and education look like in the population of interest that we’re trying to represent with a survey. We use external benchmark data to create weighting targets that reflect the population distribution for our chosen weighting variables. These targets are typically derived from population data published by the U.S. Census Bureau or other government agencies. For example, we can download public use microdata from the American Community Survey and use that data to obtain target distributions.

For this demonstration, we’ll use a condensed American Community Survey dataset called acs_2017_excerpt. This is not the original ACS dataset (which can be found here), but a summary table created using the 2017 one-year PUMS. It has columns for sex, age and education variables that have been recoded into the categories that Pew Research Center typically uses in its survey weighting. It has a total of 36 rows, one for every combination of sex (two categories), age (six categories) and education (three categories). Each row is associated with a weight proportional to that row’s share of the non-institutionalized U.S. adult population:

> acs_2017_excerpt
# A tibble: 36 x 4
   sex   recage receduc                weight
   <fct> <fct>  <fct>                   <dbl>
 1 Male  18-24  HS grad or less         1.09 
 2 Male  18-24  Some coll/Assoc degree  0.955
 3 Male  18-24  Coll+                   0.208
 4 Male  25-34  HS grad or less         1.17 
 5 Male  25-34  Some coll/Assoc degree  0.986
 6 Male  25-34  Coll+                   1.04 
 7 Male  35-44  HS grad or less         1.12 
 8 Male  35-44  Some coll/Assoc degree  0.815
 9 Male  35-44  Coll+                   0.964
10 Male  45-54  HS grad or less         1.24 
# … with 26 more rows

When you begin this process from scratch, you’ll need to acquire the benchmark data, recode variables into your desired categories, and use the appropriate weight that should be attached to the benchmark dataset. All of that work has already been done in this post..

We can use the function create_raking_targets() to create summaries of these demographic distributions from the benchmark dataset using the code below. “Raking” refers to a procedure in which the marginal distributions of a selected set of variables in the sample are iteratively adjusted to match target distributions.

targets <- create_raking_targets(
  acs_2017_excerpt,
  vars = c("recage", "sex:receduc"),
  wt = “weight”
)

The first argument identifies the dataset to use in creating the targets. The vars argument lists the names of the variables to be summarized. When variable names are joined by a colon, it means we want the cross-classification or interaction between those variables. Finally, the wt argument takes the name of the weight variable that should be used for the calculation. If you do not specify a weight variable, create_raking_targets() will produce an error. This is to prevent you from accidentally creating incorrect weighting targets based on unweighted benchmark data.

The code above produces the following list containing the target distributions for age and sex by education. In each table, the Freq column contains the percentage of the total population belonging to each category, with each table summing to 100. These are the percentages we should expect to see in our own survey data after we’ve finished weighting:

> targets
[[1]]
# A tibble: 6 x 2
  rk_recage  Freq
  <fct>     <dbl>
1 18-24      12.3
2 25-34      17.8
3 35-44      16.4
4 45-54      16.9
5 55-64      16.8
6 65+        19.9

[[2]]
# A tibble: 6 x 2
  rk_sex_receduc                 Freq
  <fct>                         <dbl>
1 Male:HS grad or less           19.9
2 Male:Some coll/Assoc degree    14.4
3 Male:Coll+                     14.0
4 Female:HS grad or less         19.3
5 Female:Some coll/Assoc degree  16.6
6 Female:Coll+                   15.8

While we could have come up with these percentages in a variety of ways, create_raking_targets() returns them in the format that is expected by the rake_survey() function that we’ll use to perform the actual weighting. This format is also compatible with the survey package since this is what rake_survey() itself uses to do the computation.

Cleaning and recoding survey data

To weight our survey, we need to have variables in our dataset that have exactly the same names, categories and labels as the variables in the list of weighting parameters. If we compare the categories in the variables in dec13_excerpt to our weighting targets above, we can see some important differences:

> levels(dec13_excerpt$sex)
[1] "Male"   "Female"
> levels(dec13_excerpt$recage)
[1] "18-24"  "25-34"  "35-44"  "45-54"  "55-64"  "65+"   
[7] "DK/Ref"
> levels(dec13_excerpt$receduc)
[1] "HS grad or less"        "Some coll/Assoc degree"
[3] "Coll+"                  "DK/Ref"

In the survey data, sex, recage and receduc have categories and labels that match those in the weighting targets, but recage and receduc also have DK/Ref (short for “Don’t Know/Refused”) categories for respondents who declined to give an answer. The survey data also lacks a variable containing the cross-classification of age and education. Finally, the variable names in the weighting targets begin with the prefix “rk_” while the survey variables do not. We’ll need to iron out all of these inconsistencies before we can move on to weighting.

Dealing with DKs

The DK/Ref responses to age and education pose a problem for weighting because every respondent in the sample needs to have complete data on all the parameters that will be used for weighting. We get around this problem using imputation, which means replacing the DKs with statistically plausible values from among the other categories. At the same time, we generally want to retain the DK/Ref responses in our dataset, because they can tell us important things about the respondents.

Instead of replacing the original DK/Ref values in our data, we’ll make copies of all the variables, convert any DK/Ref values to NA and impute the copies. By convention, we refer to these copies as the “raking” variables and include the prefix “rk_” in their variable names so they can be easily identified.

The dk_to_na() function makes it easy to create separate raking variables and get them ready for imputation by converting DK/Ref responses into NA. By default, the function will search for variations on the string “Don’t Know/Refused” that appear frequently in Pew Research Center survey data, but you can substitute any other regular expression if needed.

dec13_excerpt_raking <- dec13_excerpt %>%
  mutate(rk_recage = dk_to_na(recage),
         rk_receduc = dk_to_na(receduc),
         rk_sex = dk_to_na(sex))

The tablena() function is a quick way to display a table that will display NA values by default if they exist. It also shows the name and class of any variables in the table. Looking at the bottom left corner of the table below, we can see that in our newly created variable rk_recage, the DK/Ref responses have been converted to NA. The variable name and its factor levels match the corresponding weighting target. All that’s left is to impute the missing values.

> with(dec13_excerpt_raking, tablena(recage, rk_recage))
                rk_recage, a factor
recage, a factor 18-24 25-34 35-44 45-54 55-64 65+ <NA>
          18-24    172     0     0     0     0   0    0
          25-34      0   216     0     0     0   0    0
          35-44      0     0   255     0     0   0    0
          45-54      0     0     0   362     0   0    0
          55-64      0     0     0     0   406   0    0
          65+        0     0     0     0     0 550    0
          DK/Ref     0     0     0     0     0   0   40

The impute_vars() function is a wrapper around the mice() function from the package of the same name that will singly impute specified variables in a dataset. (We use single imputation because weighting variables generally contain very small amounts of missing data that result in very little variability in the resulting survey weights at the end of the process, but you should exercise caution if you want to weight on variables with a lot of missing data.) The function is designed around this workflow, so by default it only imputes missing values for variables with the “rk_” prefix in their name. Alternatively, you can specify the variables to impute by passing a vector of variable names.

Below, we’ll create a new dataset called dec13_excerpt_imputed, where the missing values in the raking variables have been filled in. By default, impute_vars() uses a form of hot-deck imputation based on random forests and the ranger package. This is a very fast and convenient way to impute small amounts of item-missing data to facilitate weighting. The seed argument can be any number you want. It ensures that the imputation, which has some randomness built in, can be reproduced exactly every time as long as the same seed is used.

> dec13_excerpt_imputed <- impute_vars(dec13_excerpt_raking, 
                                       seed = 739)
No input to to_impute argument found. Imputing variables with prefix rk_ by default.

 iter imp variable
  1   1  rk_recage  rk_receduc
  2   1  rk_recage  rk_receduc
  3   1  rk_recage  rk_receduc
  4   1  rk_recage  rk_receduc
  5   1  rk_recage  rk_receduc

We can run tablena() again to confirm that the raking variables were successfully imputed.

> tablena(dec13_excerpt_imputed$rk_recage)
dec13_excerpt_imputed$rk_recage, a factor
18-24 25-34 35-44 45-54 55-64   65+ 
  175   222   262   373   409   560 
> tablena(dec13_excerpt_imputed$rk_receduc)
dec13_excerpt_imputed$rk_receduc, a factor
       HS grad or less Some coll/Assoc degree 
                   648                    573 
                 Coll+ 
                   780

Finally, because we intend to weight on the cross-classification or interaction between sex and education, we create that variable using imputed sex and imputed education, making sure it has the same name and same factor levels as the target created by create_raking_targets():

dec13_excerpt_imputed <- dec13_excerpt_imputed %>%
 mutate(rk_sex_receduc = interaction(rk_sex, rk_receduc, sep = “:”))

Creating the weights

After creating raking targets from the population data and raking varaibles in the sample data, we can finally create the raking weight, which we’ll call weight2 (to differentiate from the dataset already having a “proper” weight variable).

> dec13_excerpt <- dec13_excerpt %>%
+     mutate(weight2 = rake_survey(dec13_excerpt_imputed, 
                                   pop_margins = targets))
> summary(dec13_excerpt$weight2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5073  0.7321  0.9226  1.0000  1.1861  2.0953

The weight is created by calling rake_survey() on the dec13_excerpt_imputed dataset we created, but we use mutate() to attach this weight to the original dec13_excerpt dataset, setting aside all the additional temporary variables we created along the way. rake_survey() also contains additional optional arguments for declaring a base weight, setting the amount of tolerance for the raking algorithm and other such tweaks that you can look up using ?rake_survey or ?survey::calibrate, around which rake_survey() is wrapped.

Diagnostic tools

The calculate_deff() function can be used to obtain the Kish approximation of the design effect, as well as the effective sample size and a conservative margin of error estimate.

> calculate_deff(dec13_excerpt$weight2)
# A tibble: 1 x 6
      n mean_wt sd_wt  deff   ess   moe
  <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
1  2001      1. 0.369  1.14 1761.  2.34

The design effect deff here can be thought of as a multiplier representing additional variance in your estimates due to weighting. If you take your sample size n and divide it by deff, you’ll get the effective sample size ess, which tells you the sample size of a true simple random sample with the same variance as your weighted survey. For example, even though dec13_excerpt has a raw sample size of 2001, the margin of error moe for weight2 is 2.34, which is also what you would get if you had a simple random sample with a sample size of 1761.

You should also confirm that the weighted estimates of weighting variables such as rk_recage match the weighting targets.

Trimming weights

In some cases, weighting survey data results in particularly large or small weights, which can reduce the effective sample size. Weight trimming is one way to reduce the design effect from weighting by setting bounds on the maximum and minimum values of the weights. The tradeoff is that the weighted distributions of the weighting variables will deviate somewhat from the weighting targets.

The trim_weights() function is a wrapper around trimWeights from the survey packages that allows you to trim survey weights by either defining lower and upper quantiles or minimum and maximum values to cut off. Survey researchers try to strike a balance between the design effect and the weighted sample composition, but there is no exact guideline for what this balance should look like. Here’s an example of what the impact of a “5/95” trim on our demonstration weight would be on the effective sample size and on the weighted sample composition:

dec13_excerpt <- dec13_excerpt %>%
 mutate(weight2_trimmed = trim_weights(weight2, 
                                       lower_quantile = 0.05, 
                                       upper_quantile = 0.95))
 
> calculate_deff(dec13_excerpt$weight2)
# A tibble: 1 x 6
      n mean_wt sd_wt  deff   ess   moe
  <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
1  2001      1. 0.369  1.14 1761.  2.34
> calculate_deff(dec13_excerpt$weight2_trimmed)
# A tibble: 1 x 6
      n mean_wt sd_wt  deff   ess   moe
  <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
1  2001      1. 0.344  1.12 1789.  2.32
 
> get_totals("recage", dec13_excerpt, 
             wt = c("weight2", "weight2_trimmed"), 
             digits = 1)
  recage weight2 weight2_trimmed
1  18-24    12.1            12.2
2  25-34    17.4            16.4
3  35-44    16.0            16.1
4  45-54    16.4            16.6
5  55-64    16.6            16.8
6    65+    19.6            19.9
7 DK/Ref     2.0             2.0

Trimming at “5/95” increased the effective sample size from 1761 to 1789, but also made the weighted distribution of age in the sample look slightly less like the population, particularly among people ages 25 to 34. For that last bit of code, we used the pewmethods get_totals() function, which allows for flexibly computing weighted crosstabs. I discuss more about the get_totals function in a separate post about using pewmethods to explore survey data and calculate weighted estimates.

Weighting survey data with the pewmethods R package

The example dataset

Creating weighting targets from population data

Cleaning and recoding survey data

Dealing with DKs

Creating the weights

Diagnostic tools

Trimming weights

More from Decoded

Charting Congress on Social Media in the 2016 and 2020 Elections

Confronting 2016 and 2020 Polling Limitations

Q&A: After misses in 2016 and 2020, does polling need to be fixed again? What our survey experts say

A voter data resource: Detailed demographic tables about verified voters in 2016, 2018

A majority of Americans continue to favor replacing Electoral College with a nationwide popular vote