January 26, 2018

For Weighting Online Opt-In Samples, What Matters Most?

Appendix C: Adjustment procedures

Raking

Raked weights were created using the marginal distributions of the adjustment variables as derived from the synthetic population dataset, along with all two-way interactions of collapsed versions of the demographic variables. For the interactions, the 18-24 and 25-34 age categories were combined, the less than high school and high school graduate categories were combined, race and Hispanic ethnicity were collapsed into white vs. nonwhite, and census region was used instead of census division. This collapsing was done to avoid adjustment cells with low counts and to reduce the chance that a subsample would yield a cell with no observations.
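As an illustration of what raking does, the following is a minimal sketch of iterative proportional fitting in plain Python. It is not the procedure used in the study: the function names (rake, weighted_share), the toy cases and the target shares are all hypothetical. Each pass rescales the weights so that one variable's weighted distribution matches its target, then moves on to the next variable, repeating until the adjustments become negligible. A two-way interaction can be handled the same way by treating the crossed categories as one more variable.

```python
def rake(rows, targets, n_iter=50, tol=1e-10):
    """Rake unit weights to target marginal shares.

    rows:    list of dicts mapping variable name -> category (one per case).
    targets: dict mapping variable name -> {category: target share}.
    Returns a list of weights that sums to len(rows).
    """
    n = len(rows)
    weights = [1.0] * n
    for _ in range(n_iter):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total in each category of this variable.
            totals = {}
            for row, w in zip(rows, weights):
                totals[row[var]] = totals.get(row[var], 0.0) + w
            # Scale every case so the category totals hit their targets.
            for i, row in enumerate(rows):
                factor = shares[row[var]] * n / totals[row[var]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:  # all margins already match
            break
    return weights


def weighted_share(rows, weights, var, category):
    """Weighted share of cases falling in one category of a variable."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == category)
    return hit / sum(weights)


# Hypothetical toy sample: six cases, two adjustment variables.
rows = (
    [{"sex": "m", "age": "young"}] * 3
    + [{"sex": "m", "age": "old"}]
    + [{"sex": "f", "age": "young"}]
    + [{"sex": "f", "age": "old"}]
)
# Hypothetical population margins.
targets = {
    "sex": {"m": 0.5, "f": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = rake(rows, targets)
```

The sketch assumes every category named in the targets appears at least once in the sample. An empty cell would make the scaling factor undefined, which is the situation the collapsed categories are meant to avoid.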

The calibrate function in the survey package in R [29] was used for raking. When raking was used as the final step in a combination procedure, such as matching followed by propensity weighting followed by raking, the interim weights were trimmed at the 5th and 95th percentiles. No trimming was applied to the final weights.
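Trimming at percentiles can be sketched as follows. This is an illustration only: the function name trim_weights is hypothetical, and the percentile definition (linear interpolation, the "inclusive" method in Python's statistics module) is an assumption, since the text does not say how the percentiles were computed or whether the weights were rescaled after trimming.

```python
from statistics import quantiles


def trim_weights(weights, lower_pct=5, upper_pct=95):
    """Clip weights to the [lower_pct, upper_pct] percentile range."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles,
    # so index k - 1 holds the k-th percentile.
    cuts = quantiles(weights, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(w, low), high) for w in weights]


# Hypothetical interim weights: 1.0, 2.0, ..., 101.0
interim = [float(w) for w in range(1, 102)]
trimmed = trim_weights(interim)
```

Weights below the 5th percentile are raised to it and weights above the 95th percentile are lowered to it, while all other weights are left unchanged.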

Random forest

Both the matching and propensity adjustments used in this study were carried out using a statistical approach called random forest. Random forest models belong to a more general set of machine-learning models called classification and regression trees. These models work by partitioning the data into smaller and smaller subsets, called “nodes.” The partitions resemble a tree-like structure, hence the name. The further down the tree, the more the observations within a node agree with one another on the outcome measure.

The covariates fed into the model become the basis on which the data are split into nodes. For instance, one split early on may divide the data into a node for the male cases and another node for the female cases. Each of those nodes may then be split further on some other covariate. The tree is considered fully grown when either all observations within every node agree on the outcome, or when any further splitting would bring the number of observations in a node below a user-defined minimum size. The nodes at the end of the tree are called terminal nodes.

In random forest models, numerous trees are grown, with each tree being fit on a bootstrapped sample of the data, and with each tree’s partitions being determined using only a subset of the covariates in the full model. Predicted probabilities and proximity measures are then calculated by averaging across all the trees.
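The ideas above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation used in the study: all names (gini, grow_tree, grow_forest and so on) and the example data are hypothetical, splits are chosen by Gini impurity, covariates are categorical, and the covariate subset is drawn once per tree as described in the text (production implementations typically redraw the subset at every split).

```python
import random


def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def grow_tree(rows, labels, covariates, min_node_size):
    """Recursively partition the cases; returns a nested dict."""
    best = None
    if len(set(labels)) > 1:  # stop once every case in the node agrees
        for var in covariates:
            for value in sorted(set(row[var] for row in rows)):
                left = [i for i, r in enumerate(rows) if r[var] == value]
                right = [i for i, r in enumerate(rows) if r[var] != value]
                # Reject splits that leave a node below the minimum size.
                if min(len(left), len(right)) < max(min_node_size, 1):
                    continue
                score = sum(len(side) * gini([labels[i] for i in side])
                            for side in (left, right))
                if best is None or score < best[0]:
                    best = (score, var, value, left, right)
    if best is None:  # terminal node: store the share of 1s
        return {"leaf": True, "prob": sum(labels) / len(labels)}
    _, var, value, left, right = best
    return {
        "leaf": False, "var": var, "value": value,
        "yes": grow_tree([rows[i] for i in left],
                         [labels[i] for i in left], covariates, min_node_size),
        "no": grow_tree([rows[i] for i in right],
                        [labels[i] for i in right], covariates, min_node_size),
    }


def predict_tree(tree, row):
    """Walk down to a terminal node and return its predicted probability."""
    while not tree["leaf"]:
        tree = tree["yes"] if row[tree["var"]] == tree["value"] else tree["no"]
    return tree["prob"]


def grow_forest(rows, labels, n_trees, n_covariates, min_node_size, seed=0):
    """Fit each tree on a bootstrap sample using a random covariate subset."""
    rng = random.Random(seed)
    all_vars = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        subset = rng.sample(all_vars, n_covariates)     # covariate subset
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                subset, min_node_size))
    return forest


def predict_forest(forest, row):
    """Average the predicted probabilities across all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Hypothetical data: the outcome depends on sex only.
rows = ([{"sex": "m", "region": "north"}] * 10
        + [{"sex": "m", "region": "south"}] * 10
        + [{"sex": "f", "region": "north"}] * 10
        + [{"sex": "f", "region": "south"}] * 10)
labels = [1] * 20 + [0] * 20

single_tree = grow_tree(rows, labels, ["sex", "region"], min_node_size=5)
forest = grow_forest(rows, labels, n_trees=25, n_covariates=1, min_node_size=5)
```

On this toy data the single tree splits once on sex and stops, because both resulting nodes are pure. In the forest, trees that only see the region covariate cannot separate the outcome, which is why the averaged probabilities are less extreme than those of the single tree.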

For this study, random forest models were fit in R using the ranger package [30]. All models used 1,000 trees and had a minimum node size of 100.

Propensity weighting

The online opt-in sample and the full synthetic population dataset were combined and a new binary variable was created with a value of 1 if the case came from the synthetic dataset, and zero otherwise. A random forest model was then fit with the binary variable as the outcome and the adjustment variables as the covariates. The model then returned a predicted probability p that each case in the combined dataset came from the synthetic dataset. The quantity 1 – p is then the predicted probability that the case came from the online opt-in sample. For each case in the online opt-in sample, the propensity weight was then calculated as p / (1 – p).

The resulting weights were rescaled to sum to the size of the online opt-in sample.
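The weight calculation and rescaling can be sketched as follows. The function name propensity_weights and the example probabilities are hypothetical, and the sketch assumes the predicted probabilities for the opt-in cases have already been obtained from a fitted model.

```python
def propensity_weights(p_synthetic):
    """Turn predicted probabilities into rescaled propensity weights.

    p_synthetic: for each opt-in case, the predicted probability p that
    the case came from the synthetic population dataset.
    """
    # Odds of coming from the synthetic dataset rather than the opt-in sample.
    raw = [p / (1.0 - p) for p in p_synthetic]
    # Rescale so the weights sum to the opt-in sample size.
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical predicted probabilities for three opt-in cases.
example_p = [0.2, 0.5, 0.8]
example_weights = propensity_weights(example_p)
```

Cases that look more like the synthetic population (larger p) receive larger weights. Here the raw odds are 0.25, 1 and 4 before rescaling. A predicted probability of exactly 1 would make the weight undefined, so a real implementation would need to guard against that case.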

Matching

For matching, the online opt-in sample was combined with a target sample of 1,500 cases that were randomly selected from the synthetic population. A random forest model was then used to predict whether or not each case belonged to the target sample based on the adjustment variables. The models used 1,000 trees and had a minimum node size of 100.

Once the model was fit, the “distance” between each case in the target sample and all of the cases in the survey sample was calculated. For a given tree, cases that are similar to one another end up in the same terminal node. The random forest proximity between any two cases is simply the number of trees in which they were placed in the same node divided by the total number of trees used in the model. For example, if a particular pair of cases ended up in the same terminal node in 300 trees and in different terminal nodes in the 700 other trees, then the random forest proximity for that pair would be 0.3. A proximity close to 1 means the cases are very similar to one another, while a proximity close to zero means they are very different. The RcppEigen package [31] was used to speed up calculation.
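The proximity calculation can be illustrated with a short sketch. It assumes that, for each case, the ID of the terminal node it reached in every tree is already available as a list; the function name proximity and the node IDs are hypothetical.

```python
def proximity(nodes_a, nodes_b):
    """Share of trees in which two cases land in the same terminal node.

    nodes_a, nodes_b: terminal node IDs for each case, one entry per tree,
    listed in the same tree order.
    """
    same = sum(1 for a, b in zip(nodes_a, nodes_b) if a == b)
    return same / len(nodes_a)


# Reproduces the example in the text: the pair shares a terminal node
# in 300 of 1,000 trees and is separated in the other 700.
case_a = [0] * 1000
case_b = [0] * 300 + [1] * 700
example_proximity = proximity(case_a, case_b)
```

Computing this for every pairing of a target case with a survey case yields a proximity matrix with one row per target case and one column per survey case.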

After the random forest proximity for each pair was calculated, both the synthetic dataset and the online opt-in sample were sorted in random order. The final matched sample was selected by sequentially matching each of the 1,500 cases in the target sample to the case from the online opt-in sample with which it had the largest random forest proximity, with ties being broken randomly. Matched cases were given weights of 1, while unmatched cases were given weights of zero.
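The sequential matching step can be sketched as below. The function names (match_samples, match_weights) and the small proximity matrix are hypothetical. The sketch also assumes that each opt-in case can be matched at most once, which the text implies through the random ordering and the 0/1 weights but does not state outright.

```python
import random


def match_samples(prox, seed=0):
    """Greedy sequential matching on a proximity matrix.

    prox[t][s] is the proximity between target case t and survey case s.
    Returns a dict mapping each target index to its matched survey index.
    """
    rng = random.Random(seed)
    order = list(range(len(prox)))
    rng.shuffle(order)                    # process target cases in random order
    available = set(range(len(prox[0])))  # survey cases not yet matched
    matches = {}
    for t in order:
        if not available:
            break
        best = max(prox[t][s] for s in available)
        # Collect all survey cases tied for the largest proximity.
        tied = sorted(s for s in available if prox[t][s] == best)
        choice = rng.choice(tied)         # break ties at random
        matches[t] = choice
        available.remove(choice)          # assumption: no re-use of a case
    return matches


def match_weights(matches, n_survey):
    """Weight of 1 for matched survey cases and 0 for unmatched ones."""
    matched = set(matches.values())
    return [1 if s in matched else 0 for s in range(n_survey)]


# Hypothetical proximities: two target cases, three survey cases.
example_prox = [
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.3],
]
example_matches = match_samples(example_prox)
example_match_weights = match_weights(example_matches, n_survey=3)
```

Because the matching is greedy, a target case processed late may have to settle for a less similar survey case if its closest candidate has already been taken, which is why the processing order is randomized.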

  29. Thomas Lumley. 2017. “survey: Analysis of Complex Survey Samples.” R package version 3.32.
  30. Marvin N. Wright and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77(1), 1-17.
  31. Douglas Bates and Dirk Eddelbuettel. 2013. “Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package.” Journal of Statistical Software 52(5), 1-24.