Pew Research Center releases its survey data publicly as IBM SPSS files with the .sav extension. But if you don’t have access to SPSS, there are free, open-source tools available to analyze and make use of the data.
Even with basic SPSS access, working with survey data requires additional tools or techniques to correctly handle survey weights or other complex survey design features. Analyses that fail to take these design features into account can produce biased results and overstate the precision of estimates or statistical tests. Fortunately, the tools to perform these kinds of analyses correctly are freely available with the R statistical software platform.
This post provides a quick tutorial on how to correctly analyze the Center’s survey data using R. This is the first in an occasional series of posts aimed at helping you analyze survey datasets using R.
What is R?
R is a language and environment for statistical computing and graphics. R is available as free software in source code form under the terms of the Free Software Foundation’s GNU General Public License. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS. To read more about R and how to download it, visit r-project.org.
The analysis in this post will rely on:
— R
— RStudio (an open-source code editor and interface for working in the R language)
— The following freely available R packages:
· foreign
· survey
· knitr
To install these packages, use the following code:
install.packages(c("foreign", "survey", "knitr"))
Accessing Pew Research Center data
Many Pew Research Center survey datasets are available for download by accessing the “Datasets” tab on the Center’s website. For more information about the kind of data the Center releases and how to access it, read this blog post.
Almost all of the data that’s available to download from the Center is stored as SPSS .sav files. SPSS files often contain both values and value labels — for example, 1 for Republican, 2 for Democrat.
This tutorial will use data from the Center’s April 2017 political survey, which focused on topics including Americans’ views of national institutions and their trust in government.
Loading the survey data into R
The first step to analyzing survey data in R is to read the data file into your R environment. Since the data is stored as a .sav file, you’ll want to use the read.spss() function from R’s “foreign” package. Below, we first load the package libraries and then read the data into a data.frame which we’ll call “Apr17”. By default, read.spss() retains all of the variable and value labels for the survey data, but it doesn’t automatically create a data.frame, so we have to set a parameter explicitly. Here we use to.data.frame = TRUE to load the file into our R environment as a data.frame.
library(foreign)
library(survey)
library(knitr)
Apr17 <- read.spss("Apr17 public.sav", #file path to dataset
                   to.data.frame = TRUE) #sets object to data frame
## re-encoding from CP1252
If you run this code, you will get a warning for variables that do not have labels for every category, such as age. In these instances, read.spss() adds the missing labels by default. If you want different behavior, see the add.undeclared.levels argument of read.spss().
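One practical consequence of this behavior is that a variable such as age can arrive as a factor rather than as a number. The sketch below uses a small made-up vector, not the Apr17 data, and the non-numeric label in it is only an assumption about what such a variable might contain. It shows a common way to recover numeric values from a factor of this kind.

```r
# Toy stand-in for a partially labelled variable such as age (values are made up)
age_factor <- factor(c("18", "35", "62", "Don't know/Refused (VOL.)", "35"))

# Pitfall: as.numeric() on a factor returns the internal level codes, not the ages
level_codes <- as.numeric(age_factor)

# Convert through character instead; labels that are not numbers become NA
age_numeric <- suppressWarnings(as.numeric(as.character(age_factor)))

print(level_codes)
print(age_numeric)
```

Going through as.character() first keeps the original values, and the refusals end up as NA, which most R summary functions can skip with na.rm = TRUE.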
Most of the variables in the Center’s datasets — such as sex, race and so on — are categorical. In R, these kinds of variables are called factors. You can use the table() function to see how a factor variable is distributed as follows:
table(Apr17$party)
##
## Republican Democrat
## 375 466
## Independent No preference (VOL.)
## 616 28
## Other party (VOL.) Don't know/Refused (VOL.)
## 9 7
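Raw counts are often easier to read as shares. The sketch below re-enters the counts printed above by hand so that it runs without the data file; with the real data, the same result should come from prop.table(table(Apr17$party)). These are unweighted shares of respondents, so they describe the sample rather than the public.

```r
# Counts copied from the table() output above
party_counts <- c(
  "Republican"                = 375,
  "Democrat"                  = 466,
  "Independent"               = 616,
  "No preference (VOL.)"      = 28,
  "Other party (VOL.)"        = 9,
  "Don't know/Refused (VOL.)" = 7
)

# prop.table() divides each count by the total
party_shares <- prop.table(party_counts)

# Show as percentages rounded to one decimal place
print(round(100 * party_shares, 1))
```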
Setting up a survey design
The next step in analyzing the survey data is to use the svydesign() function from R’s “survey” package to create a survey design object. This step matters because it tells R explicitly how the survey was designed, so that the survey weights and other design components are used correctly in estimation. The svydesign() function accepts many different forms of complex survey designs. For more detail, see the function’s help page in the survey package documentation.
For the majority of Pew Research Center surveys, including the April 2017 dataset used in this tutorial, users need to specify three items when declaring the survey design:
1. The cluster identifiers with ids = . Almost none of the Center’s U.S.-based surveys have cluster identifiers, so use the formula ~0 to indicate that the survey has no clusters.
2. The survey dataset with data =
3. The survey weights with weights =
Apr17_design = svydesign(
  ids = ~0,         #formula indicating there are no clusters
  data = Apr17,     #this is the dataset
  weights = ~weight #this is the 'weight' variable from the Apr17 dataset
)
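To see what the weights change, it can help to work one estimate out by hand. The sketch below uses a handful of made-up respondents and weights, not the Apr17 data, and computes a share with and without weights in base R. The weighted version (sum of weights in a category divided by the sum of all weights) is the same ratio that the survey package uses for its point estimates. The standard errors still require the design object.

```r
# Hypothetical respondents: an answer and a survey weight for each (made-up values)
toy <- data.frame(
  answer = factor(c("Approve", "Disapprove", "Approve", "Disapprove", "Disapprove")),
  weight = c(0.5, 1.5, 0.5, 2.0, 0.5)
)

# Unweighted share: every respondent counts equally
unweighted <- prop.table(table(toy$answer))

# Weighted share: sum the weights within each category, divide by the total weight
weighted <- tapply(toy$weight, toy$answer, sum) / sum(toy$weight)

print(round(unweighted, 2))
print(round(weighted, 2))
```

In this toy sample the two respondents who approve carry small weights, so their weighted share is lower than their unweighted share.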
Estimating frequencies with survey weights
After the survey design is declared, you can obtain weighted estimates by using the svymean() function. The core arguments of svymean() are the formula identifying the variable you are interested in and the survey design object.
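The svymean() function computes weighted means. Related functions in the survey package, such as svytotal(), svyvar() and svyratio(), compute totals, variances and ratios. What svymean() returns depends on the class of the variable: for a numeric variable it returns the weighted mean, and for a factor it returns the weighted proportion in each category. For example, to estimate President Donald Trump’s job approval (q1, a factor variable), use the following code: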
svymean(~q1, #variable to estimate
        design = Apr17_design #survey design object created with svydesign()
)
## mean SE
## q1Approve 0.394008 0.0144
## q1Disapprove 0.542368 0.0147
## q1Don't know/Refused (VOL.) 0.063624 0.0078
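The standard errors in this output can be turned into approximate confidence intervals. The sketch below copies two of the estimates above by hand and applies a normal approximation (estimate plus or minus about 1.96 standard errors). In practice, calling confint() on the svymean() result should give this kind of interval directly. The hand calculation is shown only to make the arithmetic visible.

```r
# Estimates and standard errors copied from the svymean() output above
est <- c(Approve = 0.394008, Disapprove = 0.542368)
se  <- c(Approve = 0.0144,   Disapprove = 0.0147)

# Normal-approximation 95% interval: estimate +/- z * SE
z  <- qnorm(0.975)
ci <- cbind(lower = est - z * se, upper = est + z * se)

# Show as percentages rounded to one decimal place
print(round(100 * ci, 1))
```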
To look at Trump’s job approval among different subgroups, you can use the svyby() function, which applies an estimation function such as svymean() to subsets of the data defined by one or more other factor variables. The kable() function from the knitr package then displays the resulting statistics in tabular form.
To estimate presidential approval among men and women, for instance, you can use this code:
q1_by_sex = svyby(~q1,                   #variable to estimate
                  ~sex,                  #subgroup variable
                  design = Apr17_design,
                  FUN = svymean,         #function to use on each subgroup
                  keep.names = FALSE     #does not include row.names for subgroup variable
)
knitr::kable(q1_by_sex, digits = 2)
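To get a feel for the shape of the svyby() result without the data file, the same call can be run on a tiny made-up dataset. Everything in the sketch below (the toy data frame, its values and its weights) is hypothetical. Only the function calls mirror the ones used above, and the sketch assumes the survey package is installed.

```r
library(survey)

# Hypothetical mini-survey: six respondents with made-up answers and weights
toy <- data.frame(
  sex    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  q1     = factor(c("Approve", "Disapprove", "Approve",
                    "Disapprove", "Disapprove", "Approve")),
  weight = c(1.5, 0.5, 1.0, 2.0, 1.0, 1.0)
)

# Same design declaration as above: no clusters, one weight variable
toy_design <- svydesign(ids = ~0, data = toy, weights = ~weight)

# Weighted share of each q1 answer within each sex
toy_by_sex <- svyby(~q1, ~sex, design = toy_design,
                    FUN = svymean, keep.names = FALSE)

print(toy_by_sex)
```

The result has one row per subgroup, one column per answer category and a matching se. column for each standard error. With only six respondents the standard errors are not meaningful; the point is only the layout.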
This post just scratches the surface of the kinds of analyses you can do in R with the survey package, but I hope it’s enough to get you started. In the future, we plan to write additional posts on survey data analysis and visualization with R. If you have questions about this post, or if there are other things with survey data and R you’d like to know how to do, let us know at info@pewresearch.org.