Validating 2016 voters in Pew Research Center’s survey data

One of the most common challenges facing election surveys is the tendency for some respondents to say they voted when they did not. This so-called overreporting can cause surveys to overstate voter turnout and create biases in the apparent composition of the electorate (this happens because some kinds of people are more likely than others to overreport voting).

Today, Pew Research Center is releasing an updated dataset that helps address this issue by matching the people who took our 2016 post-election survey with the turnout records contained in five commercial voter files. This allows researchers to verify which respondents actually voted.

This dataset is the basis for a report we issued on Aug. 9 about the trend over time in opinions about President Donald Trump among 2016 voters and the characteristics of the 2016 electorate. The dataset is available as an SPSS statistics file (.sav) and is accompanied by a ReadMe.txt file with information about the computation of the turnout variable. In this post, we’ll discuss the turnout measure in more detail.

As a reminder, Pew Research Center releases nearly all of its raw survey datasets to the public. The release is typically delayed for a period that ranges from a few months to more than a year after collection. The delay allows the Center’s staff to fully analyze and report on the data, and to clean and anonymize the files to protect respondents from the risk of being personally identified. All data for release can be found on our website, and a recent improvement in our process allows users to register for an account, after which they can download and manage datasets as often as desired.

How we created the turnout variable

To validate turnout among members of the American Trends Panel (ATP) — our nationally representative survey panel of U.S. adults — we attempted to link members to five commercial voter files. Two of the files are from nonpartisan vendors; two are from vendors that work primarily with Democratic and politically progressive clients; and one is from a vendor that works primarily with Republican and politically conservative clients.

Overall, 91% of the 3,985 active members of the ATP who took part in the post-election survey (conducted Nov. 29 to Dec. 12, 2016) and who provided a name yielded a match by at least one of the five vendors. We’ll call these individuals “matched respondents.” To estimate turnout, we used a composite estimate based on records in all five commercial voter files. Voters were defined as matched respondents who were recorded as having voted in at least one of the five commercial voter files. Nonvoters were defined as matched respondents who were listed in at least one file but had no record of voting in any files they matched, or respondents who were not matched in any of the five files. We assumed this last group were not registered voters and therefore had not voted.
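The composite rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code used to build the dataset: the function name, the vendor keys and the True/False/None encoding of each file's result are all assumptions made for the example.

```python
from typing import Mapping, Optional


def classify_turnout(records: Mapping[str, Optional[bool]]) -> str:
    """Apply the composite turnout rule to one respondent.

    `records` maps a (hypothetical) vendor key to:
      True  -> matched in that file, with a record of voting
      False -> matched in that file, with no record of voting
      None  -> not matched in that file
    """
    # Keep only the files in which the respondent was matched.
    matched = [voted for voted in records.values() if voted is not None]

    # A vote record in at least one matched file makes a voter.
    if any(matched):
        return "voter"

    # Matched with no vote record anywhere, or not matched at all
    # (assumed unregistered), both count as nonvoters.
    return "nonvoter"


# Hypothetical respondents, for illustration only.
print(classify_turnout({"file_1": None, "file_2": True}))   # voter
print(classify_turnout({"file_1": False, "file_2": None}))  # nonvoter
print(classify_turnout({"file_1": None, "file_2": None}))   # nonvoter
```

Note that a single vote record is enough to classify someone as a voter, even if the other files they matched show no record of voting.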

Using this approach, the voter file-verified turnout rate among the panelists was 65%, or about 5 percentage points higher than the best estimate of national turnout among eligible adults. This difference is likely the result of the fact that surveys like this one tend to overrepresent politically engaged individuals.

For additional details about the voter file matching and voter verification process, see Pew Research Center’s March 2018 report on commercial voter files.

The variables of interest

The updated dataset includes a new variable, VALIDATED_VOTER_2016_W23, coded 0 for nonvoters and 1 for validated voters. Noncitizens were excluded when computing the variable (they are system-missing, or sysmis, in SPSS) since they are not eligible to vote in federal elections. The dataset also includes a variable named COMPORT_W23, which has four categories, each corresponding to a combination of VALIDATED_VOTER_2016_W23 and the self-reported voter turnout question, VOTED_W23. The four categories of COMPORT_W23 are:

1 = Validated voters who said they voted
2 = Nonvoters who said they did not vote or were not sure
3 = Overreporters (nonvoters who said they voted)
4 = Underreporters (validated voters who said they did not vote or were not sure)
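As a rough illustration of how the four categories combine the two inputs, here is a small Python sketch. The function name and the boolean `said_voted` flag are assumptions for the example (the actual response codes of VOTED_W23 are not reproduced here), and returning None for noncitizens is also an assumption based on their exclusion from VALIDATED_VOTER_2016_W23.

```python
from typing import Optional


def comport_category(validated_voter: Optional[int], said_voted: bool) -> Optional[int]:
    """Combine validated turnout (0/1) with self-reported turnout.

    `said_voted` is True if the respondent reported voting and False if
    they said they did not vote or were not sure (hypothetical encoding).
    """
    if validated_voter is None:
        # Assumed: noncitizens have no validated turnout value, so no category.
        return None
    if validated_voter == 1:
        # 1 = validated voter who said they voted, 4 = underreporter.
        return 1 if said_voted else 4
    # 3 = overreporter, 2 = nonvoter who did not claim to have voted.
    return 3 if said_voted else 2


# Print the full cross-classification.
for validated in (1, 0):
    for said in (True, False):
        print(validated, said, comport_category(validated, said))
```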

In order to replicate the analyses in the report, it is necessary to code the candidate preference variable for voters exactly as we did. Syntax for doing so is provided in the ReadMe.txt file. Candidate preferences for voters are based on respondents who said they voted for Donald Trump, Hillary Clinton, Gary Johnson or Jill Stein. Those who said they voted for another candidate, who could not recall who they voted for or refused to say who they voted for are excluded from the tabulation. The SPSS syntax is as follows:

*FOR VALIDATED VOTERS.
do if COMPORT_W23=1.
compute candprefvoter= votegenpost_w23.
missing values candprefvoter (5,99).
end if.
value labels candprefvoter 1 'Trump' 2 'Clinton' 3 'Johnson' 4 'Stein' 5 'Other' 99 'DK, Refused'.
var labels candprefvoter '2016 vote among validated voters'.

To replicate the profile of nonvoters, restrict the analysis to respondents with COMPORT_W23 equal to 2, 3 or 4 (that is, everyone except validated voters who said they voted).
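For readers who work outside SPSS, the same recoding and the nonvoter restriction might look roughly like this in Python. This is a sketch rather than a replacement for the ReadMe.txt syntax: the function names are made up for the example, the respondents shown are invented, and the sketch ignores survey weights, which a real replication would need to apply.

```python
from typing import Optional

# Codes 1-4 are kept; 5 ("Other") and 99 ("DK, Refused") are treated as missing.
CANDIDATE_LABELS = {1: "Trump", 2: "Clinton", 3: "Johnson", 4: "Stein"}


def candprefvoter(comport: Optional[int], votegenpost: Optional[int]) -> Optional[str]:
    """Candidate preference among validated voters who said they voted."""
    if comport != 1:
        return None  # only COMPORT_W23 = 1 gets a candidate preference
    return CANDIDATE_LABELS.get(votegenpost)  # None for 5, 99 or missing


def in_nonvoter_profile(comport: Optional[int]) -> bool:
    """True for respondents included in the nonvoter profile (2, 3 or 4)."""
    return comport in (2, 3, 4)


# Invented respondents, for illustration only.
rows = [
    {"COMPORT_W23": 1, "votegenpost_w23": 2},
    {"COMPORT_W23": 1, "votegenpost_w23": 99},
    {"COMPORT_W23": 3, "votegenpost_w23": 1},
    {"COMPORT_W23": 2, "votegenpost_w23": None},
]
for row in rows:
    pref = candprefvoter(row["COMPORT_W23"], row["votegenpost_w23"])
    print(row["COMPORT_W23"], pref, in_nonvoter_profile(row["COMPORT_W23"]))
```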

What you can do with the data

The availability of the validated turnout variable opens the door to many further analyses. One of the most obvious is comparing overreporters with people who accurately reported that they did not vote. It also makes comparisons of voters and nonvoters much more accurate: when turnout is based on self-reports alone, overreporters are counted as voters, and their absence from the nonvoter group distorts the profile of nonvoters.

While our published report this month includes a large number of tabulations among validated voters, many more demographic, political and lifestyle variables are available in this panel wave and in other waves. Among many other topics, the waves conducted near the election included questions about social media, guns, the police, online harassment and feelings about religious groups.

One request: If you happen to use this data, please consider sharing your findings with us. We are eager to see what further knowledge arises from this effort, and we may add updates to this piece to share what others have done.
