October 25, 2016

Oversampling is used to study small groups, not bias poll results

Every election year, questions arise about how polling techniques and practices might skew poll results one way or the other. In the final weeks before this year’s election, the practice of “oversampling” and its possible effect on presidential polls is in the media spotlight.

Oversampling is the practice of selecting respondents so that some groups make up a larger share of the survey sample than they do in the population. Oversampling small groups can be difficult and costly, but it allows polls to shed light on groups that would otherwise be too small to report on.

This might sound like it would make the survey unrepresentative, but pollsters correct this through weighting. With weighting, groups that were oversampled are brought back in line with their actual share of the population – removing the potential for bias.
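One way to picture the correction: each respondent in an oversampled group receives a weight equal to the group’s share of the population divided by its (larger) share of the sample. Here is a minimal sketch of that arithmetic in Python, using made-up shares; weighting in real surveys involves additional adjustments for nonresponse and survey design.

```python
# Minimal sketch of the weighting correction, with made-up shares.
# A group that is 5% of the population but 20% of the sample gets a
# weight of 0.25, shrinking its weighted footprint back to 5%.

def correction_weight(population_share, sample_share):
    """Weight that restores a group to its true population share."""
    return population_share / sample_share

print(correction_weight(0.05, 0.20))  # 0.25 for the oversampled group
print(correction_weight(0.95, 0.80))  # ~1.19 for everyone else
```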

When people think about opinion polls, they might envision taking a random sample of all adults in the U.S. in which everybody has the same chance of being selected. A sample selected this way will, on average, look just like the full population in terms of the share belonging to each group.

For example, the percentage of men and women or the share of younger and older people should fall close to their true share of the population. For the telephone surveys that Pew Research Center conducts, the process is a little more complicated (in order to account for things like cellphones and the fact that not everyone responds to surveys), but usually we want all adults to have an equal chance of being selected into the sample.

This works very well if you are interested in the overall population, but often we want to know what different kinds of people think about issues and how they compare with one another. When we are interested in learning about groups that make up only a small share of the population, the usual approach can leave us with too few people in each group to produce reliable estimates. When we want to look closely at small groups, we have to design the sample differently so that we have enough respondents in each group to analyze. We do this by giving members of the small group a higher chance of being selected than everybody else.
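In probability-sampling terms, this means assigning members of the small group a higher selection probability and keeping the inverse of that probability as a design weight for the correction described above. A sketch of the idea, with hypothetical group sizes and selection probabilities:

```python
import random

# Hypothetical sketch: oversample a small group by giving its members a
# higher chance of selection, and store 1/probability as a design weight.
# The group sizes and probabilities are invented for illustration.

population = [{"group": "small"}] * 1_500 + [{"group": "large"}] * 8_500
SELECT_PROB = {"small": 0.20, "large": 0.05}  # small group 4x as likely

sample = []
for person in population:
    p = SELECT_PROB[person["group"]]
    if random.random() < p:
        # The design weight 1/p later undoes the unequal selection odds.
        sample.append({**person, "weight": 1 / p})

n_small = sum(1 for r in sample if r["group"] == "small")
print(f"{n_small} of {len(sample)} respondents are from the small group")
```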

A good example is a Pew Research Center survey from June of this year, in which we wanted to focus in depth on the U.S. Hispanic population. In the previous survey from March, there were 291 Hispanic respondents out of 2,254 total respondents, or 13% of the sample before weighting. This is pretty close to the true Hispanic share of the population (15%), but we wanted to have more than 291 people responding so we could do a more in-depth analysis. In order to have a larger sample of Hispanics in June, we surveyed 543 Hispanics out of 2,245 total respondents, or 24% of the unweighted sample. This gave us a much larger sample to analyze, and made the estimates for Hispanics more precise.

If we just stopped here, estimates for the total population would overrepresent Hispanics. Instead, we weight them back down so that when we look at the whole sample, the share of Hispanics falls back in line with their actual share of the population. This way, we still have more precise estimates when looking at Hispanics specifically, but we also have the correct distribution when looking at the sample as a whole.
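With the June survey’s published counts, the arithmetic of that single weighting step looks roughly like this (real Pew weights fold in other adjustments as well, so these factors are a simplification):

```python
# Rough arithmetic for one weighting step, using the June survey's
# counts: 543 Hispanic respondents out of 2,245, against a 15% Hispanic
# share of the adult population. Real survey weights adjust for more.

n_hispanic, n_total = 543, 2245
pop_share = 0.15

unweighted_share = n_hispanic / n_total             # ~0.24
w_hispanic = pop_share / unweighted_share           # ~0.62, weights down
w_other = (1 - pop_share) / (1 - unweighted_share)  # ~1.12, weights up

weighted_share = (n_hispanic * w_hispanic) / (
    n_hispanic * w_hispanic + (n_total - n_hispanic) * w_other
)
print(f"Weighted Hispanic share: {weighted_share:.0%}")  # 15%
```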

Pew Research Center’s 2014 Religious Landscape Study also used oversampling in states like Wyoming so that researchers could make reliable estimates about Wyomingites’ religious beliefs and practices. Thanks to oversampling, we interviewed 316 Wyoming residents, instead of an estimated 63 under a non-oversampling design. The survey weighting adjusted for this by bringing the 0.9% of respondents from Wyoming in line with the state’s actual share of the U.S. population (0.2%).
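The payoff from 316 interviews instead of 63 is precision: the margin of error for a proportion shrinks with the square root of the sample size. A back-of-the-envelope comparison using the standard formula, with the worst case p = 0.5 and a 95% confidence level (real survey margins run somewhat larger once weighting effects are factored in):

```python
import math

# Back-of-the-envelope margin of error for an estimated proportion,
# at a 95% confidence level with the worst case p = 0.5. Margins in
# real surveys are somewhat larger once weighting effects are included.

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(f"n = 63:  about ±{margin_of_error(63):.0%}")   # roughly ±12 points
print(f"n = 316: about ±{margin_of_error(316):.0%}")  # roughly ±6 points
```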

Topics: 2016 Election, Demographics, Research Methods, Telephone Survey Methods

Andrew Mercer is a senior research methodologist at Pew Research Center.

16 Comments

  1. William H. Magill · 1 month ago

    Why, if you “wanted to focus in depth on the U.S. Hispanic population,” do you include non-Hispanics in a survey?

    Similarly, “weighting” is an arbitrary act – especially in a survey where you are attempting to “drill-down.”

  2. Anonymous · 1 month ago

    But if you are weighting the poll based upon its proportionate representation in the voting public, where is the variable for those who vote outside of the party or for the weight of a particular state in the overall national results?

    This type of polling seems to raise more questions about accuracy than it answers. And in this type of election, I would say it more resembles the kind of polling being done in 2012 that was incapable of measuring the motivation of one side of the electorate over the other. After all, Mitt was actually up in most polls, or down by only a fraction, and Obama ended up winning handily.

    I sense the same type of atmosphere this election for Trump.

  3. Anonymous · 1 month ago

    I would like to see all pollsters agree on the definition of “likely voter.”

  4. Anonymous · 1 month ago

    Using that logic, every poll should be within the same margin of error. How do you get one poll with a 14-point lead and another one with a 1-point lead?

    1. Anonymous · 1 month ago

      For one thing, sampling errors aren’t the only source of error. Different pollsters may weight data from the same raw sample differently because they’ve made different judgments about what the population as a whole looks like.

      But there’s a more basic point here: no polling result is 100% certain to be within the margin of error. A margin of error is paired with a “confidence level” that tells you (to oversimplify just a bit) how likely the polling result is to be within the margin of error with respect to the candidates’ actual support in the population. Results outside the margin of error aren’t impossible; they’re just unlikely. Averaging all polls helps account for these outliers (since they’re as likely to occur in one direction as the other) and should also help remove non-sampling sources of error, provided all pollsters aren’t wrong in the same direction.

  5. Anonymous · 1 month ago

    But political parties and presidential supporters aren’t a race or ethnicity, so how do you know how many to correct for at any given time?

  6. Anonymous · 1 month ago

    This assumes that the weighting methods are straightforward. However, oversampling on multiple different axes can turn what would otherwise be relatively simple into a non-trivial problem where our solutions only approximate a reasonable weighting.

    This could be avoided by simply running two experiments: one oversampling demographic groups and one using a pure sample. Anyone willing to risk poisoning their data in order to cut costs should not be considered a reliable source.

  7. Anonymous · 1 month ago

    Unfortunately for the network media outlets, the Podesta e-mail released by WikiLeaks showed that he (Podesta himself) advised the pollsters which subgroups to oversample in each region to maximize positive results for HRC. That is the issue here. I’m sure what the researcher says here is true, but when you oversample only certain groups to maximize the results for your candidate – that’s when it becomes a problem. That is the fact (read the Podesta e-mail) of polling by the media this election cycle…

    1. Anonymous · 1 month ago

      First, the Podesta emails were talking about internal polls.

      Second, you are missing the entire point of oversampling. You wouldn’t oversample all groups. It’s not “maximizing” results.

    2. Anonymous · 1 month ago

      That’s not how oversampling works. Read the article.

  8. Anonymous · 1 month ago

    The issue this cycle is the oversampling of Democrats compared with Republicans by 9%, not subgroups.

    1. Aaron D · 1 month ago

      Subgroups typically vote more Democratic. Also, there are more registered Democrats than registered Republicans.

      1. Anonymous · 1 month ago

        Yeah, only by 4%, not the 10-15% that most of the polls use on average. Not only that, but in the primaries Democratic turnout was down 30% while Republican turnout was up 60%.

        1. Anonymous · 1 month ago

          1. Serious question: How do you know what the difference in party identification should be in this year’s electorate? Remember that most of these polls aren’t measuring party registration, they’re measuring party ID, which is far more fluid.
          2. There’s no correlation between the general election and primary turnout. Primary turnout is mostly a function of the primaries’ competitiveness.

    2. Anonymous · 1 month ago

      Weighting based on party affiliation would defeat the purpose of a poll in the first place.

    3. Anonymous · 1 month ago

      If this is a serious fact-based challenge to the methodology, why post the comment anonymously? And how about some data to back up the assertion?