February 23, 2010

Details About The Statistical Analysis Behind The Quiz

The Millennials

This is part of a Pew Research Center series of reports exploring the behaviors, values and opinions of the teens and twenty-somethings that make up the Millennial Generation.

Your Millennial score is your predicted probability of being in the Millennial age group (currently ages 18-29). A score of 51 or higher means the chances are better than 50-50 that in your values, attitudes and behaviors you resemble the typical Millennial. Data from the Pew Research Center’s 2010 Millennial Survey were subjected to a multistage statistical analysis to identify the best predictors of being a Millennial, and then to find the best way to combine the questions so as to achieve the most accurate prediction.

Even before the survey was conducted, we reviewed a great deal of research about generational differences, as well as a wide range of previous Pew Research Center surveys to identify questions that are strongly correlated with being a member of the Millennial generation. The best of these questions were included on the new survey.

The goal of the first stage of the analysis was to identify a small set of survey questions that would, in combination, provide an accurate prediction of which respondents in the survey were members of the Millennial generation and which ones were not. In essence, we sought to find the questions in the survey that are most strongly correlated with being a Millennial, and not a member of another age cohort.

It was also important to have questions that covered a broad range of traits, attitudes and behaviors, since the Millennial generation is distinct in multiple ways. For this reason, the first step in the statistical analysis was a search through a very broad set of questions (approximately 30 items in the survey were tested) using a statistical procedure called “stepwise regression.” The analysis first located the question that had the strongest correlation with being a Millennial, then looked for another question that provided additional predictive power, then a third question, and so forth, until no more questions could be identified that significantly improved the overall prediction of being a Millennial. We identified most of the 14 questions in the final quiz using this process, but also substituted a few questions to help broaden the overall content of the quiz.
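
The search described above can be illustrated with a small sketch. It is not the procedure actually run on the survey data: it assumes a simple linear model, uses a fixed gain in explained variance (`min_gain`, a made-up threshold) in place of a formal significance test, and works on invented yes/no answers. The question names (`q_profile`, `q_texting`, `q_other`) are hypothetical.

```python
def center(values):
    """Subtract the mean so that dot products behave like covariances."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(vector, basis):
    """Remove from vector its projection onto each (mutually orthogonal) basis vector."""
    for b in basis:
        coef = dot(vector, b) / dot(b, b)
        vector = [v - coef * x for v, x in zip(vector, b)]
    return vector


def forward_select(questions, is_millennial, min_gain=0.05):
    """Greedy forward selection: repeatedly add the question that explains the
    largest additional share of variance, and stop when the best gain is small."""
    target = center(is_millennial)
    total = dot(target, target)   # total variation in the outcome
    residual = target             # part of the outcome not yet explained
    remaining = {name: center(col) for name, col in questions.items()}
    basis, chosen = [], []
    while remaining:
        best = None  # (gain, name, residualized column)
        for name, col in remaining.items():
            r = residualize(col, basis)  # what this question adds beyond those chosen
            norm = dot(r, r)
            if norm < 1e-12:             # fully redundant with chosen questions
                continue
            gain = dot(residual, r) ** 2 / norm / total  # increase in R-squared
            if best is None or gain > best[0]:
                best = (gain, name, r)
        if best is None or best[0] < min_gain:
            break
        _, name, r = best
        chosen.append(name)
        basis.append(r)
        residual = residualize(residual, [r])
        del remaining[name]
    return chosen


# Invented answers from eight hypothetical respondents (1 = yes, 0 = no).
questions = {
    "q_profile": [1, 1, 1, 1, 0, 0, 0, 1],
    "q_texting": [1, 1, 0, 1, 0, 0, 1, 0],
    "q_other":   [1, 0, 1, 0, 1, 0, 1, 0],
}
is_millennial = [1, 1, 1, 1, 0, 0, 0, 0]
selected = forward_select(questions, is_millennial)
```

With these invented data the sketch keeps `q_profile` first and `q_texting` second, and stops before adding `q_other` because its extra contribution falls below the threshold.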

The second step was the use of logistic regression, a technique employed to estimate the independent contribution of multiple factors in predicting a particular outcome – in this case, being a member of the Millennial generation. For each respondent in the survey, this procedure produces a predicted probability of being a Millennial. The goal of the analysis is to find the optimal combination of the selected survey questions that most accurately classifies each respondent as Millennial or not — that is, the combination that tends to assign a higher probability to the Millennials in the survey and a lower probability to everyone else. Because the statistical procedure “knows” whether each respondent is, in fact, a member of the Millennial generation (the actual age of the respondent was collected in the interview), it can assess its predictions against the reality of whether the respondent is actually a Millennial or not, and adjust the combination accordingly.
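
As a rough illustration of how such a procedure can "adjust the combination" against known outcomes, the sketch below fits a logistic regression by plain gradient ascent. This is one common way to fit the model and is not necessarily how the survey analysis was computed. The two yes/no questions, the twelve respondents, and the settings `steps` and `learning_rate` are all invented for the example.

```python
import math


def predict(weights, answers):
    """Predicted probability of being a Millennial; weights[0] is the constant term."""
    z = weights[0] + sum(w * a for w, a in zip(weights[1:], answers))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, steps=2000, learning_rate=0.5):
    """Start from zero weights and repeatedly nudge them in the direction that
    makes the predictions agree better with the known outcomes."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for answers, label in zip(rows, labels):
            error = label - predict(weights, answers)  # reality minus prediction
            gradient[0] += error
            for j, a in enumerate(answers):
                gradient[j + 1] += error * a
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Invented data: each row is (answer to question A, answer to question B).
rows = [(1, 1), (1, 1), (1, 1), (1, 0), (1, 0), (1, 0),
        (0, 1), (0, 1), (0, 1), (0, 0), (0, 0), (0, 0)]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1]  # 1 = actually a Millennial

weights = fit_logistic(rows, labels)
p_yes_on_a = predict(weights, (1, 0))
p_no_on_a = predict(weights, (0, 0))
```

In this toy data set two thirds of those who answer yes to question A are Millennials, against one third of those who answer no, and question B carries no information. The fitted probabilities settle near those proportions, and the weight on question B ends up close to zero.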

The result of the logistic regression analysis is a set of coefficients, or relative weights, one for each question in the quiz. This allows different questions to have a different impact on the overall score, depending on how strongly each item is related to being a Millennial. And indeed, some items were particularly powerful predictors (e.g., having a social networking profile), while others made a more modest contribution (e.g., political ideology). When you took the quiz just now, your answers were weighted by these coefficients to produce your overall score. The math is straightforward: the answer to each question was multiplied by the coefficient for that question, the products were summed (along with a constant term), and the sum was then converted to a probability, which is your Millennial score (see note 1).
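
The arithmetic in the paragraph above can be written out as a short sketch. The coefficients and constant below are invented for illustration and are not the quiz's actual weights. The sketch also assumes that the score shown to quiz takers is the probability multiplied by 100 and rounded, which is consistent with "a score of 51 or higher" but is not spelled out in the text.

```python
import math


def millennial_score(answers, coefficients, constant):
    """Multiply each answer by its coefficient, add the constant term,
    then convert the sum to a probability on a 0-100 scale."""
    total = constant + sum(coefficients[q] * a for q, a in answers.items())
    probability = math.exp(total) / (1.0 + math.exp(total))
    return round(100 * probability)


# Hypothetical weights: a strong predictor and a modest one.
coefficients = {"social_networking_profile": 1.5, "political_ideology": 0.3}
constant = -1.0

answers = {"social_networking_profile": 1, "political_ideology": 1}
score = millennial_score(answers, coefficients, constant)  # sum is 0.8, score is 69
```

A sum of exactly zero converts to a probability of one half, so a respondent whose weighted answers cancel out the constant term would land at a score of 50.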

The quiz was created by Pew Research staff members Leah Melani Christian, Russell Heimlich, Michael Keegan, Scott Keeter, Alicia Parlapiano, Michael Piccorossi and Paul Taylor, and consultant Courtney Kennedy.

Below is a “box plot.” Each box shows where the middle 50% of the respondents in each cohort scored. The line in the middle of the box is the median; the high end of the box is the 75th percentile, and the low end is the 25th percentile. The “whiskers” (the lines sticking out of the boxes) show the range of nearly all other respondents in that cohort, except for the outliers. The points outside the whiskers represent the outliers (scores that lie beyond either end of the box by more than 1.5 times the range between the 25th and 75th percentiles).
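
For readers who want to see how the pieces of a box plot are derived, here is a minimal sketch using invented scores. It assumes the common convention that a point is an outlier when it lies more than 1.5 times the box height beyond either end of the box, and it uses one of several accepted ways of computing percentiles (the "inclusive" method in Python's standard library); the plot itself may have been drawn with slightly different rules.

```python
import statistics


def box_plot_summary(scores):
    """Return the box edges, median, whisker ends and outliers for a list of scores."""
    ordered = sorted(scores)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    box_height = q3 - q1                  # range between 25th and 75th percentiles
    low_fence = q1 - 1.5 * box_height
    high_fence = q3 + 1.5 * box_height
    inside = [s for s in ordered if low_fence <= s <= high_fence]
    outliers = [s for s in ordered if s < low_fence or s > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # lowest score that is not an outlier
        "whisker_high": inside[-1],  # highest score that is not an outlier
        "outliers": outliers,
    }


# Invented scores for one hypothetical cohort.
summary = box_plot_summary([10, 20, 25, 30, 35, 40, 45, 95])
```

In this invented cohort the score of 95 sits far above the rest and is reported as an outlier, so the upper whisker stops at 45.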

Finally, here are some histograms showing how many members of each generation scored at particular points along the scale:


1. The formula for this conversion is =EXP(SUM)/(1+EXP(SUM)); expressed in words, it is the value of e raised to the power of the sum, divided by 1 plus the value of e raised to the power of the sum.