Contending with the 80/20 rule when studying online behavior

Much of the work by the Data Labs team at Pew Research Center examines online behavior, especially on social media platforms. In recent years, we’ve matched (with their permission) more than 1,000 members of our American Trends Panel to their Twitter handles, examined the Facebook and Twitter presences of every member of Congress and built a database of every YouTube channel with at least 250,000 subscribers. These projects and others are aimed at helping us understand how people use social media and gain insights into the content on these platforms.

One issue we often have to contend with in these analyses is the Pareto principle, which is also called the “80/20 rule.” The Pareto principle holds that in many systems, a minority of cases produce the majority of outcomes.

This relationship can manifest in many different arenas. For example, we’ve found that the most active 10% of Twitter users produce 80% of all tweets from U.S. adults; that the 10% of U.S. congressional lawmakers with the most Facebook and Twitter followers receive roughly 80% of all audience engagement; and that the most popular 10% of YouTube channels in our database produce 70% of all the videos from that group.
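To make this kind of concentration concrete, here is a minimal Python sketch of how one might compute the share of activity coming from the most active accounts. The function name `top_share` and the tweet counts are hypothetical and chosen only for illustration; they are not the Center's code or data.

```python
def top_share(counts, top_fraction=0.10):
    """Share of the total produced by the most active `top_fraction` of accounts."""
    if not counts:
        raise ValueError("counts must not be empty")
    ranked = sorted(counts, reverse=True)
    # Always include at least one account in the "top" group.
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical tweet counts for 10 accounts (illustrative only).
tweet_counts = [80, 5, 4, 3, 2, 2, 2, 1, 1, 0]
share = top_share(tweet_counts)
print(f"Top 10% of accounts produce {share:.0%} of tweets")
```

In this made-up sample, a single account out of 10 accounts for 80 of the 100 tweets, which mirrors the pattern described above.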

None of this is inherently problematic. But depending on our method of data collection, it does place limits on the behaviors we’re able to study. Here are a few questions related to the 80/20 rule that we often work through when studying social media.

Do we use the mean or the median when examining a behavior? Because many online outcomes (such as posting volume or engagement) are concentrated among a small subset of accounts, the mean value of a given behavior is often larger than the median value for that same behavior.

Here’s a real-world example of how this might play out. Let’s say we’re trying to figure out whether Republican or Democratic lawmakers have more followers across their accounts on Facebook and Twitter. If we simply run a numerical average for those two groups among members of the 116th Congress, it would appear that Democrats on average have nearly 480,000 followers across these two platforms, while Republicans have just over 260,000 — a difference of more than 218,000 followers per member.

But if we take a closer look at these two groups, we see that only five members of the House or Senate in the previous Congress had more than 10 million followers across their various social accounts. And four of those five outliers (Bernie Sanders, Kamala Harris, Alexandria Ocasio-Cortez and Elizabeth Warren) are Democrats or caucus with the party. Sure enough, if we use the median instead of the mean, we can see that the typical Democratic legislator has just over 61,000 followers while the typical Republican has just over 50,000. That’s still a difference between the two parties, but at roughly 11,000 followers it is much smaller than the gap of more than 218,000 we saw when we ran this as a simple numerical average.

This example indicates that outside of a small number of hugely popular accounts, the typical Republican lawmaker has a following that’s much closer to that of the typical Democrat than we might have gathered simply from looking at the means. And because we are often trying to measure the experiences of typical users, in our reports we usually report on medians rather than means.
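A small Python sketch can show how a single very large account pulls the mean away from the median. The follower counts below are invented for illustration and are not the actual congressional data; `group_a` and `group_b` are hypothetical names.

```python
from statistics import mean, median

# Hypothetical follower counts for two groups of five accounts each.
# group_a contains one outlier with 12 million followers.
group_a = [40_000, 55_000, 60_000, 65_000, 12_000_000]
group_b = [35_000, 45_000, 50_000, 55_000, 70_000]

mean_gap = mean(group_a) - mean(group_b)
median_gap = median(group_a) - median(group_b)

print(f"Gap between means:   {mean_gap:,.0f}")
print(f"Gap between medians: {median_gap:,.0f}")
```

Here the gap between the means is in the millions, while the gap between the medians is only 10,000, because the median ignores how large the one outlier is.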

When looking at the prevalence of something on social media, do we measure it by counting posts or counting people? In our reports, we often present our findings by focusing on people — for example, the share of U.S. adults on Twitter who have posted about a given topic. That’s partly because the Center has its roots in traditional public opinion research, and we are especially interested in measuring what people say and do. But there’s also a practical reason that has to do with the 80/20 rule.

Here’s how this might play out. Let’s say we have a collection of 1,000 Twitter accounts belonging to U.S. adults, and 999 of them have tweeted exactly once in the time period we’re interested in. But one person is extremely active — and maybe even a bit obsessed with one topic. That person has posted 1,000 tweets all on their own during that time, and every one of them is about the Philadelphia Eagles football team. (This scenario is fictional, although Eagles fans are indeed known for their passionate and outspoken views.)

If we just counted up all the tweets in the sample and examined what they were about, we’d conclude that America is obsessed with the Philadelphia Eagles. After all, just over half of all tweets in our sample (1,000 of 1,999) are about that one subject! But in reality, only 0.1% of the people in our sample have mentioned the Eagles at all. By presenting our findings based on users rather than posts, we can more accurately depict the experiences of any given individual in the population and balance out the influence of extremely active accounts.
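The fictional scenario above can be reproduced in a few lines of Python. This is only a sketch: the user names and the "other" topic label are assumptions made for illustration, since the example does not say what the other 999 accounts tweeted about.

```python
# 999 accounts tweet once each (assumed here to be about some other topic),
# and one account tweets 1,000 times about the Eagles.
tweets = [(f"user_{i}", "other") for i in range(999)]
tweets += [("superfan", "eagles")] * 1000

# Post-level view: what share of all tweets mention the Eagles?
eagles_authors = [user for user, topic in tweets if topic == "eagles"]
share_of_tweets = len(eagles_authors) / len(tweets)

# Person-level view: what share of accounts ever mention the Eagles?
all_users = {user for user, _ in tweets}
share_of_users = len(set(eagles_authors)) / len(all_users)

print(f"Share of tweets about the Eagles: {share_of_tweets:.1%}")
print(f"Share of users who tweeted about the Eagles: {share_of_users:.1%}")
```

Counting by tweets gives about 50%, while counting by people gives 0.1%, which is why the unit of analysis matters so much when a few accounts are extremely active.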

Do we have an adequate sample size to identify specific behaviors or go deep into particular groups? This issue is especially relevant to our work with U.S. adults on Twitter, where we have recruited a representative sample of users to stand in for a broader population. Because a small share of users produce most tweets, the bulk of the people we sampled tweet extremely rarely, if ever. In fact, the median American on Twitter posts just one tweet per month. That means we simply don’t have a lot of usable data about these individuals’ posting behaviors.

Here’s how this can limit our ability to analyze particular topics. Let’s say we want to do a deep-dive analysis of the demographics and attitudes of people who posted the #BlackLivesMatter hashtag on Twitter. This is one of the most popular hashtags in history and was used nearly 50 million times on all of Twitter in just two weeks in 2020, shortly after the police killing of George Floyd in Minneapolis.

But although we can easily identify public tweets containing that hashtag using the Twitter API, we know very little about the opinions, demographics or other personal characteristics of the people sharing them. That is something our survey panel of U.S. adults with matched Twitter handles could help us understand.

Using that sample, we have estimated that 3% of U.S. adults on Twitter used the #BlackLivesMatter hashtag between November 2019 and September 2020. But unfortunately, that 3% figure works out to just 148 actual respondents from our survey panel. And once we account for the design effect of our survey, that’s an effective sample size of just 77 adults, which is simply not large enough to conduct a robust analysis of that group as a stand-alone entity. And notably, this is for a relatively common behavior. Many online actions — whether tweeting about a particular topic, following a particular account or posting a link to a particular article — are going to be too rare for us to measure at all with any degree of specificity.
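As a rough illustration of the arithmetic, the two figures above imply a design effect of about 1.9, since the effective sample size is the actual sample size divided by the design effect. The Python sketch below shows that calculation, along with Kish's approximation, which is one common way to estimate an effective sample size from survey weights. The post does not say how the design effect was actually computed, so the use of Kish's formula here is an assumption, and the weights are hypothetical.

```python
def kish_effective_n(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("weights must not all be zero")
    return total * total / sum_sq


# Figures from the text: 148 respondents, effective sample size of 77.
n_respondents = 148
n_effective = 77
implied_deff = n_respondents / n_effective
print(f"Implied design effect: {implied_deff:.2f}")

# Hypothetical weights (not the panel's real weights): unequal weights
# shrink the effective sample size below the number of respondents.
weights = [1.0] * 100 + [3.0] * 48
print(f"Kish effective n for hypothetical weights: {kish_effective_n(weights):.1f}")
```

When every respondent has the same weight, Kish's formula returns the actual sample size. The more the weights vary, the smaller the effective sample becomes.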

The issues mentioned earlier in this post can be addressed by shifting our analytic frame, but there isn’t much of a fix for this one, short of recruiting many more members to our survey panel, which comes with significant logistical and financial challenges. This is simply an inherent limitation of social media collections with “only” a few thousand respondents or accounts.

