Q&A: What works and what doesn’t when studying social media data

Ever since Friendster and Myspace gained popularity in the early 2000s, social scientists have been interested in studying the impact of social media sites. More than a decade later, the use of social media platforms is widespread in U.S. society — and so is research into the way Americans use these platforms.

Pew Research Center has enjoyed being part of this explosion of new research. We’ve produced a number of reports using data science tools, including analyses about how members of Congress communicate with their constituents on Facebook, how much of a role bots play on Twitter and how Twitter users shared information about immigration during the first month of the Trump presidency.

These research projects have taught us a great deal about the benefits of studying data from social media, as well as the challenges. In the following Q&A, some of the Center’s data science researchers share their lessons learned from our social media data research. The answers have been edited for space and clarity.

What can the study of social media add to traditional surveys?

Dennis Quinn, data science analyst: With polling, researchers can exercise precise control over the prompts their respondents are offered, and there is a vast body of research underlying survey methods. But the world is really big, and there’s room for a lot of information. Social media can help fill in some of the gaps. Social media data are organic: People create them on their own, unprompted. This can make it hard to distinguish real meaning — signal — from noise. But if you do it right, social media can tell you a lot about the way people express their views in the world, which helps color in what they would tell you in a survey.

Kenneth Olmstead, research associate: For an opinion to make it onto a social media platform, the person must first have that view, choose to put it out into the world and then shape how they want it to be expressed. In a survey, the respondent could be asked about something they have not put much thought into. Compared with surveys, social media data are an expression of attitudes in a way that’s less controlled by the researcher.

Adam Hughes, computational social scientist: Social media data are far from perfect, as we know. That said, a biased and somewhat murky signal can still be informative. For example, when Twitter users post hashtags, they appear to be focused on a particular topic. This action tells us something about what Twitter users are paying attention to, even if we can’t necessarily say who they are, why they started paying attention or what they make of it.
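
To make that concrete, here is a minimal sketch of the kind of attention signal Hughes describes: tallying the hashtags that appear in a set of tweets. This is an illustration in Python, not the Center’s actual pipeline, and the sample tweets are invented.

```python
import re
from collections import Counter

# Invented sample tweets, for illustration only.
tweets = [
    "Watching the hearing now #immigration #Congress",
    "Great thread on #immigration policy",
    "Game day! #SuperBowl",
]

# Extract every hashtag, lowercasing so #Congress and #congress match.
hashtag_pattern = re.compile(r"#(\w+)")
counts = Counter(
    tag.lower()
    for tweet in tweets
    for tag in hashtag_pattern.findall(tweet)
)

# The most common hashtags hint at what these users are attending to,
# even though they say nothing about who the users are.
for tag, n in counts.most_common(5):
    print(f"#{tag}: {n}")
```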

What are some of the limitations of studying social media?

Onyi Lam, computational social scientist: Since social media data are not usually representative of the public as a whole, it may not be worthwhile to collect a random sample in the hopes of saying something about the general public. A better approach may be to collect data from a specific, well-defined group, such as members of Congress or news media organizations.

Dennis Quinn: Organic data can be very messy, and they often require a lot of planning and strategy to collect. As a result, you sometimes find yourself making advanced analytical decisions at the first stage of the process. It’s just a fact of big data that research design, data collection and analysis all meld together. Sometimes you realize in the final stage of a product that you made a big mistake in the first. The way to deal with this is to adopt a more cyclical, more iterative approach to your research: Collect a little data, analyze it, and repeat. This lets you make your mistakes early, in small doses, rather than later, in big doses. Striking the right balance between linear planning and iterative development is one of the biggest challenges.

Stefan Wojcik, computational social scientist: You’re also limited by how the platform designs its interface. The interface constrains the conversation in various ways that don’t necessarily align with what a researcher would want to measure.

Kenneth Olmstead: Access to data is a huge challenge. Many researchers turn to Twitter for data because it is largely available — with some limitations — and mostly public. But Twitter is used by a relatively small portion of the population. Facebook, on the other hand, is used by a larger portion of the population, but most of what is posted to Facebook is unavailable to researchers. Understanding that the available data often don’t show the entire picture of a platform is key to analyzing them.

Skye Toor, data science assistant: One common misconception is that the size of the data is a major obstacle. In reality, it’s the analysis that is usually more challenging. In most cases, the marginal cost of adding more data is low. The marginal cost of doing more analysis is much higher. That’s where more time is spent.

Mike Barthel, senior researcher: The biggest question — and one we can’t always answer — is always, “How do these data translate to people?” On social media, each account could be a person, but it could also be a duplicate or an automated account, and it may carry very sparse information about what that person is like, demographically or otherwise.

The study of social media often involves new, evolving tools and computer learning techniques. How do you determine which tools and techniques meet your standards for publication?

Adam Hughes: The most important part of adopting a new method is validation. For example, we check to ensure that our machine learning models correspond with human judgments about what social media posts are trying to convey. If a predictive model performs much worse than human content coders do — or if it is no more informative than flipping a coin — we wouldn’t use it.

We also value transparency. The tools we use should be well-documented and, when possible, open source. If a model or technique does not have a detailed explanation, underlying code or academic research supporting it, we approach it with lots of skepticism. We also make every effort to validate the algorithms we use, no matter where they come from.
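
As a concrete illustration of the validation step Hughes describes, the sketch below compares a model’s labels against human coders’ labels on the same posts. It is a minimal example, not the Center’s actual workflow; the labels are invented, and the agreement statistics come from scikit-learn.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Invented labels for ten posts: 1 = political, 0 = not political.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # human coders' judgments
model_labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # the model's predictions

# Raw agreement, plus Cohen's kappa, which discounts chance agreement.
accuracy = accuracy_score(human_labels, model_labels)
kappa = cohen_kappa_score(human_labels, model_labels)

print(f"Agreement with human coders: {accuracy:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")

# A model no better than a coin flip would show roughly 50% agreement
# and a kappa near zero -- and would not be used.
```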

Stefan Wojcik: I think it’s not so much that the tools themselves are improper for publication as that they can be improperly applied. People sometimes apply things like topic models to random samples of Twitter data and then make claims about what the broader public is talking about. One needs to think carefully about the population being studied, how much text its members produce and how the results can reasonably be compared.

When approaching a project involving social media data, do you start with a research question or a dataset? Is there a general rule of thumb?

Stefan Wojcik: It can work both ways, but it’s usually better to start with a research question. Otherwise, you end up rudderless. You need some frame of reference to begin a study, because it allows you to make tradeoffs in your research design that ultimately maximize your ability to understand the thing you want to understand. But if you just start with data, there’s nothing to maximize. Data are just data.

Onyi Lam: This is a contentious question! I think different researchers have different preferences, but ultimately, we all have to go back and forth between refining the question and understanding the data from a given data source.

Adam Hughes: We usually start with research questions to avoid post-hoc reasoning. It’s easy to get a big dataset, mine the data for patterns and then come up with stories about what the patterns mean. But that approach can result in stories that don’t generalize beyond the particular dataset you are looking at. The better approach is to draw on your substantive knowledge to develop hypotheses before doing any data analysis.

Skye Toor: Start collecting data with a plan for what you want to analyze, both for research integrity reasons and to keep the scope of any project manageable.

Mike Barthel: It depends how familiar the data source and analysis techniques are. For areas we’ve delved into several times already, we can go in with a research question knowing the limitations we’re likely to face. For new data sources, we’re likely to spend extensive time understanding their characteristics before we zero in on a research question we can actually answer. Of course, we’d never explore a data source in the first place unless we thought it would be of some use to our general research agenda.

What are some of the biggest technical challenges you have faced when working with large amounts of social media data?

Onyi Lam: Cleaning the data is no doubt the biggest challenge!

Skye Toor: The biggest challenge isn’t the size of the data. It’s the quality.

Adam Hughes: When analyzing Facebook posts, we found a huge number of apparent duplicates. The Facebook API (Application Programming Interface — a common method for retrieving large amounts of data from a website) didn’t explain why the duplicates were there. We ended up building an entirely separate machine learning model to detect and remove them.
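
The Center’s deduplication model isn’t detailed here, but as a much simpler illustration of the underlying idea, near-duplicate posts can be flagged by pairwise text similarity. The posts and the similarity threshold below are invented.

```python
from difflib import SequenceMatcher

# Invented Facebook-style posts, two of which are near-duplicates.
posts = [
    "Thanks to everyone who came to our town hall last night!",
    "Thanks to everyone who came out to our town hall last night!",
    "Proud to support the new infrastructure bill.",
]

THRESHOLD = 0.9  # hypothetical cutoff; a real project would tune this

# Compare every pair of posts and flag those above the threshold.
for i in range(len(posts)):
    for j in range(i + 1, len(posts)):
        ratio = SequenceMatcher(None, posts[i], posts[j]).ratio()
        if ratio >= THRESHOLD:
            print(f"Posts {i} and {j} look like duplicates ({ratio:.2f})")
```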

Mike Barthel: Broadly speaking, the challenge is always that you can’t simply dive into the data and start analyzing. You need to check to make sure every step executed the way you thought it would, that nothing got lost or corrupted and that everything is working. In survey data, I can always go in and look at an individual row to find out why the data are wonky. I can’t do that with big datasets. So all the checks need to be automated, and they take time.
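
Here is a minimal sketch of the kind of automated checks Barthel describes, assuming the collected posts sit in a pandas DataFrame with hypothetical columns account_id, text and created_at; the study window and sample rows are invented.

```python
import pandas as pd

def run_checks(df: pd.DataFrame) -> None:
    # No step should have dropped account IDs or introduced duplicates.
    assert not df["account_id"].isna().any(), "missing account IDs"
    assert not df.duplicated(["account_id", "text", "created_at"]).any(), \
        "unexpected duplicate rows"
    # Every post should fall inside the study window (dates invented).
    created = pd.to_datetime(df["created_at"])
    in_window = created.between(pd.Timestamp("2017-01-20"),
                                pd.Timestamp("2017-02-20"))
    assert in_window.all(), "posts outside the collection window"
    print(f"All checks passed on {len(df):,} rows")

# Tiny invented dataset to show the checks running end to end.
df = pd.DataFrame({
    "account_id": [101, 102],
    "text": ["hello", "world"],
    "created_at": ["2017-01-25", "2017-02-01"],
})
run_checks(df)
```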

Kenneth Olmstead: Compute time can be an issue. In the report-writing process, we often find we need the data sliced in some new way or rerun differently for some reason. With very large datasets, that can mean waiting days for a job to finish before we can put the results back in the report.

What are some of the challenges of writing research reports that rely on data science tools but are intended for a general audience?

Mike Barthel: Survey work is standardized enough at this point that we can simply meet industry standards and then mention the details of how we did what we did when we explain our methodology. For instance, we rarely mention in the body of a Pew Research Center report that survey data are weighted (though you would always find that level of detail in the methodology section). For data science tools, everything is still new and there aren’t yet universally accepted industry standards for how to use machine learning, so we want to be fully transparent about our decisions and their limitations at the outset. We try to give more detail about how we got to our conclusions. As a result, it can be hard to provide our major findings without those caveats — which can mean that the substance of the findings is a little harder to follow.

Adam Hughes: One of the biggest challenges is that error in machine learning models is not as easy to describe as a survey’s margin of error. We report statistics like precision and recall (which convey information about a predictive model’s false positive and false negative rates), and we provide information about the agreement between a model’s decisions and human decisions. But for many readers, these quantities may be harder to interpret than the familiar margin of error attached to a probability-based survey estimate. As a result, readers might become overconfident in a particular finding.

Jargon is definitely an issue. The language of data science is often unfriendly to outsiders. To explain what we mean by “support vector machine,” “random effects,” “word2vec,” and related terms, we write terminology sections in our reports that explain what we are talking about.
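
To make the precision-and-recall point above concrete, here is a minimal sketch with invented labels. Precision is the share of posts the model flagged that truly belong to the category; recall is the share of truly positive posts the model managed to find.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels for ten posts: 1 = in the category, 0 = not.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted   = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(true_labels, predicted)  # low => many false positives
recall = recall_score(true_labels, predicted)        # low => many false negatives

print(f"Precision: {precision:.2f}")  # 0.80 here: one false positive
print(f"Recall: {recall:.2f}")        # 0.80 here: one missed positive
```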

Dennis Quinn: Metaphors help a lot, as do short, declarative sentences. Sometimes you really have to break something down step-by-step in a way that might make you look silly to an expert — but you just have to do that. If you’re too afraid to look silly by over-explaining something, then some of your readers aren’t going to understand it.

To browse all of Pew Research Center’s findings and data by topic, visit pewresearch.org.
