Bots are a part of life on Twitter, but determining just how widespread they are can be tricky.

A recent Pew Research Center study explored the role bots play in sharing links on Twitter. The study examined 1.2 million tweeted links – collected over the summer of 2017 – to measure how many came from suspected bot accounts. The result: Around two-thirds (66%) of the tweeted links the Center examined were shared by suspected bots, or automated accounts that can generate or distribute content without direct human oversight.

Like any study of bots on Twitter, the analysis first needed to answer a fundamental question: Which accounts are bots and which accounts aren’t? In this Q&A, Stefan Wojcik, a computational social scientist at the Center and one of the report’s authors, explains how he and his colleagues navigated this question. You can also watch this video explainer with Wojcik to hear more about the methodology of the study.

How can you determine if a Twitter account is a person or a bot?

Stefan Wojcik, computational social scientist at Pew Research Center

It’s a challenge. It’s a burgeoning field and there is always a degree of uncertainty. But the best way is to look at what a particular account is doing. What kind of content is it sharing? Do the tweets convey human-sounding messages? What other accounts does it follow? Has the account tweeted every five minutes for its whole lifespan?

You can come up with a list of characteristics like these to try to determine whether an account is a bot or not. Of course, it would be far too time-consuming to try to observe those characteristics for 140,000 different Twitter accounts (roughly the number of accounts included in the study). A more practical approach is to come up with a reasonably large dataset of accounts that are bots and not bots, and then use a machine learning system to “learn” the patterns that characterize bot and human accounts. With those patterns in hand, you can then use them to classify a much larger number of accounts.
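As a rough illustration of that workflow – not the Center's actual pipeline – the sketch below trains a classifier on a handful of hypothetical, hand-labeled accounts and then scores a new one. The feature names, values and labels are invented for the example.

```python
# A minimal sketch of the supervised-learning approach described above,
# not the actual system used in the study. The features, values and
# labels are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row is one account: [tweets per day, share of tweets containing links,
# followers-to-following ratio, median seconds between tweets]
X_train = np.array([
    [280.0, 0.95, 0.02,   300],   # tweets around the clock, almost all links
    [  4.0, 0.10, 1.10, 14000],   # occasional, conversational
    [150.0, 0.80, 0.05,   600],
    [  9.0, 0.25, 0.90,  9000],
])
y_train = np.array([1, 0, 1, 0])  # hand-applied labels: 1 = bot, 0 = human

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Score a new, unlabeled account. predict_proba returns a value between
# 0 and 1 that can later be compared against a chosen threshold.
new_account = np.array([[120.0, 0.70, 0.03, 900]])
bot_probability = clf.predict_proba(new_account)[0, 1]
print(f"Estimated probability of automation: {bot_probability:.2f}")
```

In practice the labeled set would contain thousands of accounts and far richer features, but the structure is the same: learn patterns from labeled examples, then apply them at scale.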

We investigated different machine learning systems that have been tested publicly. Based on its successful application in past research and our own testing, we selected a system called Botometer.

What is Botometer, and how does it work?

Botometer is a machine learning system developed by researchers at the University of Southern California and Indiana University. The system was trained to recognize bot behavior based on patterns in a dataset of over 30,000 accounts that were first verified by human researchers as either bots or non-bots. Botometer “reads” over a thousand different characteristics, or “features,” for each account and then assigns the account a score between 0 and 1. The higher the score, the greater the likelihood the account is automated. The tool has been used in a number of academic studies and other independent research.
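For readers who want to try the tool themselves, the Botometer team also publishes a Python client (botometer-python). The sketch below is purely illustrative: the credentials are placeholders, and the exact constructor arguments and response fields vary by client and API version, so consult the package's documentation before relying on any of them.

```python
# Illustrative use of the public botometer-python client, not the study's
# own pipeline. Credentials are placeholders, and argument names (e.g.
# rapidapi_key) and response fields differ across client and API versions,
# so check the package documentation before using this as-is.
import botometer

twitter_app_auth = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

bom = botometer.Botometer(
    wait_on_ratelimit=True,
    rapidapi_key="YOUR_RAPIDAPI_KEY",
    **twitter_app_auth,
)

# Request scores for a single account. The response is a nested dictionary
# of automation scores; printing it avoids assuming exact field names.
result = bom.check_account("@example_account")
print(result)
```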

In your study, you set a Botometer score of 0.43 as the threshold between a non-automated account and an automated one. How did you arrive at that threshold?

As others have done in the past, we needed to say whether an account could reasonably be suspected of employing automation – being a “bot.” So we set a threshold, which we selected in a way that would minimize two different kinds of error. Using a Botometer score that was too high would have meant incorrectly classifying many bots as human accounts – otherwise known as a false negative. On the other hand, if we had set a threshold that was too low, we would have incorrectly labeled lots of human accounts as bots – a false positive.

Which type of error is “worse”? It’s a complicated question, and the answer depends on what you want to accomplish. We wanted the most accurate, 10,000-foot view of the prevalence of bots sharing links on Twitter, so we set the threshold in a way that maximized accuracy.

We did that by conducting a human analysis of a subset of the Twitter accounts in our study and then using the results to determine which Botometer threshold would minimize the share of false positives and false negatives in the larger sample.

This analysis, which is informed by human judgments, is an alternative to choosing an arbitrary threshold, which the developers of Botometer explicitly discourage. Our tests eventually led us to settle on a threshold score of 0.43, which is similar to what the Botometer team itself has found to maximize accuracy for a large sample.
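That calibration step can be sketched as a simple search over candidate cutoffs: score a hand-labeled subset of accounts, then pick the threshold that maximizes accuracy (equivalently, minimizes false positives plus false negatives). The scores and labels below are made up for illustration; they are not the study's data and will not reproduce the 0.43 figure.

```python
# A sketch of threshold calibration: given bot-likelihood scores and human
# labels for a subset of accounts, choose the cutoff that maximizes accuracy
# (i.e., minimizes false positives plus false negatives). The scores and
# labels are made up and will not reproduce the study's 0.43 threshold.
import numpy as np

scores = np.array([0.05, 0.12, 0.20, 0.30, 0.41, 0.47, 0.55, 0.68, 0.82, 0.91])
labels = np.array([0,    0,    0,    0,    0,    1,    1,    1,    1,    1])  # 1 = human-coded bot

best_threshold, best_accuracy = 0.0, -1.0
for threshold in np.arange(0.0, 1.001, 0.01):
    predicted_bot = scores >= threshold
    accuracy = np.mean(predicted_bot == labels.astype(bool))
    if accuracy > best_accuracy:
        best_threshold, best_accuracy = threshold, accuracy

print(f"Chosen threshold: {best_threshold:.2f} "
      f"(accuracy on the labeled subset: {best_accuracy:.2f})")
```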

We also went back and looked at accounts that Twitter had suspended as part of its efforts to improve the platform since we collected our data. We found that accounts we suspected of being bots were suspended at higher rates than accounts we identified as human.
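That check amounts to a simple rate comparison by group, which might look something like this with illustrative (not actual) data:

```python
# A sketch of the suspension check: compare how often accounts classified
# as bots vs. humans were later suspended. The data are illustrative.
import pandas as pd

accounts = pd.DataFrame({
    "classified_as_bot": [True, True, True, True, False, False, False, False],
    "later_suspended":   [True, True, False, True, False, False, True, False],
})

# Mean of a boolean column is the suspension rate within each group.
print(accounts.groupby("classified_as_bot")["later_suspended"].mean())
```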

Aren’t there some Twitter accounts that are above your threshold but are not bots? And aren’t there some accounts that are below your threshold but are bots?

Yes, there are some. Several people who read our study pointed this out after they tested their own Twitter accounts against our threshold. But it’s important to remember that we calibrated this threshold in order to get an average estimate of the big-picture role bots are playing in producing tweeted links, not to determine whether particular individual accounts were bots. If that were our goal, we might have used a different method, one that focused more on minimizing false positives.

Measurement error is a natural part of machine learning, and scientific measurement more broadly. Surveys, for example, also have measurement error that can result from poorly worded questions or inattentive respondents, in addition to the more familiar sampling error. So it’s not surprising to see false positives or false negatives when using this system.

Many institutional Twitter accounts – like those of news organizations that tweet multiple links to the same article each day – may demonstrate bot-like behavior even though they are not bots. How did your study account for these kinds of accounts?

We recognized that as a potential issue. If institutional accounts were responsible for a substantial share of the tweeted links, then our understanding of bot behavior might be very different. So we performed a test to see what impact – if any – these “verified” accounts might have had. We removed verified accounts that were classified as bot accounts and reran our analysis. We found that the percentages of tweeted links posted by bots were virtually the same, with or without verified accounts. This gave us confidence that our results were not primarily driven by these verified institutional accounts.
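Conceptually, that robustness check is a before-and-after comparison: compute the share of links posted by suspected bots, drop links from verified accounts that were classified as bots, and recompute. The sketch below uses made-up link-level data to show the structure of the comparison.

```python
# A sketch of the robustness check: compare the share of tweeted links
# posted by suspected bots before and after removing verified accounts
# that were classified as bots. The link-level data are made up.
import pandas as pd

links = pd.DataFrame({
    "posted_by_suspected_bot": [True, True, False, True, False, True, False, True],
    "poster_is_verified":      [False, True, False, False, True, False, False, False],
})

share_all = links["posted_by_suspected_bot"].mean()

# Drop links from verified accounts that the classifier flagged as bots,
# then recompute the share on the remaining links.
keep = ~(links["posted_by_suspected_bot"] & links["poster_is_verified"])
share_without_verified_bots = links.loc[keep, "posted_by_suspected_bot"].mean()

print(f"Share of links from suspected bots (all accounts): {share_all:.0%}")
print(f"Share after removing verified bot-classified accounts: {share_without_verified_bots:.0%}")
```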

What takeaways about machine learning in general emerged from this project?

Machine learning can be a valuable tool for research. It can be especially helpful when examining large amounts of social media data or other digital trace data on the web. In fact, in recent years, Pew Research Center has expanded its research using machine learning.

We also know that machine learning is a growing field and that there is always a degree of uncertainty in how well particular approaches work. We feel that the best way to use this tool is to be transparent in the decisions we make, be open about the possibility for error and be careful when interpreting our findings. We’re eager to contribute to the advances being made in natural language processing, applied statistics and machine learning, and we look forward to exploring their advantages and limitations.

John Gramlich is an associate director at Pew Research Center.