The Pew Research Center used two different software platforms to determine the volume of attention paid to Pope Francis and Pope Benedict. To determine the amount of coverage in the 25 most visited news websites, Pew Research used a tool from General Sentiment. To determine the volume of attention on Twitter, Pew Research used software from Crimson Hexagon.
The majority of the data in this report come from the period between March 13, 2013, and Jan. 31, 2014.
How Volume in the Media Was Determined Using General Sentiment
The Pew Research Center is employing an automated coding platform created by General Sentiment (GS), a social analytics company, to conduct research into the volume of media coverage given to key topics. General Sentiment’s software tracks the volume of mentions for a person, place or subject online during any time period. The platform analyzes over 60 million sources of online content. GS saves all the public content from the RSS feeds of these sites, allowing users to track the volume of the content through natural language processing and text analytics.
This software has been used by a number of news organizations, businesses, and academics.
How General Sentiment Tracks Topics
The basis of General Sentiment’s software is a proprietary computer-learning method that discovers and tracks millions of topics contained in RSS feeds. The process is called “automated topic discovery.”
Each evening, GS’ computers scan all text contained in the millions of RSS feeds they follow and catalog the “topics” discussed each day. Topics can be names, places, hashtags or other common phrases. Most topics are proper nouns.
Once GS’ system catalogues a new topic, it stores information about that topic going forward. From then on, any time the topic appears on the RSS feeds, the system notes that occurrence. This allows for advanced tracking that identifies how often, and where, topics are mentioned online.
For example, imagine a story about a small-town mayor named Jane Rodgers that appeared on the RSS feed of a local newspaper. GS’ computer-learning system would identify the term “Jane Rodgers” as a new topic – if it were the first time that name appeared in public online content – and would then track every instance where “Jane Rodgers” appeared going forward. (In the cases where a name is common, there are advanced Boolean options that help distinguish between different people.)
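The cataloging step described above can be sketched in a few lines of Python. This is a simplified illustration, not General Sentiment's actual implementation: it naively treats capitalized two-word phrases as candidate topics and keeps a running tally of every mention.

```python
# Simplified sketch of automated topic discovery (illustrative only;
# not General Sentiment's actual system). Candidate topics are naively
# assumed to be capitalized two-word phrases.
import re
from collections import Counter

known_topics = set()          # topics catalogued so far
mention_counts = Counter()    # running mention tally per topic

def scan_feed_text(text):
    """Catalog any new topics, then count every occurrence of known ones."""
    for candidate in re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text):
        known_topics.add(candidate)       # first sighting starts tracking
    for topic in known_topics:
        mention_counts[topic] += text.count(topic)

feed = "Jane Rodgers proposed a new budget. Critics said Jane Rodgers moved too fast."
scan_feed_text(feed)
print(mention_counts["Jane Rodgers"])  # 2
```

Once "Jane Rodgers" enters the catalog, every later scan adds her new mentions to the same running tally, which is what allows tracking a topic's volume over time.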
There are two major benefits to this type of automated topic discovery process as compared to traditional keyword searches. The first is that GS can list the most frequently mentioned topics discussed on any given day, week or month. By definition, keyword searches are only effective when researchers know what they are looking for. Some news stories, however, emerge even though researchers may be unaware of them. GS lists the most popular topics on a daily basis, ensuring that we are not missing any major stories.
The second benefit is that the GS search platform allows for easy combinations of all relevant word variations of a topic. For example, when searching for mentions of President Barack Obama, the software also includes references that read “Pres. Obama,” “Barack,” “President Obama,” along with pronouns that are clearly referring to the same person. These types of detailed searches are difficult with traditional keyword searches.
The unit of measure for the GS volume statistics is the reference or mention as opposed to the story or sentence. That means that if a person is mentioned five times within the same post or article, it counts as five mentions.
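Counting by reference rather than by story, while folding in name variations, could look like the sketch below. The variant list and the matching logic are assumptions of this illustration; GS resolves variants, including pronouns, automatically.

```python
# Illustrative sketch: count every mention of a topic across its name
# variants, so five references in one article count as five mentions.
ARTICLE = ("President Obama spoke Tuesday. Pres. Obama, who traveled from "
           "Washington, said Barack Obama would return. Obama then left.")

# Hand-built variant list (an assumption; GS builds these automatically).
VARIANTS = ["President Obama", "Pres. Obama", "Barack Obama", "Obama"]

def count_topic_mentions(text, variants):
    """Tally references, matching longer variants first so that
    'President Obama' is not also double-counted as a bare 'Obama'."""
    count = 0
    remaining = text
    for variant in sorted(variants, key=len, reverse=True):
        count += remaining.count(variant)
        remaining = remaining.replace(variant, "")
    return count

print(count_topic_mentions(ARTICLE, VARIANTS))  # 4
```

The article above yields four mentions, one per reference, even though it is a single story.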
An RSS feed is a web format that allows sites to syndicate frequently updated information automatically. General Sentiment tracks the RSS feeds of over 60 million online sources, including national media, local media, blogs and social media. Users can subscribe to a site's RSS feed and see whenever new content appears. RSS feeds are also a good tool for services like GS to collect large amounts of content in a systematic way.
For this particular report, Pew Research searched the 25 most-visited news sites according to audience data from Hitwise and comScore. The sites were as follows:
GS only tracks text available online through a site's RSS feed. It does not include video or television transcripts. It also excludes duplicated stories that appear in more than one location. In other words, if two stories on different sites have the exact same text, only one of those stories will be included. If there are any differences at all between the texts of two stories – even a one-word change – then both stories will be included. If a specific wire story, for example, appears as a 1,000-word piece on one site but as an edited version of 800 words on another, both versions are included in GS' sample. This is done so that posts on aggregator sites that link to stories from other locations do not get counted repeatedly; counting repeated stories appearing on different pages could have a disproportionate influence on results. This "deduplication" process is less of an issue when Pew Research examines a small group of major sites, since large news producers tend to include fewer stories produced by wire services and other outlets.
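The exact-match deduplication rule can be sketched as follows, using a hash of each story's full text. Hashing here is an assumed implementation detail for the illustration, not a description of GS' internals.

```python
# Sketch of exact-text deduplication: identical stories are collapsed
# to one copy, while any difference at all keeps both versions.
import hashlib

def deduplicate(stories):
    """Return stories with byte-for-byte duplicates removed."""
    seen = set()
    unique = []
    for story in stories:
        digest = hashlib.sha256(story.encode("utf-8")).hexdigest()
        if digest not in seen:      # first time this exact text appears
            seen.add(digest)
            unique.append(story)
    return unique

wire = "The mayor announced a new budget on Tuesday."
edited = "The mayor announced a revised budget on Tuesday."
print(len(deduplicate([wire, wire, edited])))  # 2
```

The repeated wire copy is dropped, but the lightly edited version survives because its text differs, matching the rule described above.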
Unlike some other automated tools, GS can separate the text of reader comments from the text of news articles. This ensures that the content measured is text produced by the news organizations, not responses from readers.
Pew Research spent four months conducting various tests of GS. In particular, researchers were interested in testing whether GS accurately measures the frequencies of topics. To answer this question, researchers compared results from GS with results based on the work of human coders reading the RSS feeds of several major news outlets.
For example, GS reported that on USA Today's RSS feed on Oct. 2, there were eight references to "Kenya" and four to "Christopher Cruz" (who was involved in a New York City biker attack). Those were precisely the numbers researchers found when looking through USA Today's RSS feed themselves.
Researchers repeated this process for a number of other topics and websites, including CNN.com, Washingtonpost.com and several local TV sites. In each instance, the GS results matched the RSS feeds.
A small caveat is that GS is dependent on what is freely available on RSS feeds. A few sites, such as the New York Times, put some of their content behind a paywall. In those instances, only free content is included in GS’ sample.
Pew Research will continually test GS, along with other automated measurement tools, in order to find the most accurate and valid methods for advancing the Center’s research agenda.
How Volume and Sentiment on Twitter Were Determined Using Crimson Hexagon
The analysis of Twitter employed media research methods that combined Pew Research’s content analysis rules with computer coding software developed by Crimson Hexagon (CH). This report is based on examinations of more than 13 million tweets.
Crimson Hexagon is a software platform that identifies statistical patterns in words used in online texts. Researchers enter key terms using Boolean search logic so the software can identify relevant material to analyze. Pew Research draws its analysis sample from all public Twitter posts. Then a researcher trains the software to classify documents using examples from those collected posts. Finally, the software classifies the rest of the online content according to the patterns derived during the training.
Two different analyses were conducted for this project. The Boolean search used to identify tweets about Pope Benedict was (Pope OR Benedict OR Ratzinger). The Boolean search used to identify tweets about Pope Francis was (Pope OR Francis OR Bergoglio).
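Applied to a stream of tweets, those two Boolean filters amount to a simple any-of match. The sketch below illustrates the logic; case-insensitive substring matching is an assumption of this illustration, not a description of Crimson Hexagon's internals.

```python
# Sketch of the report's two Boolean tweet filters: a tweet is relevant
# if it contains any one of the OR'd terms (case-insensitive).
BENEDICT_TERMS = ("pope", "benedict", "ratzinger")
FRANCIS_TERMS = ("pope", "francis", "bergoglio")

def matches(tweet, terms):
    """True if the tweet contains at least one of the search terms."""
    text = tweet.lower()
    return any(term in text for term in terms)

tweets = [
    "Habemus papam! Bergoglio elected.",
    "Ratzinger steps down next month.",
    "Great football match today.",
]
print([matches(t, FRANCIS_TERMS) for t in tweets])   # [True, False, False]
print([matches(t, BENEDICT_TERMS) for t in tweets])  # [False, True, False]
```

Note that because both searches include "Pope," a tweet mentioning only that word would match both filters; the separate name terms are what distinguish the two popes.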
Tone of Twitter Response
Reaction on Twitter can often be at wide variance with public opinion. A Pew Research Center analysis last March compared the results of national polls to the tone of tweets about eight major news events and found that the Twitter conversation can be more liberal than survey responses, while at other times it is more conservative. During the 2012 presidential campaign, Twitter sentiment was much more critical of Republican candidate Mitt Romney than of President Obama.
Researchers classified more than 250 documents in order to “train” these specific Crimson Hexagon monitors. All documents were put into one of four categories: positive, neutral, negative or jokes. A tweet was considered positive if it clearly praised the pope in question. A tweet was considered negative if it was clearly critical of that pope.
CH monitors examine the entire discussion in the aggregate. To do that, the algorithm breaks up all relevant texts into subsections. Rather than dividing the text by story, paragraph, sentence or word, CH treats the "assertion" as the unit of measurement. Thus, posts are divided up by the computer algorithm. Consequently, the results are not expressed as a percent of newshole or percent of stories. Instead, the results are the percent of assertions out of the entire body of stories identified by the original Boolean search terms. We refer to the entire collection of assertions as the "conversation."
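Expressing results as a share of all assertions in the conversation, rather than a share of stories, can be sketched as follows (the category labels come from the four coding categories described above; the counts are invented for illustration):

```python
# Sketch of reporting results as percent of assertions out of the
# whole "conversation," using the report's four coding categories.
from collections import Counter

def conversation_shares(assertion_labels):
    """Return each category's percent of all classified assertions."""
    tally = Counter(assertion_labels)
    total = sum(tally.values())
    return {category: round(100 * n / total, 1)
            for category, n in tally.items()}

# Hypothetical classified assertions, not data from the report.
labels = ["positive", "positive", "neutral", "negative", "jokes",
          "positive", "neutral", "positive"]
print(conversation_shares(labels))  # {'positive': 50.0, 'neutral': 25.0, 'negative': 12.5, 'jokes': 12.5}
```

Because the denominator is the total number of assertions, the four shares always sum to 100 percent of the conversation.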