As text and imagery have increasingly been digitized, a new field of computerized text coding has developed. Coding no longer depends on a human examining a physical newspaper and counting the references to, for example, world leaders. Now a computer can scan the digital version of the front page and be trained to identify mentions of those leaders. These algorithms, developed in computer science as well as the social sciences, exist as both open source and commercial tools.
The computer brings several major advantages to coding: It is incredibly fast; it can code a volume of text never before thought possible; and it is wonderfully consistent.
But computer coding has some built-in disadvantages as well. First, it can only work on text that has been digitized (e.g., in the form of a website story) and is publicly available, so certain channels of media are difficult to code reliably, including local television and radio. Second, it remains limited in its ability to judge the kind of nuance that seems relatively straightforward to the human coder: Is this article generally favorable toward the president, or generally unfavorable? Is the writer using sarcasm, or is the writer being serious? Computer algorithms that do this kind of “sentiment analysis” are being created, but to some extent this work remains in its early days. Third, it can sometimes be difficult to understand the shape and limits of the universe of digitized data being coded, now that the days of the discrete, delimited print newspaper are behind us.
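To make the idea of automated sentiment analysis concrete, here is a minimal sketch of the simplest approach, lexicon-based scoring. This is purely illustrative: the word lists and the `sentiment_score` function are hypothetical, and commercial tools use far more sophisticated methods.

```python
# Illustrative keyword lexicons -- assumptions for this sketch,
# not the vocabulary of any actual coding tool.
POSITIVE = {"praise", "success", "strong", "favorable"}
NEGATIVE = {"scandal", "failure", "weak", "unfavorable"}

def sentiment_score(text: str) -> int:
    """Return positive-word count minus negative-word count.

    A positive result suggests favorable tone, a negative result
    unfavorable tone, and zero suggests neutral or mixed tone.
    """
    words = text.lower().split()
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives
```

Even this toy version hints at why nuance is hard for machines: a sarcastic sentence can be full of “positive” words while its actual tone is negative.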
Thus far, we at Pew Research Center have dealt with two commercial firms that provide computer coding software: Crimson Hexagon (CH) and General Sentiment (GS). We approached these new tools, as is true for any new methodology we adopt, with optimism, curiosity and a battery of tests in hand.
Specifically, researchers at the center spent more than 12 months testing Crimson Hexagon, the first coding algorithm with which we worked. To test the validity of the software, two human researchers coded 200 stories that were also coded by the algorithm. The human coders and algorithm agreed on the coding 81% of the time, passing our general standard for intercoder reliability.
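The 81% figure above is a percent-agreement statistic: the share of stories on which two coders (human or machine) assigned the same code. The center does not publish its exact computation, but a simple version of the measure can be sketched as follows; the function name and data layout are assumptions for illustration.

```python
def percent_agreement(coder_a: list, coder_b: list) -> float:
    """Share of items on which two coders assigned the same code.

    Each argument is a list of codes, one per story, in the same order.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must rate the same set of items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)
```

With 200 stories, agreement on 162 of them yields 0.81, i.e., the 81% reported. Note that plain percent agreement does not correct for chance agreement, which chance-adjusted statistics such as Krippendorff's alpha are designed to address.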
In addition to validity tests of the platform itself, Pew Research conducted separate examinations of human intercoder reliability to show that the process used to train the algorithm to code complex concepts is replicable. The first test had five researchers each code the same 30 stories, which resulted in an agreement of 85%.
A second test had each of the five researchers build his or her own separate CH project to see how the results compared. This test assessed not only coder agreement, but also how the algorithm handles repeated examinations of the same content when different human trainers work on the same subject. The five separate monitors produced results that agreed with one another 85% of the time.
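With five coders rather than two, one common way to summarize agreement is to average the percent agreement over every pair of coders. The source does not specify which statistic the center used, so the following is a sketch of that pairwise approach, with hypothetical names and data layout.

```python
from itertools import combinations

def average_pairwise_agreement(codings: list) -> float:
    """Mean percent agreement across all pairs of coders.

    `codings` is a list with one entry per coder; each entry is a list
    of codes for the same stories in the same order.
    """
    pair_scores = []
    for a, b in combinations(codings, 2):
        matches = sum(x == y for x, y in zip(a, b))
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)
```

For five coders this averages over ten coder pairs, so a single idiosyncratic coder lowers the overall figure only modestly.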
Following this, Pew Research Center spent four months conducting various tests of GS. In particular, researchers were interested in testing whether GS accurately measures the frequencies of topics. To answer this question, researchers compared results from GS with results based on the work of human coders reading the RSS feeds of several major news outlets.
For example, GS said that on USA Today’s RSS feed on Oct. 2, there were eight references to “Kenya” and four to “Christopher Cruz” (who was involved in a New York City biker attack). Those were precisely the numbers researchers found when they looked through USA Today’s RSS feed themselves.
Researchers repeated this same process for a number of other topics and websites, including CNN, the Washington Post and several local TV sites. In each instance, the GS results matched the RSS feeds.
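The frequency check described above amounts to counting how many times a topic phrase appears in a feed's text and comparing that count to the tool's output. A minimal version of such a counter can be sketched as follows; the function name is an assumption, and real validation would also need to parse the RSS XML and decide what counts as a "reference."

```python
import re

def count_mentions(feed_text: str, topic: str) -> int:
    """Count case-insensitive, whole-phrase mentions of a topic.

    Word boundaries (\\b) prevent partial-word matches, e.g. "Kenya"
    will not match inside "Kenyan".
    """
    pattern = r"\b" + re.escape(topic) + r"\b"
    return len(re.findall(pattern, feed_text, flags=re.IGNORECASE))
```

Comparing such counts against a vendor's reported frequencies is exactly the kind of spot check the researchers describe: if the tool says eight references and an independent count of the same text also finds eight, the measurement is validated for that case.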
Pew Research Center will continue to test these and other automated measurement tools in order to find the most accurate and valid methods for advancing the center’s research agenda.