About Data Labs
Pew Research Center’s Data Labs uses computational methods to complement and expand on the Center’s existing research agenda. The team collects text, network and behavioral datasets; uses innovative computational techniques and empirical strategies for analysis; and generates original research. Data Labs also explores the limitations of these data and methods and works toward establishing standards for use and analysis.
Data Labs produces its own reports and collaborates with other research groups at the Center, applying new computational approaches to existing research questions. The team hosts guest speakers, invites graduate students to serve as summer fellows and organizes methods-focused workshops for the Center’s staff.
Data Labs also manages the Center’s computing infrastructure. That includes building high-performance computing systems and databases that facilitate web data collection and processing; deploying platforms that facilitate collaborative, replicable analysis in R and Python; and developing systems to automate research tasks such as content classification for machine learning.
As is true for Pew Research Center as a whole, Data Labs is nonpartisan and nonadvocacy. The team values independence, objectivity, accuracy, rigor, humility, transparency and innovation.
Why did Pew Research Center create Data Labs?
Pew Research Center created Data Labs in response to the changing nature of data on human behaviors and attitudes. Technological advancements have resulted in explosive growth of new forms of data relevant to public policy and civic engagement. In unprecedented ways, the public is expressing views online and leaving behind electronic trails of behavior: whom they connect to in social networks, what they search for and what content they post on social media. At the same time, speeches, press announcements and debates by policymakers and political candidates are now archived in digital repositories and made available online.
While most of these digital traces of communication and behavior are unstructured and not amenable to analysis in raw form, a number of new technologies are making it easier to collect and process these data. These technologies include:
- Online distributed labor platforms: These platforms divide a major data collection effort into a series of small tasks that outside individuals can complete. Examples include CrowdFlower and Amazon’s Mechanical Turk (AMT) service. This is sometimes referred to as “crowdsourcing.”
- Internet data collection: This includes harvesting the content of web pages and parsing out fields (e.g., dates, names, links and tables) for analysis; and querying APIs online to obtain formatted data that requires only minimal parsing.
- Natural language processing (NLP): This includes processing raw human language, including text, speech and video, in order to produce usable data such as topics discussed, entities mentioned and sentiment.
- Machine learning: This is the process of using algorithms that can learn from and make predictions based on raw or processed data and is often applied to text or images.
- Network data: This involves analyzing the pattern of connections between people, the way information flows between people and/or the things people have in common, often to generate insight about an entire social system.
- Experimental and quasi-experimental designs: This is the process of leveraging a source of random or ignorable variation to estimate a causal effect.
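As a rough illustration of the "Internet data collection" item above, the following minimal Python sketch parses dates and links out of a web page using only the standard library. The page content, the `FieldExtractor` class and the `extract_fields` helper are hypothetical and invented for this example; they are not a description of the tools Data Labs actually uses.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect link targets and <time datetime="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "time" and attrs.get("datetime"):
            self.dates.append(attrs["datetime"])


# Hypothetical page content, invented for illustration only.
SAMPLE_PAGE = """
<html><body>
  <h1>Press release</h1>
  <time datetime="2017-01-15">Jan. 15, 2017</time>
  <p>Read the <a href="https://example.org/bill">bill text</a> and
     the <a href="https://example.org/statement">full statement</a>.</p>
</body></html>
"""


def extract_fields(html):
    """Return the links and machine-readable dates found in an HTML string."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "dates": parser.dates}


fields = extract_fields(SAMPLE_PAGE)
print(fields)
```

In practice the HTML would come from a fetched page rather than a string literal, and real pages usually need more tolerant parsing than this sketch provides.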
Data Labs is a testing ground for these data sources and approaches to analyzing them, with the goal of extracting meaning from the data through creative design, innovative methods, thoughtful measurement and sound deployment.
Logos Political Data Initiative
In its first year, Data Labs has focused on building its Logos Political Data Initiative, which facilitates research on what elected officials say and how that relates to whom they represent and the way they govern. The initiative’s database presently includes congressional press releases, media coverage and social media posts. These data exist alongside measures of legislative voting behavior, district composition and constituent characteristics, and information about campaign finance and political donations. The team’s first report using these data can be found here.
To this end, the team designed and built a custom data management and analysis system using the Python-based Django web development framework. Django allows the team to efficiently manage its Logos database, which is built in PostgreSQL and houses data from dozens of sources.
Logos contains information about the attributes and rhetoric of almost 15,000 unique politicians and spans both candidates and elected officials. The primary text data sources used to capture political rhetoric include Facebook posts accessed via the Open Graph API, press releases collected from official congressional websites and press releases issued by wire services and collected by LexisNexis.
Basic information about individual members and candidates for Congress comes from the United States GitHub repository (maintained by GovTrack.us, Sunlight Labs and ProPublica). In addition, Logos includes campaign finance information from the Federal Election Commission; legislative voting records from Voteview.org; and background information from Ballotpedia, CQ Roll Call, Wikipedia entries and official congressional biographies. Logos also contains the full text and metadata for bills and resolutions from the U.S. House and Senate.
In order to assess political representation, Logos also incorporates state- and district-level data, drawn from the 2013 American Community Survey five-year estimates. Logos contains information on all 441 current congressional districts, 502 historical districts (prior to the 2010 redistricting) and 56 states and territories. District and state information also includes data from The Cook Political Report and other sources.
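To make the idea of linking rhetoric to district characteristics more concrete, here is a minimal sketch of how such records might be related in a database. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and every table name, column name and row is hypothetical; it does not reflect the actual Logos schema or its Django models.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; all table and column
# names are hypothetical and not taken from the Logos database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE district (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,
    median_age REAL
);
CREATE TABLE politician (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    district_id INTEGER REFERENCES district(id)
);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    politician_id INTEGER NOT NULL REFERENCES politician(id),
    source TEXT NOT NULL,
    body TEXT NOT NULL
);
""")

# Invented rows for illustration only.
conn.execute("INSERT INTO district VALUES (1, 'Example-01', 38.5)")
conn.execute("INSERT INTO politician VALUES (1, 'Jane Example', 1)")
conn.executemany(
    "INSERT INTO document (politician_id, source, body) VALUES (?, ?, ?)",
    [
        (1, "press_release", "Statement on the budget."),
        (1, "facebook_post", "Town hall tonight."),
    ],
)

# Join each document to its author and that author's district attributes.
rows = conn.execute("""
    SELECT p.name, d.label, d.median_age, doc.source
    FROM document AS doc
    JOIN politician AS p ON p.id = doc.politician_id
    JOIN district AS d ON d.id = p.district_id
    ORDER BY doc.id
""").fetchall()

for row in rows:
    print(row)
```

The point of the sketch is only that keeping politicians, their documents and their districts in separate, linked tables lets a single query place what an official says alongside whom that official represents.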