To analyze the news that members of Congress share with their Facebook followers, researchers obtained a complete set of posts created by members of the U.S. Senate and House of Representatives and posted on their official pages between Jan. 2, 2015, and July 20, 2017. Researchers used the Facebook Graph API to download the posts.
The first step in the analysis was to identify each member’s official Facebook page. Many members of Congress maintain multiple social media accounts, consisting of one or more “official,” campaign or personal accounts. Official accounts are used to communicate information as part of the member’s representational or legislative capacity, and U.S. Senate and House members may draw upon official staff resources appropriated by Congress when releasing content via these accounts. Personal and campaign accounts may not draw on these government resources under official House and Senate guidelines.10
Researchers started with an existing dataset of official and unofficial accounts for members of the 114th Congress, and expanded it with supplementary data on members of the 115th Congress from the open-source @unitedstates project. Researchers also manually checked for additional accounts by reviewing the House and Senate pages of members who were not found in the initial dataset. Every account was then manually reviewed and verified.
The research team first examined each account’s Facebook page and confirmed that it was associated with the correct politician. All misattributions were manually corrected by Center experts, resulting in a list of 1,416 total Facebook accounts. Accounts were then classified as official or unofficial based on the links to and from their official “.gov” pages. Accounts were considered official if they were referenced by a member’s official house.gov or senate.gov homepage. Congressional rules prohibit linking between official (.gov) and campaign websites or accounts, as well as linking from an official site or account to a personal site or account.
In cases where it was not clear that a Facebook page had ever been used in an official capacity (particularly for former members who no longer have active webpages), the most recent historical copy of the member’s official webpage was manually reviewed using the Library of Congress online archive to determine whether a link to the account had been present when the webpage was active. The resulting list of all official accounts for members of the 114th and 115th Congresses was then used to collect the Facebook posts published by each page between Jan. 2, 2015, and July 20, 2017.
Using the Facebook Graph API, researchers obtained Facebook posts for members of the 114th Congress (2015-2016) between Dec. 30 and 31, 2016, so that members who left office before the 115th Congress began would be included in the sample. On July 25, 2017, researchers obtained posts for members of the 115th Congress (2017-2018).
After obtaining posts, researchers checked the combined dataset and identified a small number of duplicate posts from members of Congress who served in both the 114th and 115th Congresses. The duplicates had been introduced by changes in the posts’ unique Facebook API identifiers, resulting in mismatches between the latest copy of certain posts and older copies that had previously been collected. These duplicates frequently occurred on posts that had been edited or modified slightly – often with nearly identical timestamps and only single-character variations (e.g., deleting a space). The unique identifiers of these duplicates were also very similar, differing by only a few digits in specific locations of the identifier string. Across these cases, the posts’ timestamps were rarely separated by more than a few minutes, and were always within 24 hours of each other.
An additional set of duplicates was also found among posts produced by pages that had changed names at some point during the timeframe. These duplicates most frequently occurred after the end of election season, when a number of politicians change the titles of their Facebook pages – removing suffixes such as “for Congress” or adding honorifics like “Senator” to their name. In these cases, the timestamps and content of the posts were identical, but the prefixes of the posts’ unique identifiers were different.
There were several patterns across multiple post fields that appeared to distinguish duplicates from unique posts. However, no clear set of rules could be identified that comprehensively explained these patterns, so researchers employed a machine learning approach to isolate and remove the duplicate posts.
First, researchers scanned the entire set of posts for each account using a sliding window of two days, and identified all pairs of potential duplicates within each window that matched either of the following criteria:
- Identical timestamps
- TF-IDF cosine similarity of 0.6 or above, and a Levenshtein difference ratio of 60% or higher, on the text of the post11
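The screening step above can be sketched in Python as follows. This is a minimal illustration, not the researchers’ actual code. The tokenizer, the smoothed IDF formula, the definition of the Levenshtein ratio (one minus edit distance divided by the longer string’s length) and the function names are all assumptions made for the example. Only the two-day window and the 0.6 thresholds come from the description above.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations


def tokenize(text):
    # Assumption: simple lower-cased word tokens.
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per text, using a smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def levenshtein_ratio(a, b):
    """Assumed definition: 1 - edit_distance / length of the longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))


def candidate_pairs(posts, window=timedelta(days=2), cos_min=0.6, lev_min=0.6):
    """posts: list of (post_id, timestamp, text) tuples for a single account."""
    posts = sorted(posts, key=lambda p: p[1])
    vectors = tfidf_vectors([p[2] for p in posts])
    pairs = []
    for i, j in combinations(range(len(posts)), 2):
        (id_a, ts_a, text_a), (id_b, ts_b, text_b) = posts[i], posts[j]
        if ts_b - ts_a > window:
            continue  # outside the two-day window
        same_time = ts_a == ts_b
        similar = (cosine(vectors[i], vectors[j]) >= cos_min
                   and levenshtein_ratio(text_a, text_b) >= lev_min)
        if same_time or similar:
            pairs.append((id_a, id_b))
    return pairs
```

Comparing every pair is quadratic in the number of posts per account. A production version would likely advance a window pointer over the sorted posts instead.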
From these “candidate duplicates,” a random sample of 1,000 pairs was extracted and manually reviewed. Researchers identified whether or not the two posts in each candidate pair were in fact duplicates. Only 24% were determined to be true duplicates. These results were then used to train a machine learning algorithm, using 750 of the pairs to train the model, and 250 to evaluate its performance. Researchers trained a random forest model using a variety of features representing the similarity of the two posts across different fields, and interactions between these features. The most discriminating features included whether the two posts shared an identical timestamp, the number of digits that overlapped between the posts’ ID numbers, and the difference between the posts’ timestamps in seconds. The resulting model achieved high performance, with an average precision and recall of 98% – of the 250 potential duplicate pairs used to evaluate the model, it missed only 4 duplicates and correctly classified the remaining 246.
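As an illustration of the three most discriminating features named above, the sketch below computes them for a single candidate pair. The field names (`id`, `created`), the example identifiers and the position-by-position reading of “digits that overlapped” are assumptions for the example. The report does not specify how the overlap was measured or which software was used.

```python
from datetime import datetime


def id_digit_overlap(id_a, id_b):
    """Count positions at which both ID strings carry the same digit.

    Assumption: overlap is measured position by position.
    """
    return sum(1 for a, b in zip(id_a, id_b) if a == b and a.isdigit())


def pair_features(post_a, post_b):
    """post_*: dict with 'id' (str) and 'created' (datetime). Hypothetical schema."""
    seconds_apart = abs((post_a["created"] - post_b["created"]).total_seconds())
    return {
        "same_timestamp": int(seconds_apart == 0),
        "seconds_apart": seconds_apart,
        "id_digit_overlap": id_digit_overlap(post_a["id"], post_b["id"]),
    }


# Hypothetical pair: an edited post re-collected under a slightly different ID.
a = {"id": "123_4567890", "created": datetime(2016, 11, 9, 10, 0, 0)}
b = {"id": "123_4567899", "created": datetime(2016, 11, 9, 10, 2, 30)}
features = pair_features(a, b)
```

Feature dictionaries of this kind, one per hand-labeled pair, could then be passed to any random forest implementation, such as scikit-learn’s `RandomForestClassifier`.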
The model was then applied to the entire collection of potential duplicates, removing duplicates when detected. In total, 23,849 posts (5% of the original sample) were identified as duplicates and excluded before the analysis began.
The final dataset included only those posts that were produced by a member’s primary official Facebook account during the time in which they were serving a term as a representative or senator in Congress. The resulting dataset contains 447,684 Facebook posts from 581 different members of Congress. Photo and video posts were included in this analysis. The findings presented in this report exclude posts by nonvoting representatives and include only posts produced by members who were active in a given Congress, defined as members who produced at least 10 Facebook posts during that time period. Members who meet this threshold in only one of the Congresses are included only for that specific Congress.
Data processing and outlet classification
In order to identify the individual media outlets that members of Congress link to on Facebook, researchers needed to identify the website from which each of the links was shared. First, researchers used a script to follow each link to its final endpoint, allowing redirects along the way, and then identified its domain. For example, if a member of Congress shared a link to http://pewrsr.ch/2vS4S1x, the script would have followed the shortened link to its expanded version – https://www.pewresearch.org/fact-tank/2017/08/21/highly-ideological-members-of-congress-have-more-facebook-followers-than-moderates-do/ – and then simplified it to just pewresearch.org. This process resulted in a list of outlets that indicated which members of Congress linked to each site, and how many times they linked to it. Without executing this process, links shared via URL-shortening services may not be correctly attributed to the website hosting the actual content.
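The link-resolution step can be sketched with the Python standard library, as below. This is an assumed reconstruction, since the report does not describe the script itself. The function names and the User-Agent header are illustrative. `resolve_final_url` makes a live network request, so it is defined here but not called.

```python
import urllib.request
from urllib.parse import urlsplit


def resolve_final_url(url, timeout=10):
    """Follow any redirects and return the final URL (makes a network request)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def hostname_of(url):
    """Return the lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


expanded = ("https://www.pewresearch.org/fact-tank/2017/08/21/"
            "highly-ideological-members-of-congress-have-more-facebook-"
            "followers-than-moderates-do/")
domain = hostname_of(expanded)  # "pewresearch.org"
```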
To determine when (and when not) to collapse subdomains to their root domain, researchers manually reviewed the 4,586 subdomains that were posted by members of Congress at least five times across the sample timeframe. Of these, researchers identified 857 subdomains that varied substantially in their content from other subdomains that shared the same root domain. For these cases, such as http://paulryan.house.gov and http://pelosi.house.gov, the full subdomains were preserved and treated as unique sources. The remaining subdomains were collapsed to their root domain.
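A minimal sketch of the collapsing rule, assuming the manually reviewed subdomains are stored in a set. Only the two house.gov entries come from the text, and the full list of 857 is not reproduced here. Taking the last two labels as the root domain is a simplification that fails for suffixes such as “.co.uk”. A real implementation would consult the Public Suffix List.

```python
# Subdomains kept as distinct sources; only these two examples appear in the text.
PRESERVED_SUBDOMAINS = {"paulryan.house.gov", "pelosi.house.gov"}


def root_domain(host):
    """Naive root domain: the last two dot-separated labels (an assumption)."""
    return ".".join(host.split(".")[-2:])


def source_key(host, preserved=PRESERVED_SUBDOMAINS):
    """Return the preserved subdomain if listed, otherwise the root domain."""
    host = host.lower()
    return host if host in preserved else root_domain(host)
```

For example, `source_key("pelosi.house.gov")` keeps the full subdomain, while `source_key("www.pewresearch.org")` collapses to `pewresearch.org`.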
Next, researchers developed a classification codebook for determining which sites consisted of national news media outlets. Researchers defined national media outlets as “media sites where the majority of links direct to stories about national issues, events, policies, and members of Congress or the President.” The category included news organizations, national link aggregators, news magazines, and national audience niche content (such as military-focused news publications).
Three researchers classified an identical random sample of 100 websites in order to ensure that the coding instructions were valid. The coders’ ratings had an average Fleiss’s Kappa of 0.90.
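For readers unfamiliar with the statistic, the following sketch computes Fleiss’s Kappa from scratch for a set of items that were each rated by the same number of coders. The input layout and the toy ratings are invented for illustration. They are not the study’s data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """ratings: list of per-item lists, each holding one label per coder."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    categories = {label for c in counts for label in c}
    # Observed agreement: share of agreeing coder pairs, averaged over items.
    per_item = [
        (sum(v * v for v in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items
    # Expected agreement from the overall share of each category.
    p_expected = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_expected == 1:
        return 1.0  # only one category was ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: three coders, four websites, 1 = national news outlet.
toy = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1]]
kappa = fleiss_kappa(toy)
```

In the toy example the coders agree fully on two of four sites, which yields a kappa of about 0.33, far below the 0.90 reported above.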
Creating congressional sharing scores
Congressional sharing scores for national news outlets capture the average ideology of members of Congress – measured using DW-NOMINATE – who link to a story from that outlet in a post on their Facebook page. The scores take into account the number of times members shared stories from each outlet.
To determine each member’s political ideology, researchers first obtained DW-NOMINATE ideology estimates, which are based on legislative roll call votes, to capture the ideological position of members of Congress who link to particular websites. After joining the ideology estimates to each media link share, researchers calculated the average ideology estimate for each media outlet included in the study. This is the congressional sharing score. This score runs from positive 0.75 (most conservative) to -0.56 (most liberal) across the time period examined here.
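The averaging step can be illustrated as follows. The member identifiers, outlet names and ideology values are hypothetical. Each link share contributes one observation, so a member who links to an outlet many times pulls that outlet’s score further toward the member’s own ideology estimate.

```python
from collections import defaultdict
from statistics import mean


def sharing_scores(shares, ideology):
    """Average DW-NOMINATE estimate of the members sharing each outlet.

    shares:   iterable of (member_id, outlet) tuples, one per link share
    ideology: dict mapping member_id to a DW-NOMINATE estimate
    """
    by_outlet = defaultdict(list)
    for member_id, outlet in shares:
        if member_id in ideology:  # skip shares without an ideology estimate
            by_outlet[outlet].append(ideology[member_id])
    return {outlet: mean(values) for outlet, values in by_outlet.items()}


# Hypothetical data, not taken from the report.
ideology = {"member_a": 0.5, "member_b": -0.3}
shares = [
    ("member_a", "outlet-x.example"),
    ("member_a", "outlet-x.example"),
    ("member_b", "outlet-x.example"),
    ("member_b", "outlet-y.example"),
]
scores = sharing_scores(shares, ideology)
```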
Researchers used regression models to examine the relationship between the congressional sharing scores of each outlet and the number of re-shares, comments, and likes that posts containing links to those outlets received from members’ Facebook audience. The distributions of re-shares, comments and likes were heavily skewed, reflecting a small number of posts that received much more engagement than the average post that contained a link. To address this skew, researchers took the base-10 logarithm of the total number of re-shares, comments and likes for each post.
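A short sketch of the transformation is shown below. The report does not say how posts with zero re-shares, comments or likes were handled. Adding 1 before taking the logarithm is an assumption made here so that the function is defined at zero. Setting `offset=0` gives the plain base-10 logarithm.

```python
import math


def log_engagement(count, offset=1):
    """Base-10 log of an engagement count.

    The default offset of 1 is an assumption (not stated in the report).
    """
    if count < 0:
        raise ValueError("engagement counts cannot be negative")
    return math.log10(count + offset)
```

On this scale, posts with 9, 99 and 999 likes map to 1, 2 and 3, so a post with 100 times the engagement of another differs from it by only two units.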
Next, researchers specified three separate regression models with each kind of engagement as a dependent variable. The key explanatory variables included each outlet’s congressional sharing score, the party of the member of Congress sharing each link and an interaction term between party and the congressional sharing score. This specification allowed researchers to examine the relationship between sharing scores and Facebook engagement conditional on the party of the member who posted the link. The regression models also included random intercepts for each week in the data and for each member of Congress, which help normalize the estimated relationships across members and over time.
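One way to write the specification described above is shown below. The notation is introduced here for illustration and does not come from the report. The normally distributed random intercepts are the conventional assumption for models of this kind.

```latex
\log_{10}(y_{ijt}) = \beta_0 + \beta_1 s_i + \beta_2 R_j + \beta_3 \,(s_i \times R_j) + u_j + v_t + \varepsilon_{ijt},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad v_t \sim \mathcal{N}(0, \sigma_v^2)
```

Here $y_{ijt}$ is the number of re-shares, comments or likes on post $i$ by member $j$ in week $t$, $s_i$ is the congressional sharing score of the outlet linked in the post, $R_j$ is an indicator for the member’s party, and $u_j$ and $v_t$ are the random intercepts for members and weeks. The coefficient $\beta_3$ captures how the relationship between sharing score and engagement differs by party.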
List of national news outlets
This report examines all national news outlets whose stories were shared at least 25 times by members of Congress between Jan. 2, 2015 and July 20, 2017. The outlets include:
News source | Congressional sharing score | Median congressional sharing score | 25th percentile | 75th percentile
Pew Research Center is a nonprofit, tax-exempt 501(c)(3) organization and a subsidiary of The Pew Charitable Trusts, its primary funder.
© Pew Research Center, 2017