Methodology

To conduct this analysis, researchers collected every Facebook post and tweet created between Sept. 8, 2016 and Dec. 8, 2016, and between Sept. 3, 2020 and Dec. 3, 2020, by any account managed by a voting member of the U.S. Senate or House of Representatives. Researchers used the Facebook Graph API, CrowdTangle² API and Twitter API to download the posts. The resulting dataset contains nearly 166,000 Facebook posts from 698 different members of Congress who used a total of 1,408 Facebook accounts, and more than 357,000 tweets from 669 different members of Congress who used a total of 1,438 Twitter accounts.

This analysis includes all text and some metadata information on media attachments from these Facebook and Twitter posts, including image captions and emojis. Photo and video posts were not included in this analysis unless the post also contained meaningful text, such as a caption. Text that appeared only within images was not included in the analysis. Posts by nonvoting representatives were also excluded.

The broader data collection process is described in more detail here.

Distinctive terms and keywords that produced high levels of audience engagement

Researchers conducted the distinctive terms and engagement analyses using the complete set of 520,791 Facebook posts and tweets created by members of Congress from Sept. 8, 2016 to Dec. 8, 2016, and Sept. 3, 2020 to Dec. 3, 2020.

Text from each document (post) was converted into a set of features representing words and phrases. To accomplish this, researchers applied a series of pre-processing functions to the text of the posts. First, researchers removed 3,109 “stop words” that included common English words, names and abbreviations for states and months, numerical terms like “first,” and a handful of generic terms common on social media platforms like “Facebook” and “retweet.” The text of each post was then converted to lowercase, and URLs and links were removed using a regular expression. Common contractions were expanded into their constituent words, punctuation was removed and each sentence was tokenized using the resulting white space. Finally, words were lemmatized (reduced to their semantic root form) and filtered to those containing three or more characters. Terms were then grouped into one-, two- and three-word phrases.
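The pre-processing steps described above can be illustrated with a minimal sketch. It is not the researchers' code: the stop word and contraction lists are tiny hypothetical stand-ins (the study's list held 3,109 stop words), the `lemmatize` function is a placeholder that returns each token unchanged, and the text is lowercased before stop words are filtered so that matching is case-insensitive.

```python
import re
import string

# Hypothetical stand-ins; the study used a list of 3,109 stop words.
STOP_WORDS = {"the", "and", "for", "not", "facebook", "retweet"}
CONTRACTIONS = {"it's": "it is", "don't": "do not", "we're": "we are"}
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def lemmatize(token):
    """Placeholder: a real pipeline would reduce the token to its root form."""
    return token


def preprocess(text, max_n=3):
    """Turn the text of one post into a list of one- to three-word terms."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)          # strip URLs and links
    text = text.replace("’", "'")              # normalize curly apostrophes
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Remove ASCII punctuation, then tokenize on white space.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [lemmatize(token) for token in text.split()]
    # Keep tokens of three or more characters that are not stop words.
    tokens = [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]
    # Group the remaining tokens into one-, two- and three-word phrases.
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms
```

Under these assumptions, `preprocess("Make a plan to vote!")` yields the three-word term "make plan vote", because the short words "a" and "to" are dropped before phrases are formed.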

Terms producing outsized audience engagement were identified using a multi-stage process. For each year, party, platform and term size combination, researchers trained two L2-penalized ridge regression models (which were fit using stochastic gradient descent): one to predict the logged number of favorites or reactions a post received and another to predict the logged number of shares or retweets. Each model attempted to predict these values using two sets of features: binary flags (“dummy variables”) for each politician, and binary flags indicating whether or not each post mentioned any keyword or phrase that was used by at least 20% of the active politicians in a given election period and in at least 0.1% of the posts.
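The feature construction described above (not the model fitting itself) can be sketched as follows. The function names and input layout are hypothetical, and the sketch assumes that an "active" politician is one with at least one post in the election period and that the 20% and 0.1% thresholds are inclusive.

```python
from collections import defaultdict


def eligible_terms(posts, min_politician_share=0.20, min_post_share=0.001):
    """Return terms used by enough politicians and in enough posts.

    posts: list of (politician_id, set_of_terms) pairs for one election period.
    """
    politicians = {politician for politician, _ in posts}
    users_by_term = defaultdict(set)
    posts_by_term = defaultdict(int)
    for politician, terms in posts:
        for term in set(terms):
            users_by_term[term].add(politician)
            posts_by_term[term] += 1
    return sorted(
        term for term in users_by_term
        if len(users_by_term[term]) / len(politicians) >= min_politician_share
        and posts_by_term[term] / len(posts) >= min_post_share
    )


def binary_features(posts, terms):
    """Build one row of 0/1 flags per post: politician dummies, then term flags."""
    politicians = sorted({politician for politician, _ in posts})
    columns = [f"politician={p}" for p in politicians] + [f"term={t}" for t in terms]
    rows = []
    for politician, post_terms in posts:
        row = [1 if politician == p else 0 for p in politicians]
        row += [1 if term in post_terms else 0 for term in terms]
        rows.append(row)
    return columns, rows
```

The resulting rows could then be passed to any L2-penalized linear model with the logged engagement counts as the target.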

After each model was trained, researchers predicted the favorites/reactions and shares/retweets for each word or phrase flag and each politician and calculated the keyword’s predicted effect for the median politician. These effects were then compared against the predicted engagement for a post from the median politician that didn’t mention any of the words or phrases included in the model, represented as a percentage difference. After combining all of the model predictions for all one-, two- and three-word phrases from each year, party and platform combination, researchers then identified terms that were associated with at least a 10% boost in both favorites/reactions and shares/retweets on both platforms. Finally, researchers averaged the predicted boosts for each keyword across platforms and metrics (favorites, reactions, shares and retweets) to select the top keywords for each party and year. The resulting selection of keywords represents those that were associated with notably higher engagement on both platforms.
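The percentage comparison and the selection rule can be sketched as below. This is an illustration under stated assumptions rather than the researchers' implementation: it assumes the models predict the natural log of engagement and that a keyword's coefficient simply adds to the median politician's baseline prediction. The function names and the layout of the `boosts` dictionary are hypothetical.

```python
import math
from statistics import mean

# The four platform/metric combinations a keyword must clear.
REQUIRED = [
    ("facebook", "reactions"),
    ("facebook", "shares"),
    ("twitter", "favorites"),
    ("twitter", "retweets"),
]


def percent_boost(baseline_log, keyword_coef):
    """Percentage difference between a post with and without the keyword.

    baseline_log: predicted log engagement for the median politician's
    post that mentions none of the modeled terms (assumed natural log).
    """
    baseline = math.exp(baseline_log)
    with_keyword = math.exp(baseline_log + keyword_coef)
    return 100.0 * (with_keyword - baseline) / baseline


def top_keywords(boosts, threshold=10.0, top_n=10):
    """Keep keywords with at least `threshold` percent boost on every
    platform/metric pair, ranked by their average boost.

    boosts: {keyword: {(platform, metric): percent boost}}
    """
    selected = {}
    for keyword, by_metric in boosts.items():
        if all(by_metric.get(key, float("-inf")) >= threshold for key in REQUIRED):
            selected[keyword] = mean(by_metric[key] for key in REQUIRED)
    ranked = sorted(selected.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

Under the log-linear assumption the baseline cancels out, so the boost depends only on the keyword's coefficient: a coefficient of ln(1.5) corresponds to a 50% boost.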

Distinctive keywords and phrases used by each party’s members of Congress on each platform (Facebook and Twitter) were identified using pointwise mutual information. Researchers then calculated the proportion of party members who mentioned each distinct term (phrase). Terms mentioned by fewer than 20% of the members of either party who were active during a given election period were excluded. Researchers then used the proportions to calculate a ratio of differences in mentions between parties for each term. The most distinctive party keywords were defined as those terms with the largest ratio difference between the parties.
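The two quantities named above can be sketched as follows. The text does not give exact formulas, so this is one plausible reading rather than the researchers' calculation: pointwise mutual information is computed from post counts using the standard definition, and the party ratio is taken as one party's share of mentioning members divided by the other's. The function names and inputs are hypothetical, and the 20% exclusion rule is not shown.

```python
import math


def pmi(term_party_count, term_count, party_count, total_count):
    """Pointwise mutual information between a term and a party.

    PMI = log2( P(term, party) / (P(term) * P(party)) ), with each
    probability estimated as a share of all posts.
    """
    p_joint = term_party_count / total_count
    p_term = term_count / total_count
    p_party = party_count / total_count
    return math.log2(p_joint / (p_term * p_party))


def mention_ratio(members_by_party, mentions_by_party, party, other):
    """Ratio of the share of `party` members who mentioned a term to the
    share of `other` members who did.

    Raises ZeroDivisionError if no member of `other` mentioned the term.
    """
    share = mentions_by_party[party] / members_by_party[party]
    other_share = mentions_by_party[other] / members_by_party[other]
    return share / other_share
```

A PMI of zero means the term appears in a party's posts exactly as often as chance would predict; positive values indicate the term is overrepresented in that party's posts.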

As a final step for both keyword analyses, researchers consolidated phrases, removing those that had a word in common with any other phrase that was associated with a larger difference (e.g., “Paycheck Protection” is not shown as one of the most distinctive terms among Republicans in 2020 because “Paycheck Protection Program” was associated with an even larger party difference) and those that were part of a general speech pattern with no important contextual meaning (e.g., “past time” as part of “it was past time for congress to act,” a general call-to-action phrase that is popular among lawmakers, is removed). Terms have been edited slightly in some cases for readability (e.g., “make a plan to vote” instead of “make plan vote”). Words that appeared in retweets are included in this analysis, even if the member who retweeted them did not create the original tweet.
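The first consolidation rule (dropping a phrase that shares a word with a phrase showing a larger difference) can be sketched as below. The function name and the scores in the usage example are hypothetical, and the removal of generic speech patterns, which appears to be a judgment call, is not modeled.

```python
def consolidate(differences):
    """Drop any phrase that shares a word with a higher-scoring phrase.

    differences: {phrase: size of the party (or engagement) difference}
    """
    words = {phrase: set(phrase.lower().split()) for phrase in differences}
    kept = {}
    for phrase, difference in differences.items():
        dominated = any(
            other != phrase
            and differences[other] > difference
            and words[phrase] & words[other]
            for other in differences
        )
        if not dominated:
            kept[phrase] = difference
    return kept


# Usage with made-up scores: the shorter phrase is dropped because it
# shares words with a phrase that has a larger difference.
example = consolidate({
    "paycheck protection": 3.0,
    "paycheck protection program": 4.0,
    "law order": 2.5,
})
```

Here `example` retains "paycheck protection program" and "law order" but not "paycheck protection".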

Domain and link analysis

To identify the individual domains that members of Congress linked to on Facebook and Twitter, researchers needed to determine the website to which each link pointed. First, researchers used the canonical link function from Data Labs’ open-source Python library Pewtils. This function tries to resolve a link to its “most correct” version by, among other things, expanding short URLs from services such as bit.ly and Twitter.³ Researchers identified 166,552 links to 9,228 domains over the time period of the study.
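Once links have been resolved, tallying domains is straightforward. The sketch below uses only the Python standard library as a stand-in and does not reproduce the Pewtils function or its interface. It assumes the links have already been expanded to their final form, strips only a leading "www." and otherwise treats each subdomain as a separate domain. The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    # Links without a scheme need a '//' prefix to be parsed as a host.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(resolved_links):
    """Count how many links point to each domain, skipping unparseable ones."""
    domains = (extract_domain(link) for link in resolved_links)
    return Counter(domain for domain in domains if domain)
```

Counting the keys of the resulting `Counter` gives the number of distinct domains, and summing its values gives the number of links.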