Numbers, Facts and Trends Shaping Your World

What Web Browsing Data Tells Us About How AI Appears Online

Methodology

Table of Contents

Table of Contents

The findings in this report are based on an analysis of the browsing behaviors of 900 U.S. adults who are members of the KnowledgePanel Digital online panel, a subset of Ipsos’ KnowledgePanel. These panelists qualified for the study because they:

  • Responded to a Pew Research Center pilot survey conducted on KnowledgePanel Digital in November 2023 (n=1,254)
  • Were still active members of both KnowledgePanel (n=1,206) and KnowledgePanel Digital (n=1,102) as of March 2025
  • Consented in 2025 to Ipsos providing their individual-level browsing data to clients (n=992)
  • Were active on KnowledgePanel Digital at least once between March 1 and March 31, 2025 (n=900)

After the March monitoring period, panelists’ web browsing activity logs were delivered on April 7, 2025. The dataset containing these logs included 2.5 million visited URLs with metadata, including an ID for the panelist who visited the URL, device information, the time when the URL was accessed and the duration of the visit.

About IPSOS KnowledgePanel and KnowledgePanel Digital

KnowledgePanel is a probability-based online panel designed to be representative of the adult U.S. population. The recruitment process employs an addressed-based sampling methodology from the latest Delivery Sequence File of the USPS, which has been estimated to cover as much as 98% of the population, although some studies suggest that the coverage could be in the low 90% range.5 Here are additional details about the KnowledgePanel.

KnowledgePanel Digital consists of members of the broader KnowledgePanel who meet the following conditions:

  • Access the internet using an Android smartphone, Apple smartphone, Android tablet, Apple tablet, Windows computer or Apple computer.
  • Have agreed to join KnowledgePanel Digital and have installed the RealityMeter app on at least one qualifying device to collect their device usage and internet activity.
  • Have agreed to the RealityMine Privacy Policy and Terms & Conditions.

For households without internet service prior to joining KnowledgePanel, Ipsos provides web-enabled devices and free internet service. KnowledgePanel members from these households are not invited to join KnowledgePanel Digital.

Weighting

The data was weighted in a process that accounts for multiple stages of sampling and nonresponse that occur at different points in the panel survey process. First, each panelist begins with a base weight that reflects both their probability of recruitment into the panel and their probability of selection for this survey. Base weights for this study were provided by Ipsos. Next, these weights were calibrated to align with the population benchmarks in the accompanying table and trimmed at the 1st and 99th percentiles to reduce the loss in precision stemming from variance in the weights. Sampling errors and tests of statistical significance take into account the effect of weighting.

A table showing Weighting dimensions

The following table shows the unweighted sample sizes and the error attributable to sampling that would be expected at the 95% level of confidence for different groups in the survey.

A table showing Sample sizes and margins of error

Data collection and processing

Before collecting webpage content, we processed the dataset of URLs to prevent errors in the scraping process and avoid potentially harmful or explicit websites.

To ensure that each panelist meaningfully viewed each page in the dataset, we removed URL visits where the duration of the visit was zero seconds. We also combined “duplicate” records where a panelist visited the same URL twice within a one-second window.

 For each URL in the dataset, we extracted a domain (for example, the URL “https://www.facebook.com/login” has the domain “facebook.com”). Based on the URLs and their domains, we filtered out certain categories of webpages that we were not interested in scraping. We excluded the following URL categories from this scraping process:

  • Known malware. To identify malicious websites that are known to be used for malware distribution, we used a freely available list of malware URLs. The list is compiled by URLhaus, a platform that tracks malware URLs and shares them with security providers.
  • Adult content. Adult content websites were categorized as websites whose main purpose is to host pornographic or highly sexualized content, including file-sharing sites that are regularly used for pornography distribution. To identify adult websites, we used an open-source URL list that is maintained primarily for network administrators at schools and workplaces.
  • Popular productivity tools that require a login. Personal productivity tools like email and calendar apps were excluded, since the content we would be scraping was almost certain not to reflect the page content the panelist had viewed.

We removed these types of pages from the web scraping pipeline and did not check them for AI keywords. However, these pages were still included in each respondent’s total visit count.

Scraping webpage content

After preprocessing the URLs from the web browsing activity logs to remove the pages which matched the criteria above, we used a web scraping pipeline based on the Requests Python library to retrieve all HTML and any metadata from each page. We collected this content April 7-11, 2025.

In some cases, the attempt to access a URL would time out or our servers otherwise failed to connect to the given web address. This includes those that returned HTTP error codes, such as 404 (Not Found) or 403 (Forbidden) errors. These pages were also not included in the AI keyword analysis but included in each respondent’s total visit count.

There were a handful of pages that returned HTTP 403 (Forbidden) errors that we determined were important to try to retrieve for the analysis, including a number of AI chatbots (such as Google Gemini and Grok) and pages from Reddit. These URLs were backfilled April 16-17, 2025, using manual HTML download and the Reddit Data API, as applicable.

URLs that did not resolve or redirect to an existing webpage, according to a list of valid top level domains, could not be verified as a valid visit to a page and were excluded from respondents’ total visit count (n=2,402 URLs).

In addition to pages noted above that we did not scrape, the following types of content may have been visible to respondents but would have been missed by the automated scraping process:

  • Non-text content like images, audio or video
  • Content that loads dynamically using JavaScript
  • Content hidden behind a paywall or login screen

Google search result pages are not retrievable through most traditional web scraping methods, including the one we implemented in this analysis. Because of this, we collected the content of Google search result pages using a third-party web scraping service.

After all data collection and preprocessing was complete, the total number of distinct URLs we were able to retrieve page content for and analyze for AI term matches was 965,136. A total of 2,457,176 page visits to 1,107,424 URLs were analyzed in the report. This includes 142,288 URLs that we were not able to analyze for AI term matches but included in the analysis of respondents’ total page visits.

Preprocessing

To prepare the page content data (n=965,136) for analysis, we removed HTML tags, JavaScript and other code from the scraped webpage data using BeautifulSoup. By doing this, we ensured the page content that was analyzed was as close as possible to the content that panelists would have viewed when they visited the webpage.

Any scraped webpages that contained excessively large amounts of data (greater than 1 gigabyte or 128,000 tokens) were also excluded from the web content dataset. These websites often contained video data or other non-text content.

AI keyword matching

To identify AI content in the webpage data, we started by compiling a list of AI-related terms (read the Appendix). These terms included technical AI terminology, as well as the names of various generative AI tools and well-known AI companies. Keywords were selected based on their specificity and their prevalence in current discussions of AI. To avoid returning matches that were not truly related to AI, we filter this list to only terms that are exclusively (or nearly exclusively) used to describe AI-related concepts.

All webpages that contained text matching at least one keyword in the list were flagged as a page that mentions AI. Of the roughly 1.1 million distinct pages viewed by panelists over a period of one month, around 6% of them (71,144 in total) mention an AI keyword.

AI relevance classification

While keyword matching was used to identify pages with any mention of AI-related terms and concepts somewhere on the page, these webpages are extremely diverse in their focus on the topic. Some make only the briefest mention to AI or AI-related topics – such as a reference to an AI tool in a sidebar or footer. Others contain meaningful or substantive discussion of AI or are themselves AI-centric tools or services.

For this analysis, we used a logistic regression classifier to identify pages that mention AI in a substantive context. Common examples of “substantive mentions” might include pages such as:

  • An AI chatbot interface, AI image generation site, AI-focused consulting service or other online tool that makes AI a central focus of its functionality
  • A shopping website that prominently features AI functionality in its product descriptions
  • An article or story that is about (or extensively discusses) AI or AI-related topics

The classifier was trained on a dataset of 509 webpages labeled as containing either a substantive AI mention or a minor AI mention. Two human annotators generated the labels and achieved high inter-rater reliability on the training dataset (Cohen’s kappa = 0.877). The classifier took five input variables as features: the number of AI keyword matches, number of AI keyword matches that were not “AI,” number of AI keyword matches in the website’s title, number of AI keyword matches in the website’s description, and the proportion of all words on the page that are AI keywords.

After training the classifier, model performance was measured using a separate evaluation dataset of 400 labeled webpages. For this dataset, human annotators again achieved high inter-rater reliability (Cohen’s kappa = 0.837). When applied to the evaluation dataset, the classifier achieved an F1 score of 0.829. The model was highly reliable for identifying substantive AI mentions but was somewhat prone to false positives; the classifier’s performance on the evaluation data gave a recall score of 0.970 and a precision score of 0.724. These metrics indicate that the model was sufficient for use in this report.

An illustration showing that A page that mentions AI versus one that discusses it in more depth

Domain and page content analysis

In this report, we examine how respondents encountered mentions of AI on different types of websites. Here is how we created the different site categories for that analysis.

News websites include 2,317 domains categorized as “News/Information” by the measurement and audience metrics company Comscore.

Shopping websites include a list of 18 major shopping and e-commerce domains developed by the authors of this study. Researchers consulted two lists of popular shopping and e-commerce domains (via Statista and Semrush) and cross-referenced them against respondents’ most-visited domains. Our final list includes aliexpress.com, amazon.com, apple.com, bestbuy.com, chewy.com, costco.com, craigslist.com, ebay.com, etsy.com, homedepot.com, kohls.com, macys.com, mecari.com, target.com, temu.com, ticketmaster.com, walmart.com, and wayfair.com.

Search websites include google.com/search, bing.com/search, duckduckgo.com and search.yahoo.com. Analyses of AI-generated summaries include only pages from google.com/search.

Social media websites include Meta sites (facebook.com, instagram.com, threads.net, whatsapp.com), youtube.com, tiktok.com, bsky.app, pinterest.com, linkedin.com, x.com and reddit.com.

Visits to generative AI tools include those from OpenAI (openai.com, chatgpt.com, chat.openai.com, openai.com/chatgpt, openai.com/dall-e-2), Microsoft (copilot.microsoft.com, bing.com/chat), Google (bard.google.com, gemini.google.com), claude.ai, midjourney.com, perplexity.ai and character.ai.

  1. AAPOR Task Force on Address-based Sampling. 2016. “AAPOR Report: Address-based Sampling.”
Icon for promotion number 1

Sign up for our weekly newsletter

Fresh data delivered Saturday mornings

Icon for promotion number 1

Sign up for The Briefing

Weekly updates on the world of news & information

Report Materials

Table of Contents

Table of Contents