This section discusses in greater detail some of the major datasets identified by the speakers and their limitations. Problems with these datasets fall into two principal categories: inappropriate and inconsistent definitions; and limitations, bias, and error arising from multiple sources. These problems lead to the misuse of terms, affect the coverage, availability, and reliability of the data, and hence potentially undermine subsequent analyses.
All of the speakers noted problems with the definition of broadband. Those who used zip codes in their analysis also identified inconsistencies between the FCC and the Census Bureau in the definition of zip codes. Other speakers, notably Chaudhuri and Gabel, pointed to inconsistencies in definitions in data sets that are used less widely.
Under Section 706 of the Telecommunications Act of 1996, the FCC collects standardized information from qualified broadband providers on whether they have one or more lines in service in each zip code.24 Providers supply the information via Form 477 twice a year, and the agency publishes a report and, until recently, made the aggregated data available in downloadable Excel files. More recently, the FCC has made the data available only as Acrobat PDF files, which are much more difficult to integrate into statistical analysis packages. The FCC reports the number of carriers with one or more lines in service in each zip code that has at least one provider, but if the number of providers is three or fewer, the FCC reports an asterisk rather than the actual count. Because most communities have at most two facilities-based providers (copper telephone lines supporting DSL service and coaxial television cables supporting cable modem service), this mode of reporting severely limits the usefulness of the FCC data for analyzing the extent of competition in local markets. The only data the FCC reports on the number of lines in service are at the aggregate state level, limiting the ability to use the data to study the effect of differing penetration rates by community.
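The disclosure rule just described can be sketched as follows. This is a hypothetical illustration; the function name and the exact masking threshold are assumptions based on the text, not the FCC's actual Form 477 publication logic:

```python
def report_provider_count(n_providers):
    """Sketch of the FCC-style disclosure rule: counts are published
    only for zip codes with more than three providers; small positive
    counts are masked with an asterisk (assumed threshold of 1-3)."""
    if n_providers == 0:
        return None          # pre-2005, zero-provider zips were simply absent
    if n_providers <= 3:
        return "*"           # count suppressed
    return str(n_providers)  # actual count disclosed
```

Because most communities have at most two facilities-based providers, the masked cell is the common case, which is exactly why the rule limits competition analysis.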
It is true that individual states also collect data, some of it more granular than the FCC’s. For example, Gabel pointed to data in Vermont that show where cable networks are deployed throughout the state, arguing that investigators should use state as well as federal data. Unfortunately, coverage is inconsistent from state to state, and in contrast to Vermont, one audience participant said data for the state of Pennsylvania are either “outdated” or represent “projections for where they want to get to.” Since the workshop was held, the ConnectKentucky initiative (http://www.connectkentucky.org/) has gained widespread currency as a model for state mapping and data collection. At the workshop, Brian Mefford of ConnectKentucky talked about this public-private partnership to identify gaps and encourage infrastructure build-out in Kentucky.25
The FCC did not initially employ the term broadband in its documents. Instead, it defined two levels of service: high speed lines, meaning lines or wireless channels capable of transmitting at rates greater than or equal to 200Kbps in one direction; and advanced services lines, meaning lines or wireless channels capable of transmitting at rates greater than or equal to 200Kbps in both directions. The definition of transmission speed dates to the FCC’s first semi-annual report, published in January 1999.26 At that time, the 200Kbps metric was approximately four times the speed of the typical dial-up connection of 50Kbps and was slightly faster than the 128Kbps rate of ISDN services, thereby ensuring that ISDN services would not be counted as broadband services.
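The two service levels can be expressed as a simple classification rule. The sketch below is illustrative only; the function name is an assumption, and it returns just the more specific label, whereas in FCC reporting advanced services lines are also counted among high speed lines:

```python
def classify_line(downstream_kbps, upstream_kbps):
    """Classify a line under the FCC's original 200Kbps definitions
    (illustrative sketch; labels follow the text above)."""
    if downstream_kbps >= 200 and upstream_kbps >= 200:
        return "advanced services"   # >= 200Kbps in both directions
    if downstream_kbps >= 200 or upstream_kbps >= 200:
        return "high speed"          # >= 200Kbps in one direction
    return "below threshold"         # e.g. dial-up (~50Kbps) or ISDN (128Kbps)
```

Note that the 128Kbps ISDN case falls below the threshold in both directions, which is the exclusion the 200Kbps figure was designed to achieve.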
Prior to 2004, providers with fewer than 250 high speed lines or wireless channels in a given state were exempt from the reporting requirement, potentially under-representing rural areas with low population densities. Beginning in 2005, providers below the former threshold were obligated to submit Form 477 information. Given the reporting and publication lag, this produced a substantial one-time jump in the number of holding companies and unaffiliated entities reported as providing broadband for the period December 31, 2004 to June 30, 2005. Improving the granularity of coverage is a welcome development, but it inhibits longitudinal use of the data, since generalizations about sparsely covered areas are clearly suspect for the period prior to 2004, while conclusions about areas with better coverage may well be overstated. Moreover, as Sharon Gillett and her colleagues observed, over half of the zip codes in their panel study already had broadband by 1999, so the scope of the data collection precludes investigating, at least through this source, the places where broadband first became available.
The chronological scope of availability prior to 1999 is an artifact of the program, and investigators must seek other sources of information for deployment before that time. However, other dimensions of the data collection effort, most importantly the threshold transmission rates, can be adjusted to reflect changing realities. In 2006, the Commission collected more finely grained information about services offered in excess of 200Kbps.19 Not surprisingly, almost 60 percent of the high speed lines fell into the category of greater than or equal to 2.5Mbps and less than 10Mbps, and just under 5 percent had transfer rates of 10Mbps or more. As a number of speakers noted, efforts to refine the definition of broadband to reflect the changing nature of broadband services and the availability of ever-higher data rates are long overdue. Indeed, FCC Chairman Martin announced in his testimony before the Committee on Energy and Commerce of the U.S. House of Representatives that, in the Fifth Inquiry, the Commission seeks comment on whether the term “advanced services” should be redefined to require higher minimum speeds.27
In his testimony, Chairman Martin also cited a proposal put forward in September 2006 to improve data collection by examining specific geographic areas and by integrating FCC data with data collected by states and other public sources. Workshop participants acknowledged the importance of integrating data from state and federal sources, but more forcefully drove home the problems with using zip codes.
Both the FCC and the Census Bureau use the zip code as a unit of analysis but they define it differently, creating problems when researchers seek to merge data sets. The Census Bureau has created new statistical entities called “zip code tabulation areas” (ZCTAs) which represent the generalized boundaries of US Postal Service zip code service areas; these are not equivalent to the older zip codes, and it is clear that the FCC is not using the ZCTAs. Moreover, not all zip codes denote spatial units although they are widely believed to do so. Rather, zip codes reflect their origins in the postal service and are actually linear features, corresponding to mailing addresses and streets in USPS service areas. Finally, zip codes are thought to denote relatively small spatial units, at least in comparison with states and counties.
Zip codes do represent relatively compact areas in urban environments, Grubesic found, but not in exurban or rural areas. Flamm agreed that zip codes are actually fairly coarse and can measure relatively large territories. In addition, Flamm noted that the FCC uses a publicly undocumented and proprietary database, which the agency purchases from a provider. The provider, however, uses zip code mapping software that adds and drops zip codes relatively rapidly, further complicating the ability to figure out what territory corresponds to what code over time.28
A second set of problems arises from the way that the FCC collects, reports, and aggregates – or does not aggregate – data within zip codes. First, for data collected before June 2005, no zeros are recorded, Flamm found, so identifying zip code areas without any service requires subtracting all of the zip codes where broadband service is provided from the universe of zip codes used by the FCC for this period. Unfortunately, the FCC has chosen not to publish a list of the universe of zip codes used in its classifications. Like the expansion to include smaller providers, though, this welcome change in data collection inhibits longitudinal studies.
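The subtraction Flamm describes amounts to a set difference. A minimal sketch, with hypothetical zip codes standing in for the unpublished FCC universe:

```python
# Hypothetical stand-ins: the FCC's universe of zip codes (unpublished
# in practice, which is the core difficulty) and the zips with at least
# one reported broadband provider.
universe_zips = {"78731", "78732", "40601", "05602", "59011"}
served_zips = {"78731", "78732", "40601"}

# Zip codes with no service are those in the universe but not served.
unserved_zips = universe_zips - served_zips
```

The computation is trivial; the obstacle is that without a published universe list, `universe_zips` cannot be reconstructed reliably for the pre-2005 period.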
Second, zip codes where one to three providers exist are reported only at the categorical level. Actual counts are provided for zip codes with more than three providers, in effect masking under-provisioning while detailing better resourced areas where the local market is presumably more competitive. Prieger echoed this point, observing that researchers cannot say anything about monopoly vs. duopoly vs. oligopoly in areas served by one to three providers.
Third, Flamm argued, the cartographic presentation can be misleading, especially in areas where zip codes are large, which are generally the rural areas, since the value, whether categorical or an actual count, is then mapped over the entire expanse.
Thus, three factors converge to potentially overestimate coverage, or to give the appearance of overestimating coverage, in what may actually be under-resourced areas: the absence of the zero measurement prior to 2005, the categorical representation of the service areas in which 1-3 providers are present, and the coarse geographic measure associated with zip codes in rural areas. Flamm offers two examples:
In the 78731 zip code in Austin, Texas, where he resides – a city that is generally considered a center of innovation – the FCC statistics indicate 24 broadband service providers. “After an exhaustive search for alternatives, however, I know that there is only one physical, local broadband service provider available within my immediate neighborhood.”29 Residents of affluent Fairfax County in Northern Virginia had precisely the same experience of limited availability of connectivity to the residential neighborhoods at the same time that it contained one of the backbone hubs.
More generally, he cited research done by the GAO to assess the number of competitive entities providing service nationally. The GAO’s study showed that the median number of providers actually serving households at the zip code level was 2, not the 8 suggested by the FCC data. In Kentucky, for example, where the GAO’s calculation based on FCC information showed that service was available to 96 percent of the state’s population, ConnectKentucky did a more sophisticated analysis and found that only 77 percent of the state’s households had access to broadband service.
Flamm also took issue with the way providers are defined and identified. The FCC defines providers as “facilities based providers” of high-speed services to end user locations anywhere in a state in which they own hardware facilities. Such entities, he points out, can be service providers rather than actual infrastructure providers, or “hardware pipe provisioners,” in some local markets but not in others within a given state. It is unclear whether such mixed providers of broadband service distinguish carefully between zip codes in which they own some of the hardware used to provide the service and zip codes in which their service is branded and resold by the actual hardware pipe provider. If multiple providers “brand the same pipe,” the apparent level of competition is inflated. Further, since the identities of providers are not known, Prieger added, nothing can be said about industry dynamics (entry and exit), the impact of multi-market contact, or intermodal competition (for example, cable vs. DSL). In addition, the FCC identifies the location of service by the location to which the bill is delivered. If the bill goes to one place (say, a post office box) and the line is actually installed elsewhere (say, at a home), then the company reports the zip code of the post office box, not the home, to the FCC. Thus, the FCC data measure the zip codes in which broadband recipients are billed, not necessarily where the service is available or actually consumed.
Available data and their limitations, bias, and error
The data assembled by the FCC is intended to assess competition by looking at what providers have deployed. It is not geared to evaluate quality of service, performance at the desktop, or even penetration. Even the information it provides on availability is imperfect. Penetration rates, which are necessary for analyses of regional economic impacts, are usually developed by layering demographic data, typically from the Census Bureau, over geographic data. That brings back the problem of inconsistent zip code definitions, which inhibits merging the two data sets. Flamm has devised a strategy for reconciling the discrepancies between the two zip code definitions but acknowledged that the resulting sample under-represents remote, sparsely populated rural areas. The strategy involves limiting the analysis to the intersection of zip codes that appear in both the FCC and Census Bureau pools. Two types of zip codes, “geo” zip codes and “point” zip codes, are assigned and their boundaries laid out, allowing for the fact that the spatial units associated with the two schemes will probably not coincide perfectly but will cover a similar area. On this basis, he has been able to construct a database that lets him examine spatial, demographic, and topographic as well as economic variables.30
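The intersection strategy can be sketched in outline. The zip codes, field names, and values below are hypothetical stand-ins for the actual FCC and Census Bureau files, not Flamm's implementation:

```python
# Hypothetical FCC records (providers per zip) and Census records
# (population per zip), keyed by zip code.
fcc = {"78731": {"providers": 24}, "40601": {"providers": 4},
       "05602": {"providers": 2}}
census = {"78731": {"population": 26000}, "40601": {"population": 28000},
          "90210": {"population": 21000}}

# Keep only zip codes present in BOTH pools, merging their attributes.
# Zips appearing in just one source drop out, which is how remote rural
# areas come to be under-represented in the resulting sample.
merged = {z: {**fcc[z], **census[z]} for z in fcc.keys() & census.keys()}
```

The drop-out behavior is the acknowledged cost of the approach: the merged database is internally consistent, but its coverage is the intersection, not the union, of the two sources.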
In addition to the FCC, workshop participants described several other sources of federal data, namely the Bureau of Labor Statistics, the Bureau of Economic Analysis, and the Census Bureau. The Census Bureau conducts the decennial census as well as more focused studies at more frequent intervals, notably the Current Population Survey, the American Community Survey, and the Current Employment Statistics survey. The advantages of using these collections are long runs of data, broad coverage at a national scale, quantity and variety, and the professionalism and prestige of the federal statistical agencies. The Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS) are two major sources of economic and industry-related data. The E-Stats program is comparatively recent and employs separate Census Bureau surveys to compile information on economic and e-commerce activities. In general, these agencies are slow to adopt methodological changes, but they do adjust the categories of information they collect on special topics.
Triplett co-authored a paper with his colleague Barry Bosworth at the Brookings Institution in 2003 in which they detailed some of the recent progress in data collection at these agencies, as well as some issues that still remained, notably inconsistent data sources that affect measures of productivity.31 Greenstein notes that the BEA has recently begun publishing information on investment levels in information and communications technologies by industry and has included wages and salaries for workers in some locales. However, these studies address information technologies at a broad level; the internet and other components are not isolated in the research. Somewhat bravely, the Census Bureau attempted a survey of software used by firms for the year 2003, Greenstein comments, but the task proved monumental and the results were inconclusive. The design called for surveying firms, not individual establishments, and the practical issues were daunting, starting with deciding whom to contact and whether the needed information, presumably in the form of an inventory of software tools, even existed.
In general, researchers have trouble finding data at a suitably granular level. This is a problem, for example, that affects studies of firms, salaries and wages, and pricing. One solution is to use private sources of information, but these are expensive and can be limited by issues of confidentiality and proprietorship. Greenstein, Forman, and others have made extensive use of data supplied by the business intelligence and marketing company Harte-Hanks. Not only are the datasets expensive, but they naturally reflect the interests of the company’s clients, who have paid for the initial surveys. Thus, the content of the data files is geared toward marketing and not necessarily toward the questions that investigators may have. In a subsequent e-mail exchange, Greenstein offered several illustrations:
- The company currently archives its data but has no plans for donating it to a public (or even private) curatorial facility, potentially inhibiting the development of longitudinal studies of change over time. Moreover, digital data is fragile and requires active management. A corporate policy of benign neglect, which would not necessarily destroy physical records, can render digital data unusable, effectively destroying it.
- The company focuses on coverage of potential sales targets. This strategy overlaps with statisticians’ need to capture the biggest users, but it also means the data will be deficient in some systematic ways. Greenstein and his colleagues have compared the data against county business patterns and found that the coverage of small firms is rather inadequate. The reasons are obvious: unlike a government agency concerned with completeness, a commercial firm does no systematic over-sampling of under-represented populations. (Sampling issues matter to statisticians but not to most clients.)
- Harte-Hanks data provide a measure of the number of programmers, but they do not provide detail on the composition of the computer workforce – their quality, education, experience, wages, or degree of autonomy. Without such measures, it is not possible to get at the core issues about whether the degree of decentralization is changing over time, whether computing is affiliated with biased technical change, and so on.
For demographic research, the key federal data source is the Current Population Survey, a monthly household survey conducted by the Bureau of the Census for the Bureau of Labor Statistics, which included questions about computer and internet use in households between 1994 and 2003. This data collection effort was the basis for the series of reports by the National Telecommunications and Information Administration (NTIA) on computer use beginning in 1994; internet use was added in 1997. The program ended after 2003, but the data can still be downloaded from the Bureau of Labor Statistics’ website: http://www.bls.census.gov/cps/computer/computer.htm.
Federal surveys of computer and internet access and use provide baseline information according to basic demographic characteristics (age, location, educational achievement, and income). Little is known from these federal surveys about behavior, choices and motivation, or even what people pay for broadband service, although many of these topics are explored in surveys conducted by the Pew Internet Project. Social scientists typically augment these large, primarily descriptive studies with information from private sources that they either purchase or create through new data collection efforts. In some cases, though, the data may contain outright errors, as Strover discovered in her use of GIS data obtained from private sources. It also may be, as Flamm said, that the methodology for collecting and categorizing data is not documented.
Compared with the national statistical efforts, academic studies are focused but small, so that the breadth of the national surveys is balanced by the depth of the academic surveys. The kinds of databases that Flamm, Grubesic, and Goldfarb describe are time consuming and expensive to build and tend to be geographically restricted, so detailed work is needed to resolve discrepancies and establish correct linkages. Computer scientists call this “registering” the data, a term for the techniques required to achieve valid integration of disparate data sets. The problem is endemic to the use of multiple quantitative datasets, which, when assembled, can yield wonderful information but are in themselves more heterogeneous than they appear. Decisions like Flamm’s resolution of zip codes are always part of the research process, so documenting the management of the data becomes an integral component of presenting results, since the process of integrating the datasets may well introduce a bias, as Flamm and others readily acknowledge.
The work done by Strover and her students Fuentes-Bautista and Inagaki reflects a range of survey designs, sampling techniques, and methods of information capture, including telephone interviews, face-to-face interviews, and mailed surveys. Sometimes formal sampling is not possible, as shown in the research done by Fuentes-Bautista and Inagaki. Their project focuses on an intrinsically self-organizing population, patrons of commercial establishments that offer WiFi, making the sample opportunistic, what statisticians call a “non-probability sample.” These mainly qualitative research methods produced nuanced portraits of special populations but, by their very nature, do not permit generalizations about the general population.
These projects all have methodological limitations: low response rates, strained interpersonal dynamics, concerns about personal privacy and inappropriate disclosure of information, suspicion, and reliability. Fuentes-Bautista and Inagaki attempted to correct some of the bias in their design by creating a database of available hotspots, but they could not find a “comprehensible” list of them. Strover explained ways in which she and her colleagues remedy low response rates, particularly from groups that tend to be under-represented; follow-up calls, mailed as well as door-to-door surveys, and the use of bilingual interviewers are among the methods. Still, self-reporting, whether by phone or in person, can be biased by a number of factors, as Strover and her colleagues documented in their cross-cultural research, including language, gender, age, and perception of status. Roles also introduce constraints between otherwise similar interviewers and interviewees; corporate representatives may be reluctant to provide information because it is confidential or proprietary, or, as the Census Bureau found in its ambitious but ill-fated attempt to survey companies about software use, the individuals answering the question simply may not know the answer.32
Goldfarb points out that clickstream data, which is collected at the device, may provide a useful way to offset questions about the reliability of self-reported information. This type of research has a relatively long history in the computer science and information science communities, which have collectively devised multiple experiments combining analysis of computer log files with real-time observation and interviews. Log files, or the more granular clickstream data, show what people actually do online. “It’s not always pretty,” Goldfarb said, but “it provides rich detail.” He also acknowledged that clickstream data is hard to use: it entails more manipulation, known as “data cleaning,” and analyzing the “cleaned” data requires more statistical sophistication.
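The cleaning step can be illustrated with a minimal sketch. The record format, field layout, and sample rows here are hypothetical, not those of any actual clickstream dataset:

```python
# Hypothetical raw clickstream records: user id, timestamp, URL,
# with one malformed row of the kind cleaning must discard.
raw_clicks = [
    "u1,2007-03-01T10:00:00,http://example.com/",
    "u1,2007-03-01T10:00:05,http://example.com/news",
    "u2,2007-03-01T11:30:00,http://example.org/",
    "corrupted-row-with-no-fields",
]

# Collapse the raw records into per-user page-view counts,
# dropping rows that do not parse into the expected three fields.
page_views = {}
for row in raw_clicks:
    fields = row.split(",")
    if len(fields) != 3:          # discard malformed records
        continue
    user, _timestamp, _url = fields
    page_views[user] = page_views.get(user, 0) + 1
```

Even this toy version shows why the data demands manipulation before analysis: malformed records, identity resolution, and aggregation choices all intervene between the raw log and any statistic.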
As computer and information scientists have learned, though, collecting this kind of information sparks concerns about personal privacy. Formal experiments in laboratory settings are bounded by strict policies governing data management and use, and the same concerns about sample size and composition can arise in small, laboratory-based projects. Corporate data collections are therefore appealing because of their scale and scope. According to Goldfarb, two private companies, comScore and Nielsen//NetRatings, collect this data on home usage, but neither shares the raw data with researchers. Concerns about privacy and disclosure are not attached solely to personal information. As Fuentes-Bautista and Inagaki discovered and Wallsten reiterated, companies are careful about the information they provide, particularly if it reflects or might reflect upon human, corporate, or computational performance. This enhances the significance of the data on internet traffic that CAIDA collects.