Investigating Alternative Data Sources to Reduce Respondent Burden in United States Census Bureau Retail Economic Data Products
Rebecca J. Hutchinson, U.S. Census Bureau
From Transactions Data to Economic Statistics: Constructing Real-Time, High-Frequency, Geographic Measures of Consumer Spending
Aditya Aladangady, Federal Reserve Board
Shifrah Aron-Dine, Stanford University
Wendy Dunn, Federal Reserve Board
Laura Feiveson, Federal Reserve Board
Paul Lengermann, Federal Reserve Board
Claudia Sahm, Federal Reserve Board
Chris Wheat, JP Morgan Chase Institute

Data on consumer spending are important for tracking economic activity and informing economic policymakers in real time. Aladangady, Aron-Dine, Dunn, Feiveson, Lengermann, and Sahm describe construction of a new data set on consumer spending. They transform anonymized card transactions from a large payment technology company into daily, geographic estimates of spending that are available only a few days after the spending occurred. The Census Bureau's monthly survey of retail sales is a primary source for monitoring the cyclical position of the economy, but it is a national statistic that is not well suited to studying localized or short-lived shocks. Moreover, lags in the release of the survey and subsequent -- sometimes large -- revisions can diminish its usefulness for policymakers. Expanding the official survey to include more detail and faster publication would be expensive and add substantially to respondent burden. The approach helps fill these information gaps by using data on consumer spending with credit and debit cards and other electronic payments from a private company. The researchers' daily series are available from 2010 to the present and can be aggregated to generate national, monthly growth rates similar to official Census statistics. As an application of the new, higher-frequency, geographic information in the researchers' data set, they quantify in real time the effects on spending of Hurricanes Harvey and Irma.
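The aggregation step mentioned in the abstract, from daily series to monthly growth rates, can be illustrated with a minimal sketch. This is not the researchers' procedure: the function name `monthly_growth` and the input layout (a dictionary of dates to spending amounts) are assumptions made for illustration, and the sketch ignores seasonal adjustment, differences in the number of days per month, and gaps between months.

```python
from collections import defaultdict
from datetime import date


def monthly_growth(daily_spending):
    """Aggregate daily spending to monthly totals and compute growth rates.

    daily_spending: dict mapping datetime.date -> spending amount
                    (hypothetical input layout, not the researchers' format).
    Returns a dict mapping (year, month) -> percent change from the
    previous month present in the data.
    """
    # Sum daily amounts into calendar-month totals.
    totals = defaultdict(float)
    for day, amount in daily_spending.items():
        totals[(day.year, day.month)] += amount

    # Compare each month with the month that precedes it in sorted order.
    months = sorted(totals)
    growth = {}
    for prev, curr in zip(months, months[1:]):
        growth[curr] = 100.0 * (totals[curr] / totals[prev] - 1.0)
    return growth


# Made-up figures: two days of spending in each of two months.
example = {
    date(2017, 7, 1): 100.0,
    date(2017, 7, 2): 100.0,
    date(2017, 8, 1): 110.0,
    date(2017, 8, 2): 110.0,
}
example_growth = monthly_growth(example)
```

The same grouping could be applied per state or metropolitan area to obtain geographic series, provided each transaction carries a location key.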

In addition to the conference paper, the research was distributed as NBER Working Paper w26253, which may be a more recent version.

Off to the Races: A Comparison of Machine Learning and Alternative Data for Predicting Economic Indicators
Andrea Batch, Bureau of Economic Analysis
Jeffrey C. Chen, Bureau of Economic Analysis
Alexander Driessen, Bureau of Economic Analysis
Abe Dunn, Bureau of Economic Analysis
Kyle K. Hood, Bureau of Economic Analysis
Francis X. Diebold, University of Pennsylvania and NBER
A Machine Learning Analysis of Seasonal and Cyclical Sales in Weekly Scanner Data
Rishab Guha, Harvard University
Serena Ng, Columbia University and NBER
Gary Cornwall, Bureau of Economic Analysis

Guha and Ng analyze weekly scanner data collected for 108 groups at the county level between 2006 and 2014. The data display multi-dimensional weekly seasonal effects that are not exactly periodic but are cross-sectionally dependent. Existing univariate procedures are imperfect and yield adjusted series that continue to display strong seasonality upon aggregation. The researchers suggest augmenting the univariate adjustments with a panel data step that pools information across counties. Machine learning tools are then used to remove the within-year seasonal variations. A demand analysis of the adjusted budget shares finds three factors: one that is trending, and two cyclical ones that are well aligned with the level and change in consumer confidence. The effects of the Great Recession vary across locations and product groups, with consumers substituting towards home cooking away from non-essential goods. The data are thus informative about local and aggregate economic conditions once the seasonal effects are removed. The two-step methodology can be adapted to remove other types of nuisance variations provided that these variations are cross-sectionally dependent.

In addition to the conference paper, the research was distributed as NBER Working Paper w25899, which may be a more recent version.

Estimating the Benefits of New Products
W. Erwin Diewert, University of British Columbia and NBER
Robert C. Feenstra, University of California, Davis and NBER
Marshall B. Reinsdorf, International Monetary Fund

A major challenge facing statistical agencies is the problem of adjusting price and quantity indexes for changes in the availability of commodities. This problem arises in the scanner data context as products in a commodity stratum appear and disappear in retail outlets. Hicks suggested a reservation price methodology for dealing with this problem in the context of the economic approach to index number theory. Feenstra and Hausman suggested specific methods for implementing the Hicksian approach. Diewert and Feenstra evaluate these approaches and recommend taking one-half of the constant-elasticity gains computed as in Feenstra, which under weak conditions will be above but reasonably close to the gains obtained from a linear approximation to the demand curve as proposed by Hausman. The researchers compare the CES gains to those obtained using a quadratic utility function. The various approaches are implemented using some scanner data on frozen juice products that are available online.

In addition to the conference paper, the research was distributed as NBER Working Paper w25991, which may be a more recent version.

Transforming Naturally Occurring Text Data into Economic Statistics: The Case of Online Job Vacancy Postings
David Copple, Bank of England
Bradley J. Speigner, Bank of England
Arthur Turrell, Bank of England
Ayşegül Şahin, University of Texas at Austin and NBER

Copple, Speigner, and Turrell combine both official and naturally occurring data to get a more detailed view of the UK labor market which includes heterogeneity by both region and occupation. The novel, naturally occurring data are 15 million job vacancy adverts as posted by firms on one of the UK’s leading recruitment websites. The researchers map this messy online data into official classifications of sector, region, and occupation. The recruitment firm’s own unofficial job sector field is mapped manually into the official sectoral classification and, making use of official vacancy statistics by sector, is used to reweight the data to reduce bias. The researchers map the latitude and longitude of each vacancy directly into regions. In order to match up to official statistics organised by standard occupational classification (SOC) codes, the researchers develop an unsupervised machine learning algorithm which takes the text data associated with each job vacancy and maps it into SOC codes. The algorithm makes use of all text associated with a job, including the job description, and could be used in a range of other situations in which text must be mapped to official classifications. The researchers plan to make the algorithm available as a Python package via GitHub. Used in combination with official statistics, these data allow us to examine the weak UK productivity and output growth which have been enduring features of the post-crisis period. Labor market mismatch between the unemployed and job vacancies has previously been implicated as one driver of the UK's productivity ‘puzzle’ (Patterson, Christina, et al. "Working hard in the wrong place: A mismatch-based explanation to the UK productivity puzzle." European Economic Review 84 (2016): 42-56.). Using the fully labelled dataset, the researchers examine the extent to which unwinding occupational and regional mismatch would have boosted productivity and output growth in the post-crisis period. 
The effects of mismatch on output are driven by dispersion in productivity, tightness, and matching efficiency (for which the researchers provide new estimates). The researchers show evidence of significant dispersion of these across sub-markets, with the aggregate data hiding important heterogeneity. Contrary to previous work, the researchers find that unwinding occupational mismatch would have had a weak effect on growth in the post-crisis period. However, unwinding regional mismatch would have substantially boosted output and productivity growth relative to the actual path, bringing it in line with the pre-crisis trend. The researchers demonstrate how naturally occurring data can be a powerful complement to official statistics.
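The sector reweighting described in the abstract can be sketched in a simple form. The following is a post-stratification illustration under stated assumptions, not the researchers' implementation: the function name `sector_weights`, the sector labels, and the counts are hypothetical. Each sector's weight is the ratio of its share in the official vacancy statistics to its share in the online sample, so that over-represented sectors are scaled down and under-represented sectors are scaled up.

```python
def sector_weights(sample_counts, official_counts):
    """Compute per-sector weights that align sample shares with official shares.

    sample_counts:   dict mapping sector -> number of online adverts (hypothetical)
    official_counts: dict mapping sector -> official vacancy count (hypothetical)
    Returns a dict mapping sector -> weight to apply to each advert in that sector.
    """
    sample_total = sum(sample_counts.values())
    official_total = sum(official_counts.values())

    weights = {}
    for sector, n in sample_counts.items():
        sample_share = n / sample_total
        official_share = official_counts[sector] / official_total
        # Weight above 1 means the sector is under-represented online.
        weights[sector] = official_share / sample_share
    return weights


# Made-up counts: the online sample over-represents the "IT" sector.
sample = {"IT": 60, "Health": 40}
official = {"IT": 300, "Health": 700}
weights = sector_weights(sample, official)
```

After weighting, the weighted sector shares in the sample equal the official shares by construction; any bias within a sector is left untouched.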

Re-Engineering Key National Economic Indicators
Gabriel Ehrlich, University of Michigan
John C. Haltiwanger, University of Maryland and NBER
Ron S. Jarmin, U.S. Census Bureau
David Johnson, University of Michigan
Matthew D. Shapiro, University of Michigan and NBER
Robert C. Feenstra, University of California, Davis and NBER

Traditional methods of collecting data from businesses and households face increasing challenges. These include declining response rates to surveys, increasing costs of traditional modes of data collection, and the difficulty of keeping pace with rapid changes in the economy. The digitization of virtually all market transactions offers the potential for re-engineering key national economic indicators. The challenge for the statistical system is how to operate in this data-rich environment. Ehrlich, Haltiwanger, Jarmin, Johnson, and Shapiro focus on the opportunities for collecting item-level data at the source and constructing key indicators using measurement methods consistent with such a data infrastructure. Ubiquitous digitization of transactions allows price and quantity to be collected or aggregated simultaneously at the source. This new architecture for economic statistics creates challenges arising from the rapid change in items sold. The researchers explore some recently proposed techniques for estimating price and quantity indices in large-scale item-level data. Although those methods display tremendous promise, substantially more research is necessary before they will be ready to serve as the basis for the official economic statistics. Finally, the researchers address the implications of building national statistics from transactions for data collection and for the capabilities and organization of the statistical agencies in the 21st century.

Quantifying Productivity Growth in Health Care Using Insurance Claims and Administrative Data
Abe Dunn, Bureau of Economic Analysis
Dana Goldman, University of Southern California and NBER
John Romley, University of Southern California
Neeraj Sood, University of Southern California and NBER
Helen G. Levy, University of Michigan and NBER

Dunn, Goldman, Romley, and Sood assess changes in multifactor productivity (MFP) in delivering episodes of care (including that received after initial discharge from a hospital) for elderly Medicare beneficiaries with three important conditions over 2002-2014. Across the conditions, the researchers find that MFP declined during the 2000s and then stabilized. For heart attack, for example, MFP decreased by 15.9% over the study period. While heart-attack patients experienced better health outcomes over time, growth in the cost of care for these episodes dominated. The cost of hospital readmissions among heart-attack patients appears to have increased substantially.

Nowcasting the Local Economy: Using Yelp Data to Measure Economic Activity
Edward L. Glaeser, Harvard University and NBER
Hyunjin Kim, INSEAD
Michael Luca, Harvard University and NBER
Michael Cafarella, University of Michigan

Can new data sources from online platforms help to measure local economic activity? Government datasets from agencies such as the U.S. Census Bureau provide the standard measures of economic activity at the local level. However, these statistics typically appear only after multi-year lags, and the public-facing versions are aggregated to the county or ZIP code level. In contrast, crowdsourced data from online platforms such as Yelp are often contemporaneous and geographically finer than official government statistics. Glaeser, Kim, and Luca present evidence that Yelp data can complement government surveys by measuring economic activity in close to real time, at a granular level, and at almost any geographic scale. Changes in the number of businesses and restaurants reviewed on Yelp can predict changes in the number of overall establishments and restaurants in County Business Patterns (CBP). An algorithm using contemporaneous and lagged Yelp data can explain 29.2 percent of the residual variance after accounting for lagged CBP data, in a testing sample not used to generate the algorithm. The algorithm is more accurate for denser, wealthier, and more educated ZIP codes.


In addition to the conference paper, the research was distributed as NBER Working Paper w24010, which may be a more recent version.

Improving the Accuracy of Economic Measurement with Multiple Data Sources: The Case of Payroll Employment Data
Tomaz Cajner, Federal Reserve Board
Leland D. Crane, Federal Reserve Board
Ryan Decker, Federal Reserve Board
Adrian Hamins-Puertolas, Federal Reserve Board
Christopher Kurz, Federal Reserve Board

Cajner, Crane, Decker, Hamins-Puertolas, and Kurz combine the information from two sources of U.S. payroll employment to increase the accuracy of real-time measurement of the labor market. The two data sources are the CES payroll employment series and an employment series based on microdata from the payroll processing firm ADP. The two time series are derived from roughly equally-sized and mostly nonoverlapping samples. The researchers argue that combining CES and ADP data series reduces the measurement error inherent in both data sources. In particular, they infer “true” unobserved payroll employment growth using a state-space model and find that the optimal predictor of the unobserved state puts approximately equal weight on the CES and ADP series. The researchers show that the estimated state helps forecast future values of CES, even controlling for lagged values of CES and a state estimate using CES information only. In addition, the researchers present the results of an exercise that benchmarks the data series to an employment census, the QCEW.
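The intuition behind combining two noisy measures of the same quantity can be shown with a static simplification. This sketch is not the researchers' state-space model: it assumes each series equals true growth plus an independent, mean-zero error with a known variance, and the function name `combine_series` and all numbers are hypothetical. Under those assumptions the minimum-variance linear combination weights each series by the other's error variance, so equal variances give equal weights.

```python
def combine_series(ces_growth, adp_growth, ces_noise_var, adp_noise_var):
    """Combine two noisy growth series with inverse-variance weights.

    Assumes (for illustration only) independent, mean-zero measurement
    errors with known variances ces_noise_var and adp_noise_var.
    Returns (weight on CES, combined series, variance of combined error).
    """
    # Inverse-variance weighting: the noisier series gets the smaller weight.
    w_ces = adp_noise_var / (ces_noise_var + adp_noise_var)
    w_adp = 1.0 - w_ces

    combined = [w_ces * c + w_adp * a for c, a in zip(ces_growth, adp_growth)]

    # Error variance of the combination is below either input variance.
    combined_var = (ces_noise_var * adp_noise_var) / (ces_noise_var + adp_noise_var)
    return w_ces, combined, combined_var


# Made-up monthly growth rates (percent) and equal error variances.
w, combined, var = combine_series([0.10, 0.20], [0.20, 0.10], 0.04, 0.04)
```

A state-space model generalizes this idea by also using the dynamics of the unobserved series, so the weights reflect persistence as well as relative noise.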

Securing Commercial Data for Economic Statistics
Katharine G. Abraham, University of Maryland and NBER
Margaret Levenstein, University of Michigan
Matthew D. Shapiro, University of Michigan and NBER

Big Data in the U.S. Consumer Price Index: Experiences & Plans
David Friedman, Bureau of Labor Statistics
Crystal G. Konny, Bureau of Labor Statistics
Brendan K. Williams, Bureau of Labor Statistics

The Bureau of Labor Statistics (BLS) has generally relied on its own sample surveys to collect the price and expenditure information necessary to produce the Consumer Price Index (CPI). The burgeoning availability of big data has created a proliferation of information that could lead to methodological improvements and cost savings in the CPI. The BLS has undertaken several pilot projects in an attempt to supplement and/or replace its traditional field collection of price data with alternative sources. In addition to cost reductions, these projects have demonstrated the potential to expand sample size, reduce respondent burden, obtain transaction prices more consistently, and improve price index estimation by incorporating real-time expenditure information -- a foundational component of price index theory that has not been practical until now. In the CPI program, Friedman, Konny, and Williams use the term "alternative data" to refer to any data not collected through traditional field collection procedures by CPI staff, including third-party datasets, corporate data, and data collected through web scraping or retailer APIs. The researchers review how the CPI program is adapting to work with alternative data, then discuss the three main sources of alternative data under consideration by the CPI, describing the research and other steps taken to date for each source. The researchers conclude with some words about future plans.

The Scope and Impact of Open Source Software: A Framework for Analysis and Preliminary Cost Estimates
Carol Robbins, National Science Foundation
Gizem Korkmaz, University of Virginia
José B. Santiago Calderón, University of Virginia
Claire Kelling, Pennsylvania State University
Sallie Keller, University of Virginia
Stephanie S. Shipp, University of Virginia

Open source software is everywhere, both as specialized applications nurtured by devoted user communities, and as digital infrastructure underlying platforms used by millions daily. This type of software is developed, maintained, and extended both within the private sector and outside of it, through the contributions of people from businesses, universities, government research institutions, and nonprofits, and of individuals. Robbins, Korkmaz, Calderón, Kelling, Keller, and Shipp propose and prototype a method to document the scope and impact of open source software created by these sectors, thereby extending existing measures of publicly funded research output. The researchers estimate the cost of developing packages for the open source software languages R, Python, Julia, and JavaScript, as well as reuse statistics for R packages. These reuse statistics are measures of relative value. The researchers estimate that the resource cost for developing R, Python, Julia, and JavaScript exceeds $3 billion, based on 2017 costs.

Valuing Housing Services in the Era of Big Data: A User Cost Approach Leveraging Zillow Microdata
Marina Gindelsky, Bureau of Economic Analysis
Jeremy Moulton, University of North Carolina, Chapel Hill
Scott A. Wentland, Bureau of Economic Analysis
Raven Molloy, Federal Reserve Board

Historically, residential housing services or "space rent" for owner-occupied housing has made up a substantial portion (approximately 10%) of U.S. GDP final expenditures. The current methods and imputations for this estimate employed by the Bureau of Economic Analysis (BEA) rely primarily on designed survey data from the Census Bureau. Gindelsky, Moulton, and Wentland develop new, proof-of-concept estimates valuing housing services based on a user cost approach, utilizing detailed microdata from Zillow (ZTRAX), a "big data" set that contains detailed information on hundreds of millions of market transactions. Methodologically, this kind of data allows us to incorporate actual market prices into the estimates more directly for property-level hedonic imputations, providing an example for statistical agencies to consider as they improve the national accounts by incorporating additional big data sources. Further, the researchers are able to include other property-level information into the estimates, reducing potential measurement error associated with aggregation of markets that vary extensively by region and locality. Finally, they compare the estimates to the corresponding series of BEA statistics, which are based on a rental-equivalence method. Because the user-cost approach depends more directly on the market prices of homes, the researchers find that since 2001 the initial results track aggregate home price indices more closely than the current estimates.
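A user cost calculation can be sketched in its common textbook form. This is an illustration under stated assumptions, not the specification used by the researchers: the function name `annual_user_cost`, the choice of cost components, and all rates are hypothetical. In this form the annual cost of owning a home is the market price multiplied by the sum of the interest rate, property tax rate, and depreciation and maintenance rate, less expected price appreciation.

```python
def annual_user_cost(price, interest_rate, property_tax_rate,
                     depreciation_rate, expected_appreciation):
    """Textbook-style annual user cost of owner-occupied housing.

    All rates are annual fractions of the home's market price.
    The component list is an assumption for illustration; the paper's
    exact specification is not given in the abstract.
    """
    cost_rate = (interest_rate
                 + property_tax_rate
                 + depreciation_rate
                 - expected_appreciation)
    return price * cost_rate


# Made-up inputs for a single property.
cost = annual_user_cost(
    price=300_000.0,
    interest_rate=0.04,
    property_tax_rate=0.01,
    depreciation_rate=0.02,
    expected_appreciation=0.03,
)
```

Because the home's market price enters multiplicatively, an estimate built this way moves with house prices, which is consistent with the abstract's finding that the user-cost results track home price indices more closely than rental-equivalence estimates.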

Using Public Data to Generate Industrial Classification Codes
Sudip Bhattacharjee, University of Connecticut
John Cuffe, U.S. Census Bureau
Ugochukwu Etudo, University of Connecticut
Justin D. Smith, U.S. Census Bureau
Nevada Basdeo, U.S. Census Bureau

The North American Industry Classification System (NAICS) is the system by which multiple federal and international statistical agencies assign business establishments to industries. Generating these codes can be costly, and the variety of data sources used across federal agencies leads to disagreement over the "true" classification of establishments. Bhattacharjee, Cuffe, Etudo, Smith, and Basdeo propose a change to the generation of these codes that could improve both their quality and the efficiency of the generation process. The NAICS codes serve as a basis for survey frames and published economic statistics. Currently, multiple statistical agencies and bureaus generate their own codes (e.g. Census Bureau, Bureau of Labor Statistics (BLS), and Social Security Administration), which can introduce inconsistencies across datasets housed at different agencies. For example, the business list comparison project undertaken by BLS and the Census Bureau found differences in classification even for single-unit establishments (Fairman et al., 2008; Foster et al., 2006). The researchers propose that combining publicly available data and modern machine learning techniques can improve the accuracy and timeliness of Census data products while also reducing costs. Using an initial sample of approximately 1.3 million businesses gathered from public APIs, the researchers use user reviews and website information to accurately predict two-digit NAICS codes in approximately 59% of cases. The approach may have some merit; however, substantial methodological and possible privacy issues remain before statistical agencies can implement such a system.

Measuring Export Price Movements With Administrative Trade Data
Don Fast, Bureau of Labor Statistics
Susan Fleck, Bureau of Labor Statistics

The International Price Program (IPP) surveys establishments to collect price data on merchandise trade and calculates import and export price indexes (MXPI). In an effort to expand the quantity and quality of MXPI, Fast and Fleck research the potential to augment the number of price indexes by calculating thousands, and potentially millions, of prices directly from export administrative trade transaction data maintained by the Census Bureau. This pilot research requires reconsideration of the long-held view that unit value price indexes are biased because product mix changes account for a large share of price movement. The research addresses this methodological concern and identifies others by analyzing two semi-homogeneous product categories among the 129 5-digit BEA End Use export categories. The results provide a road map of a consistent and testable approach that aligns with the concepts used in existing MXPI measures, maximizes the use of high-volume data, and mitigates the risk of unit value bias. Preliminary analysis of all 129 5-digit BEA End Use categories for exports shows potential for calculating export price indexes for 50 of the 5-digit classification categories, of which 21 are currently not published.
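The product mix concern described in the abstract can be made concrete with a small example. This is a generic illustration of unit values, not the method developed in the research: the function names and transaction figures are hypothetical. A unit value is total value divided by total quantity within a category, and the index compares unit values across periods. In the made-up data below, neither product's price changes, yet the index rises because sales shift toward the more expensive product.

```python
def unit_value(transactions):
    """Unit value of a category: total value divided by total quantity.

    transactions: list of (value, quantity) pairs for one period.
    """
    total_value = sum(value for value, _ in transactions)
    total_quantity = sum(quantity for _, quantity in transactions)
    return total_value / total_quantity


def unit_value_index(base_period, current_period):
    """Unit value index with the base period set to 100."""
    return 100.0 * unit_value(current_period) / unit_value(base_period)


# Made-up data: product A always sells at 10 per unit, product B at 20.
base = [(100.0, 10.0), (200.0, 10.0)]      # 10 units of A, 10 units of B
current = [(50.0, 5.0), (300.0, 15.0)]     # 5 units of A, 15 units of B

# The index exceeds 100 even though no individual price changed.
index = unit_value_index(base, current)
```

Restricting the calculation to narrowly defined, homogeneous groups of transactions limits this effect, which is why the choice of product categories matters for unit value indexes.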

Automating Response Evaluation for Franchising Questions on the 2017 Economic Census
Andrew L. Baer, International Monetary Fund
J. Bradford Jensen, Georgetown University and NBER
Shawn D. Klimek, U.S. Census Bureau
Lisa Singh, Georgetown University
Joseph Staudt, U.S. Census Bureau
Yifang Wei, Georgetown University

Between the 2007 and 2012 Economic Censuses (EC), the count of franchise-affiliated establishments declined by 9.8%. One reason for this decline was a reduction in resources that the Census Bureau was able to dedicate to the manual evaluation of survey responses in the franchise section of the EC. Extensive manual evaluation in 2007 resulted in many establishments whose survey forms indicated they were not franchise-affiliated being recoded as franchise-affiliated. No such evaluation could be undertaken in 2012. Baer, Jensen, Klimek, Singh, Staudt, and Wei examine the potential of using external data harvested from the web in combination with machine learning methods to automate the process of evaluating responses to the franchise section of the 2017 EC. The method allows the researchers to quickly and accurately identify and recode establishments that have been mistakenly classified as not being franchise-affiliated, increasing the unweighted number of franchise-affiliated establishments in the 2017 EC by 22%-42%.


Below is a list of conference attendees.
Rachel Anderson, Princeton University
William Beach, George Mason University
Gary Cornwall, Bureau of Economic Analysis
Holden A. Diethorn, NBER
Richard Evans, Statistics Canada
Felix Galbis-Reig, Federal Reserve Board
Brian Harris-Kojetin, National Academy of Sciences
John Kitchen, Congressional Budget Office
Gizem Korkmaz, University of Virginia
Steve Liesman, CNBC
Fanny McKellips, Bank of Canada
Norman J. Morin, Federal Reserve Board
Kelsey O'Flaherty, Federal Reserve Board
Leigh Skuse, Office for National Statistics
Ciaren Taylor, Office for National Statistics
Tjark Tjin-A-Tsoi, Statistics Netherlands
Marcel Van der Steen, Statistics Netherlands
Gal Wachtel, Palantir
Daniel Williams, Palantir
Tom Williams, Office for National Statistics
