NBER Working Papers and Publications

We have seen in the past decade a sharp increase in the extent to which companies use data to optimize their businesses. Variously called the "Big Data" or "Data Science" revolution, this has been characterized by massive amounts of data, including unstructured and nontraditional data like text and images, and the use of fast and flexible Machine Learning (ML) algorithms in analysis. With recent improvements in Deep Neural Networks (DNNs) and related methods, application of high-performance ML algorithms has become more automatic and robust to different data scenarios. That has led to the rapid rise of an Artificial Intelligence (AI) that works by combining many ML algorithms together – each targeting a straightforward prediction task – to solve complex problems. We will define a framew...
in The Economics of Artificial Intelligence: An Agenda, Ajay K. Agrawal, Joshua Gans, and Avi Goldfarb, editors
March 2017
Text as Data
with Matthew Gentzkow, Bryan T. Kelly: w23276
An ever increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications.
July 2016
Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech
with Matthew Gentzkow, Jesse M. Shapiro: w22423
We study trends in the partisanship of congressional speech from 1873 to 2016. We define partisanship to be the ease with which an observer could infer a congressperson’s party from a fixed amount of speech, and we estimate it using a structural choice model and methods from machine learning. Our method corrects a severe finite-sample bias that we show arises with standard estimators. The results reveal that partisanship is far greater in recent years than in the past, and that it increased sharply in the early 1990s after remaining low and relatively constant over the preceding century. Our method is applicable to the study of high-dimensional choices in many domains, and we illustrate its broader utility with an application to residential segregation.
