Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and their Consequences
Rapid improvements in computational power and coincident increases in the amount of data on individuals and firms that is publicly available or can be purchased at a nominal cost have created new challenges for protecting the privacy of survey responses and administrative data. By combining such data with information from external sources, it is increasingly possible to breach the anonymity of individuals and businesses and their characteristics. Such breaches can violate the strong privacy protections that statistical agencies and other data providers are required, or have pledged, to uphold. Statistical agencies, such as the U.S. Census Bureau, have responded by adopting new disclosure avoidance systems, including model-based synthetic data methods and methods based on the criterion known as differential privacy.
All of these methods, as well as those that have been used in the past, face an inevitable trade-off between data accuracy and protecting the privacy of individuals and firms. There has been substantial research on the extent to which these approaches protect privacy and on the accuracy of the data that results from their application. Less attention has been paid to issues such as (a) the implications of using these data for conducting applied research and for estimation and inference; (b) quantifying the actual disclosure risks of alternative disclosure avoidance methods and how individuals, firms and society value these risks; and (c) how best to align the data needs of applied research with the obligations to protect privacy.
To provide new insights on the use of privacy-protected data in empirical research, the National Bureau of Economic Research (NBER), with the support of the Alfred P. Sloan Foundation, is launching a new research project. The project, headed by Ruobin Gong (Rutgers University), V. Joseph Hotz (Duke University and NBER), and Ian Schmutte (University of Georgia), will promote interactions among researchers from computer science, data science, economics, statistics, and other fields who will present their research at two conferences, one to be held in Cambridge on May 4-5, 2023, and the other in Washington, DC, in late 2023 or early 2024.
The conferences will include research on these and other topics:
• How should researchers change the way they conduct quantitative analysis, arrive at conclusions, or make decisions and policy recommendations when using privacy-protected data?
• How do privacy-protected data affect model and hypothesis formulation and demonstration of reproducibility?
• How does the use of privacy-protected data affect the identification, estimation, and uncertainty quantification of parameters of interest?
• How are formal privacy guarantees, such as differential privacy, developed in computer science, related to other disclosure risk measures, such as those developed in the statistics literature?
• What are the disclosure risks associated with more complex data used in applied research, such as those that combine administrative and survey data, or longitudinal and panel data, and do existing disclosure avoidance approaches adequately minimize them?
• What do we know about attitudes towards privacy and the willingness of individuals or businesses to provide sensitive data?
• What do we know about the value, or cost, of improving the quality of publicly available data?
• How do the statutory guidelines of statistical agencies (such as Titles 13 and 26 of the U.S. Code for the U.S. Census Bureau) and the guidelines of other data providers map to formal privacy models? Are there changes that could improve their alignment?
• Are there alternatives to current privacy protection methods that might better align the need to protect privacy with the conduct of applied research?
The organizers welcome the submission of proposals for research papers, as well as completed papers, that provide theoretical and empirical insights on these and related topics. Proposals are welcome for the May 2023 conference, as well as for the later conference; submissions that are intended for the later conference should clearly indicate that. Papers may analyze policies that are or could be used by government agencies and other data providers, but in keeping with NBER restrictions, they may not make recommendations about the practices these entities should pursue.
Submissions from scholars who are early in their careers, with or without NBER affiliations, and from scholars who are members of groups that have been historically under-represented in economics, statistics, and computer science are especially welcome.
To be considered for inclusion on the program, papers and paper proposals must be uploaded by midnight Eastern time on Monday, September 12, 2022, to the following website:
http://conference.nber.org/confsubmit/backend/cfp?id=DPs23
Authors chosen to present papers at the first conference will be notified by early October 2022. The organizers will convene a virtual pre-conference meeting with authors in late 2022 or early 2023 to review plans for each of the selected papers and to identify important overlaps and research complementarities.
Please do not submit papers that have been accepted for publication. Research papers presented at the conference will be eligible for distribution in the NBER working paper series and for inclusion in an edited conference proceedings volume that will be published by the University of Chicago Press. Authors will receive a modest honorarium for their participation in the project, subject to timely submission of a revised manuscript for the proceedings volume.
The NBER will cover the cost of two authors per paper attending one of the research conferences; all co-authors will be invited. The conference will be live-streamed to expand dissemination of the research findings. Questions about this conference may be addressed to confer@nber.org.