Collaborative Research: Large-scale Database Construction of Firms' Organizational Form, Competition, and Industry Change
Project Outcomes Statement
The goal of this project is to build a large-scale database of both public and private companies in the US, which we call the Web Text-based Network Industry Classification (WTNIC). The database provides information on the competitive network of 950,000 firms over the last 20 years. It is constructed from firms' archived web pages, drawn from the Internet Archive Wayback Machine project, and is linked to the last 15 years of patent data from the U.S. Patent and Trademark Office.
As part of this project, we explored the efficacy of an array of computational linguistic methods for generating the most informative network of peers. The research team examined numerous methods for computing the firm-to-firm similarities that define the network, including neural network embedding technologies implemented using word2vec and doc2vec. We conducted all tests on a subsample of 4,000 firms and optimized the informativeness of the network, measured using regressions that test the network's ability to explain firm profits. The result of this analysis on the WTNIC data is that doc2vec offers the best results thus far, both in terms of informativeness and scaling. The team developed algorithms that have been run on all 950,000 WTNIC firms over all of the years of our sample, from 1997 to 2016.
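To illustrate how a peer network can be derived from document embeddings, the following is a minimal sketch, not the project's actual pipeline. It assumes each firm has already been mapped to a fixed-length vector (for example, by a doc2vec model) and links two firms as peers when the cosine similarity of their vectors meets a threshold. The function names, the toy vectors, and the threshold value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def peer_network(embeddings, threshold):
    """Link every pair of firms whose similarity is at least `threshold`.

    `embeddings` maps a firm identifier to its vector. The result maps each
    firm to a list of (peer, similarity) pairs. This brute-force version
    compares all pairs, so it is only suitable for small illustrations.
    """
    firms = sorted(embeddings)
    peers = {firm: [] for firm in firms}
    for i, a in enumerate(firms):
        for b in firms[i + 1:]:
            score = cosine(embeddings[a], embeddings[b])
            if score >= threshold:
                peers[a].append((b, score))
                peers[b].append((a, score))
    return peers


# Hypothetical three-dimensional embeddings for three firms.
toy_embeddings = {
    "firm_a": [1.0, 0.0, 0.0],
    "firm_b": [0.9, 0.1, 0.0],
    "firm_c": [0.0, 0.0, 1.0],
}
network = peer_network(toy_embeddings, threshold=0.8)
print(network["firm_a"])  # firm_b is the only peer of firm_a in this toy data
```

An all-pairs comparison grows quadratically with the number of firms, so a database on the scale described above would presumably need a more scalable similarity search than this sketch.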
In preparation for launching the final WTNIC database, the team explored techniques to link the companies in the WTNIC network database to a number of frequently used financial economics databases. This step is necessary to realize many broader impacts, including the empowerment of a much wider array of research in financial economics. Specific databases that we linked to include the USPTO patent database, the Venture Expert VC funding database, the CRSP and Compustat public firm databases, and the underlying WTNIC firm list. Over the duration of this project, the team experimented with many tools and algorithms, including Stata, RLTK, edit distances, and deduplication and matching programs written in Python. After much examination and testing for accuracy, the research team identified the Python DEDUP algorithm as the optimal tool for this purpose, as it is both accurate and scales to the required size.
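As an illustration of the edit-distance approach mentioned above, the following is a minimal sketch of matching company names across two lists. It is not the DEDUP tool the team selected, nor the project's linking code. The suffix list, the similarity threshold, and the example company names are all hypothetical.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "co", "llc", "ltd", "company"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: one minus the length-normalized edit distance."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / longest


def best_match(name, candidates, min_similarity=0.85):
    """Return the closest candidate, or None if nothing clears the threshold."""
    scored = [(similarity(name, c), c) for c in candidates]
    if not scored:
        return None
    score, match = max(scored)
    return match if score >= min_similarity else None


# Hypothetical names standing in for records from two databases.
other_database = ["ACME Widget Corp", "Apex Wireless LLC"]
print(best_match("Acme Widgets, Inc.", other_database))
print(best_match("Zebra Holdings", other_database))
```

Comparing every name against every candidate does not scale to hundreds of thousands of firms, which is consistent with the text's emphasis on choosing a tool that is both accurate and scalable.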
This project will have an impact across multiple disciplines. Market structure and product market competition are important issues in many fields; for example, empirical studies in economics, finance, accounting, and management all examine them. Such analysis is especially difficult when studies include private firms, given the lack of reliable industry data for these firms. Currently, most researchers rely on Census-produced Herfindahls, which are only available for manufacturing industries. The new data developed in this project can improve these strands of industry research and can further benefit researchers in multiple disciplines through its methodological contributions. The study of entrepreneurship in particular will be aided by the link between web page data and data from the U.S. Patent and Trademark Office, which will allow researchers to assess the impact of patents on industry structure and evolution.
The WTNIC repository has been used as a key component of the Business Open Knowledge Network (BOKN) project (NSF grant #1937153), a separate NSF-funded project associated with the NSF Convergence Accelerator program. Attesting to the value of the WTNIC repository, the most exciting broader impacts of the follow-on project draw heavily from the tools developed for the WTNIC project.
Supported by National Science Foundation grant #1561068.