Mining Chinese Historical Sources At Scale: A Machine-Learning Approach to Qing State Capacity
Primary historical sources are often bypassed in favor of secondary sources because of the high human cost of accessing and extracting primary information, especially in lower-resource settings. We propose a supervised machine-learning approach to the natural language processing of Chinese historical data. An application to identifying different forms of social unrest in the Veritable Records of the Qing Dynasty shows that the approach dramatically cuts the cost of using primary-source data while remaining free from human bias, reproducible, and flexible enough to address particular questions. External evidence on triggers of unrest also suggests that the computer-based approach is no less successful than human researchers at identifying social unrest.