Artificial Writing and Automated Detection
Artificial intelligence (AI) tools are increasingly used to produce written deliverables. This has created demand for distinguishing human-written text from AI-generated text at scale, for example to verify that assignments were completed by students or that product reviews were written by actual customers. A decision-maker aiming to deploy a detector in practice must consider two key statistics: the False Negative Rate (FNR), the proportion of AI-generated text that is falsely classified as human-written, and the False Positive Rate (FPR), the proportion of human-written text that is falsely classified as AI-generated. We evaluate three leading commercial detectors (Pangram, OriginalityAI, and GPTZero) and an open-source detector (RoBERTa) on how well they minimize these error rates, using a large corpus spanning genres, document lengths, and generating models. The commercial detectors outperform the open-source one, with Pangram achieving near-zero FNR and FPR that remain robust across models, threshold rules, ultra-short passages ("stubs", ≤ 50 words), and "humanizer" tools. A decision-maker may weigh one type of error (Type I or Type II) more heavily than the other. To account for such preferences, we introduce a framework in which the decision-maker sets a policy cap, a detector-independent limit reflecting their tolerance for false positives or false negatives. We show that Pangram is the only tool that satisfies a strict cap (FPR ≤ 0.005) without sacrificing accuracy. This framework is especially relevant given the uncertainty surrounding how AI may be used at different stages of writing, where certain uses (e.g., grammar correction) may be encouraged yet difficult to separate from other uses.
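To make these definitions and the cap precise, one can write them as follows; the notation (detector score $s(x)$, decision threshold $\tau$, cap level $\alpha$) is an illustrative sketch rather than taken from the paper:

$$\mathrm{FNR}(\tau) = \Pr\!\left(s(x) < \tau \mid x \text{ is AI-generated}\right), \qquad \mathrm{FPR}(\tau) = \Pr\!\left(s(x) \ge \tau \mid x \text{ is human-written}\right),$$

and the policy cap restricts the admissible thresholds, so that the decision-maker selects

$$\tau^{\star} \in \arg\min_{\tau}\; \mathrm{FNR}(\tau) \quad \text{subject to} \quad \mathrm{FPR}(\tau) \le \alpha,$$

with $\alpha = 0.005$ corresponding to the strict cap discussed above.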