Can Author Manipulation of AI Referees Be Welfare-Improving?
This paper examines a new moral hazard in delegated decision-making: authors can embed hidden instructions, known as prompt injections, to bias AI referees in academic peer review, thereby hijacking the machine's recommendation. Because AI reviews are far cheaper than manual assessment, referees would otherwise delegate fully, undermining review quality. The paper shows that moderate detection of manipulation can, paradoxically, improve welfare. At intermediate detection probabilities, only low-quality authors manipulate, so detection becomes informative about quality and induces referees to mix between manual and AI reviews. This partially separating equilibrium preserves the value of peer review when AI quality is intermediate. When detection is too weak, all low-quality papers are manipulated and the market unravels; when detection is perfect, referees rely on AI alone and acceptance collapses. Some prompt injection must therefore be tolerated to sustain the market: it disciplines referees and generates information. The results caution against zero-tolerance enforcement and highlight how prompt injection can, counterintuitively, play a welfare-enhancing role when AI reviews are cheap to produce.
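To fix ideas, the mixing behind the partially separating equilibrium can be sketched with two indifference conditions; the notation below is illustrative and not taken from the paper. Let $d$ be the detection probability, $\pi$ the share of high-quality papers, $\sigma$ the probability a low-quality author injects, $m$ the probability the referee reviews manually at cost $c$, $\kappa$ the author's cost of injecting, and $\ell$ the referee's loss from accepting a low-quality paper, with the author's gain from acceptance normalized to one. Manual review is assumed perfectly accurate, while an undetected injection leads the AI to recommend acceptance:

\[
c \;=\; (1-\pi)\,\sigma\,(1-d)\,\ell
\;\Longrightarrow\;
\sigma^{*} \;=\; \frac{c}{(1-\pi)(1-d)\,\ell},
\qquad
\kappa \;=\; (1-m)(1-d)
\;\Longrightarrow\;
m^{*} \;=\; 1-\frac{\kappa}{1-d}.
\]

The first condition makes the referee indifferent between paying $c$ for a manual review and delegating to the AI; the second makes the low-quality author indifferent about injecting, since an injection pays off only when the referee delegates (probability $1-m$) and the injection evades detection (probability $1-d$). Both probabilities are interior only for suitable $(c,\kappa,d)$; outside that range a corner regime of full manipulation or pure AI review obtains, echoing the unravelling and collapse cases above.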