Undetectable Watermarks for Language Models
Paper 2023/763
Miranda Christ, Columbia University
Sam Gunn, University of California, Berkeley
Or Zamir, Princeton University
Abstract
Recent advances in the capabilities of large language models such as GPT-4 have spurred increasing concern about our ability to detect AI-generated text. Prior works have suggested methods of embedding watermarks in model outputs, by $\textit{noticeably}$ altering the output distribution. We ask: Is it possible to introduce a watermark without incurring $\textit{any detectable}$ change to the output distribution?
To this end we introduce a cryptographically-inspired notion of undetectable watermarks for language models. That is, watermarks can be detected only with the knowledge of a secret key; without the secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model. In particular, it is impossible for a user to observe any degradation in the quality of the text. Crucially, watermarks should remain undetectable even when the user is allowed to adaptively query the model with arbitrarily chosen prompts. We construct undetectable watermarks based on the existence of one-way functions, a standard assumption in cryptography.
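To make the idea of key-dependent detection concrete, here is a minimal toy sketch in Python. It is an illustration under strong simplifying assumptions, not the construction from the paper: the alphabet is binary, a caller-supplied list `probs` stands in for the language model's next-bit probabilities, and HMAC-SHA256 stands in for a pseudorandom function. The names `prf_uniform`, `generate`, `score`, and `detect`, as well as the detection threshold, are hypothetical. Because the pseudorandom value here depends only on the position, repeated queries would return identical text, so this sketch does not achieve undetectability under adaptive queries.

```python
import hashlib
import hmac
import math


def prf_uniform(key: bytes, index: int) -> float:
    """Map (key, index) to a pseudorandom float strictly inside (0, 1)."""
    digest = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    # Keep 52 bits so that adding 0.5 is exact in double precision.
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / (1 << 52)


def generate(key: bytes, probs: list[float]) -> list[int]:
    """Emit bit i as 1 exactly when the keyed value falls below probs[i].

    If the keyed value were truly uniform, bit i would be 1 with
    probability probs[i], matching the assumed model distribution.
    """
    return [1 if prf_uniform(key, i) < p else 0 for i, p in enumerate(probs)]


def score(key: bytes, bits: list[int]) -> float:
    """Sum a per-bit statistic that is larger when bits agree with the key."""
    total = 0.0
    for i, b in enumerate(bits):
        u = prf_uniform(key, i)
        # For text independent of the key, each term behaves like an
        # exponential variable with mean 1; watermarked text scores higher.
        total += -math.log(u) if b == 1 else -math.log(1.0 - u)
    return total


def detect(key: bytes, bits: list[int]) -> bool:
    """Flag text whose score exceeds the key-independent mean by a margin."""
    n = len(bits)
    # Hypothetical threshold: five standard deviations above the mean n.
    return score(key, bits) > n + 5.0 * math.sqrt(n)
```

Calling `detect` with the key used in `generate` should flag a sufficiently long output, while a different key should leave the score near the length of the text. How reliably this works depends on the entropy of the text: positions where `probs[i]` is close to 0 or 1 contribute almost no signal.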
BibTeX
@misc{cryptoeprint:2023/763,
      author = {Miranda Christ and Sam Gunn and Or Zamir},
      title = {Undetectable Watermarks for Language Models},
      howpublished = {Cryptology ePrint Archive, Paper 2023/763},
      year = {2023},
      note = {\url{https://eprint.iacr.org/2023/763}},
      url = {https://eprint.iacr.org/2023/763}
}