Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tensor Trust: An online game to uncover prompt injection vulnerabilities, published by Luke Bailey on September 1, 2023 on The AI Alignment Forum.
TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities.
RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just how easy it is for a determined user to find a jailbreak that rips off that mask, or to come up with an unexpected input that lets a shoggoth tentacle poke out the side. Sometimes the mask falls off in a light breeze.
To keep the tentacles at bay, Sydney Bing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that get shared on social media. This game of Whack-a-Shoggoth has made it harder for users to elicit unintended behavior, but is intrinsically reactive and can only discover (and fix) alignment failures as quickly as users can discover and share new prompts.
Speed-running the game of Whack-a-Shoggoth
In contrast to this iterative game of Whack-a-Shoggoth, we think that alignment researchers would be better served by systematically enumerating prompts that cause unaligned behavior so that the causes can be studied and rigorously addressed. We propose to do this through an online game which we call Tensor Trust.
Tensor Trust focuses on a specific class of unaligned behavior known as prompt injection attacks. These are adversarially constructed prompts that allow an attacker to override instructions given to the model. It works like this:
Tensor Trust is bank-themed: you start out with an account that tracks the "money" you've accrued.
Accounts are defended by a prompt that should grant you access to your own account while denying access to everyone else.
Players can break into each other's accounts. Failed attempts give money to the defender, while successful attempts allow the attacker to take money from the defender.
Crafting a high-quality attack requires a good understanding of LLM vulnerabilities (in this case, vulnerabilities of gpt-3.5-turbo), while user-created defenses add unlimited variety to the game, and "access codes" ensure that the defenses are at least crackable in principle. The game is kept in motion by the most fundamental of human drives: the need to acquire imaginary internet points.
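To make the mechanics concrete, here is a minimal sketch of what one round of the game might look like under the hood. We are assuming the defense is passed as a system message and the attacker's text as a user message; the function name, prompt layout, and exact success check are illustrative guesses, not the actual Tensor Trust implementation:

```python
import openai  # openai-python < 1.0 interface, contemporary with gpt-3.5-turbo
# Assumes openai.api_key has been set.

def grants_access(defense_prompt: str, attacker_input: str) -> bool:
    """One hypothetical round: wrap the attacker's text in the defender's
    prompt and check whether the model outputs "access granted".
    The prompt layout and success check here are our assumptions."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": defense_prompt},
            {"role": "user", "content": attacker_input},
        ],
        temperature=0.0,  # deterministic-ish judging
    )
    output = response["choices"][0]["message"]["content"]
    return "access granted" in output.strip().lower()
```

An attack that enters the true access code should return True; the defender's job is to write a prompt for which nothing else does.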
After running the game for a few months, we plan to release all the submitted attacks and defenses publicly. This will be accompanied by benchmarks to measure resistance to prompt hijacking and prompt extraction, as well as an analysis of where existing models fail and succeed along these axes. In a sense, this dataset will be the consequence of speed-running the game of Whack-a-Shoggoth to find as many novel prompt injection vulnerabilities as possible so that researchers can investigate and address them.
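For a sense of what such a benchmark might measure, here is one plausible scoring function for hijacking resistance. The dataset format and the pass/fail criterion are assumptions on our part, not the released benchmark's actual definition:

```python
def hijack_resistance(model_fn, dataset) -> float:
    """Fraction of (defense, attack) pairs for which the attack FAILS to
    elicit "access granted". `model_fn` maps (defense, attack) to the
    model's text output; both names are hypothetical."""
    failures = 0
    for defense, attack in dataset:
        output = model_fn(defense, attack)
        if "access granted" not in output.strip().lower():
            failures += 1
    return failures / len(dataset)
```

A prompt-extraction benchmark would score the complementary failure mode: how often an attack gets the model to leak enough of the defense to reveal the access code.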
Failures we've seen so far
We have been running the game for a few weeks now and have already found a number of attack and defense strategies that were new and interesting to us. The design of our game incentivises users to engage in both prompt extraction (to get hints about the access code) and direct model hijacking (to make the model output "access granted"). Below we present a number of notable strategies we have seen so far and test examples of them against the following defense (pastebin in case you want to try it):
Padding the attack prompt with meaningless, repetitive text. [pastebin]
Asking the model to evaluate code. [pastebin]
Asking the model to repeat the defender's instructions. [pastebin]
Inserting new instructions. [pastebin]
Various strategies that exploit an apparent bias in the model towards behaving inductively...