Do LLMs want us to say the magic word?

One of the earliest and most fundamental lessons in social interactions we give children is that they have to say “please”. I still remember when I was little and would ask (or demand) something, and my parents would stop me until I said “the magic word”; I myself have performed this little lesson with my nieces and my friends’ children, and it undoubtedly happens hundreds of thousands of times a day.

If you hang out with folks who write and think about prompt engineering, you’ll hear a lot about magic words as well. Sometimes it feels like every week there’s a new trick: a simple phrase you can add to the end of a prompt to make it more accurate. A paper from September 2023 found that adding “Take a deep breath, and let’s work through this step-by-step” helped a particular prompt perform better. A Twitter user claimed that offering to tip GPT-4 makes it give longer answers. And the 2022 paper that arguably kicked off the idea of magic words showed improved performance just from adding “Let's think step by step”.

But if you want to operationalize these insights, you first have to test whether they’re actually true. Here at Libretto, we’re working to turn prompt tinkering into prompt engineering, so today we’re going to examine the question: does adding magic words to the end of a prompt really help it perform better?

To test this proposition, we turned to Libretto’s new Experiments feature. Experiments is a testing platform that lets you automatically create multiple variations of your prompt and run all of them against your test inputs to see whether any of the variations provides a discernible lift in accuracy. For this experiment, I started by testing a subset of the MultiEmo task from Google’s excellent Big Bench LLM test suite. MultiEmo is a sentiment analysis task in which the LLM is presented with a product review and has to classify it as positive, negative, neutral, or ambivalent.

For this experiment, I took 198 labeled test cases of hotel reviews in English and fed them into Libretto. First, I spent a few minutes manually iterating on the prompt in Libretto with GPT-3.5 Turbo to fix some simple issues, primarily that GPT-3.5 Turbo didn’t initially understand that the task required it to answer entirely in lowercase. Once I’d done that, I had a rough, simple prompt that correctly classified about 75% of the test cases. It read:

“The user is going to read out to you a review of a hotel, and you need to decide whether the overall sentiment of the review is positive, neutral, negative, or ambivalent. Just answer with one of the four words IN ALL LOWERCASE: positive, neutral, negative, ambivalent.”
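Libretto runs the evaluation for you, but to make concrete what “correctly classifying” means here, below is a minimal sketch of an exact-match scoring loop, assuming the OpenAI Python client and a list of (review, label) test pairs. None of this is Libretto’s actual implementation.

```python
# A minimal sketch of an exact-match evaluation loop for the sentiment prompt,
# assuming an OpenAI-style chat API. Illustrative only, not Libretto's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SENTIMENT_PROMPT = (
    "The user is going to read out to you a review of a hotel, and you need to "
    "decide whether the overall sentiment of the review is positive, neutral, "
    "negative, or ambivalent. Just answer with one of the four words IN ALL "
    "LOWERCASE: positive, neutral, negative, ambivalent."
)

def classify(system_prompt: str, review: str) -> str:
    """Ask GPT-3.5 Turbo to classify one review and return its raw reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": review},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def exact_match_accuracy(system_prompt: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the reply equals the label exactly."""
    hits = sum(
        1 for review, label in test_cases if classify(system_prompt, review) == label
    )
    return hits / len(test_cases)
```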

While this is certainly not the most optimized prompt I could come up with, it was a good starting point for an Experiment about magic words. We recently created a Static Magic Words Experiment feature here at Libretto, and I was eager to give it a spin. The Static Magic Words Experiment creates 18 variations of your prompt, adding phrases to the end of your system message like “You’d better be sure” or the aforementioned “Take a deep breath and work on this problem step-by-step”. It then tests each of the 18 variations against the entire test set 10 times and reports on the results.
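Conceptually, the experiment is simple: append each candidate phrase to the system message, then score every variant over and over. A rough sketch of that loop is below; the phrase list is abbreviated and partly guessed, since Libretto ships its own set of 18, and score_fn can be something like the exact-match helper sketched above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Illustrative, abbreviated list; Libretto's Static Magic Words Experiment
# uses its own set of 18 phrases.
MAGIC_WORDS = [
    "You'd better be sure.",
    "Take a deep breath and work on this problem step-by-step.",
    # ...remaining phrases omitted
]

def run_static_magic_words(
    base_prompt: str,
    test_cases: Sequence[Tuple[str, str]],
    score_fn: Callable[[str, Sequence[Tuple[str, str]]], float],
    runs: int = 10,
) -> Dict[str, List[float]]:
    """Score the baseline and each magic-word variant `runs` times each.

    `score_fn(system_prompt, test_cases)` can be any accuracy function, such
    as the exact-match loop sketched earlier.
    """
    variants = {"baseline": base_prompt}
    for i, phrase in enumerate(MAGIC_WORDS, start=1):
        variants[f"variation {i}"] = f"{base_prompt} {phrase}"

    return {
        name: [score_fn(prompt, test_cases) for _ in range(runs)]
        for name, prompt in variants.items()
    }
```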

For the MultiEmo English hotel review task, GPT-3.5 Turbo didn’t seem to care much for any of the magic words. Here are the results, with the baseline unaltered prompt at the far left and the 18 Magic Word variants after it, charted with percent accuracy on an exact-match string evaluation and 95% confidence intervals:

There’s not a lot to say about this chart, other than that there doesn’t seem to be any clear effect of adding the various magic words to the prompt. Stepping back, I don’t find this entirely surprising, as sentiment analysis is not a super cognitively demanding task, so giving the LLM encouragement to be careful or thoughtful might not help much. But this did raise the question: what about other types of prompts?
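Those error bars are 95% confidence intervals on exact-match accuracy, which Libretto computes for you. If you wanted to sanity-check a variant against the baseline by hand, a Wilson interval plus a two-proportion z-test is a reasonable back-of-the-envelope stand-in; the counts below are invented for illustration, and since the 10 runs repeat the same 198 test cases, the trials aren’t fully independent, so treat the result as rough.

```python
# Rough significance check for one variant vs. the baseline.
# Counts are invented for illustration; this is not Libretto's internal math.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

TRIALS = 198 * 10            # 198 test cases x 10 runs per variant
baseline_correct = 1485      # hypothetical exact-match successes (~75%)
variant_correct = 1510       # hypothetical count for one magic-word variant

# 95% Wilson confidence interval for each accuracy
base_ci = proportion_confint(baseline_correct, TRIALS, alpha=0.05, method="wilson")
variant_ci = proportion_confint(variant_correct, TRIALS, alpha=0.05, method="wilson")

# Two-proportion z-test: is the variant's accuracy different from the baseline's?
z_stat, p_value = proportions_ztest(
    count=[variant_correct, baseline_correct],
    nobs=[TRIALS, TRIALS],
)

print(f"baseline 95% CI: {base_ci}")
print(f"variant  95% CI: {variant_ci}")
print(f"two-proportion z-test p-value: {p_value:.4f}")
```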

Continuing the exploration, I turned to the Object Counting test set in Big Bench. In this task, the prompt tells the LLM about a list of objects in text and then asks how many objects, or objects of a particular type, were just named. An example test looks like this:

Q: I have a cow, a garlic, a stalk of celery, a cabbage, an onion, a lettuce head, a snail, a carrot, a potato, three heads of broccoli, a yam, and a cauliflower. How many vegetables do I have?
A: 12

I followed the same basic experimentation pattern with 160 test cases from the Object Counting dataset, starting with a small amount of preliminary iteration on the prompt in Libretto to get it to return numerals fairly dependably. This gave me the prompt:

"The user is going to tell you about a list of objects they have and then ask you how many of a particular object they have. Answer with numerals only, like "6" or "52"."

I then ran this prompt through our Static Magic Words Experiment, running each of the 18 magic word variants against 160 test cases 10 times each. The results were a bit more interesting (in the chart below, blue is the baseline, variations that outperform the baseline at 95% confidence are green, and variations that underperform it at 95% confidence are red):

First off, I noticed that five of the 18 variations achieved significance at the 95% confidence level; two were better than the baseline, and three were worse. This suggests that at least some of the time, magic words can have a measurable effect.

Second, one of the variations (Variation 1: “Take a deep breath and work through the problem step-by-step”) was wildly worse than every other prompt, scoring about 19 percentage points below the baseline. Whenever I find a result that’s much more dramatic than I expected, I want to dig in a bit and try to figure out what’s going on. Looking at the individual test results in Libretto, it seems that Variation 1’s magic words counteracted the prompt instruction to answer only in numerals. This snippet of the test viewer in Libretto paints the picture:

This is a small portion of a screenshot of Experiment test results from Libretto showing three individual test cases. The first column is the test case question, the second column is the expected answer, the third column is how the baseline prompt answered, and the fourth column is how Variation 1 answered. In terms of content accuracy, Variation 1 actually got all three questions right, while the baseline only got two. But because Variation 1 answered in sentences, the exact-match evaluation scored it as missing two of the test cases.

Scanning through the results, it turns out that about 35 of Variation 1’s failures were tests where Variation 1 got the correct numerical answer but failed because the LLM answered with a sentence, like “You have 5 animals.”, rather than just the number, like “5”. If those 35 failures had been counted as successes, Variation 1 would have performed basically the same as the rest of the variations. Furthermore, the other two prompts that were significantly worse, Variation 6 and Variation 11, also exhibited this tendency to not answer with numbers, and both of them would have been average prompts in terms of accuracy if they hadn’t answered in sentences.
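A quick way to confirm this failure mode (right number, wrong format) is to re-score the failing outputs with a more lenient checker that pulls the number out of a full sentence. The helper below is my own sketch, not something from Libretto:

```python
import re

# Word-to-digit map for small counts, since the model sometimes spells them out.
_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def lenient_match(model_answer: str, expected: str) -> bool:
    """True if the reply contains the expected number, even inside a sentence.

    Exact match fails "You have 5 animals." against "5"; this scorer extracts
    the first numeral (or number word) and compares values instead.
    """
    text = model_answer.strip().lower()
    digits = re.search(r"\d+", text)
    if digits:
        return int(digits.group()) == int(expected)
    for word, value in _WORDS.items():
        if re.search(rf"\b{word}\b", text):
            return value == int(expected)
    return False

# Example: exact match fails here, but the lenient scorer counts it as correct.
assert lenient_match("You have 5 animals.", "5")
```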

This test made me wonder what would happen if I optimized the baseline prompt a bit more before sending it into the Static Magic Words Experiment. I used Libretto’s manual testing to try out a few new versions of the prompt, specifically using zero-shot Chain of Thought to make a prompt that performed about 20 points better on the test set. The prompt I came up with was: 

"The user is going to tell you about a list of objects they have and then ask you how many of a particular object they have. Feel free to think aloud on the problem, working through the logic of the question step-by-step, and put your final answer inside [[double brackets]]. Answer with numbers only, like [[6]] or [[52]]."

When I ran this prompt through the Static Magic Words Experiment, as with the MultiEmo hotel experiment, there were no particularly significant differences in performance:

So, what conclusions can we draw here? 

First, whether or not magic words affect a prompt’s performance is highly variable. Changing a prompt just a bit can radically alter the effect of magic words.

Second, to the extent that magic words do affect the performance of your prompt, it’s important to scratch beneath the surface, look for patterns in the individual test cases, and figure out why the prompt succeeds or fails more often with a particular magic word. It’s possible that (as with the second chart in this article) the magic words are counteracting one of your prompt’s instructions.

Third, magic words are not as much of a slam dunk as you may think if you spend a lot of time on prompt engineering Twitter. They certainly might help with some prompts and with some models, but it’s an empirical question that needs to be tested empirically rather than answered universally. 

These results led us to iterate and create a Dynamic Magic Words Experiment, which asks an LLM to come up with better ideas for magic words that might improve a prompt. We’ve seen more promising results with that experiment, but we’ll leave that for a later post.

If you’re working on prompt engineering and are interested in any of these questions, we’re building Libretto to make this kind of experimentation fast, easy, and fun. We want to make optimizing your prompts as simple as a button press, so you can let us do the hard parts of prompt engineering and rest easy knowing that you’re getting the most out of LLMs in your code. If you’d like a demo of Libretto or early access to the product, or if you want us to run more or different experiments, let us know at hello@getlibretto.com. Thanks for reading!

Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta:

Sign Up For Libretto Beta