Does it matter which examples you choose for few-shot prompting?

How much does it matter which few-shot examples you choose to put in your prompt? We dig into the details!

It’s not said enough in the LLM world, but prompt engineering is actually quite hard. And if you are a software engineer like me, used to simply telling a computer what to do, prompt engineering can be hard in ways that are frustrating, odd, and downright perplexing. In the hopes that we can help others avoid some of the frustrations we’ve had, we’re writing down stories of prompt engineering problems we’ve run into while building Libretto. Libretto is a tool for evaluating, optimizing, and monitoring LLM prompts; if you are running into these problems, get in touch with us for early access!

Today’s prompt engineering topic is: how important is the selection of few-shot examples?

To get everyone up to speed, few-shot learning (sometimes called “in-context learning”) is a pretty fundamental concept in the large language model world, and the basic idea is this: when you give an LLM a few examples of what good outputs look like directly in the prompt, the LLM’s accuracy can dramatically improve. Even one example can prime the LLM to better understand the task at hand and give better results.
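To make that concrete, here is a minimal sketch of the difference between a zero-shot and a few-shot prompt, written as chat messages in the style of the OpenAI chat API. The sentiment-classification task and the examples here are purely illustrative:

```python
# Zero-shot: the model gets only the instruction and the input.
zero_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the review as positive or negative."},
    {"role": "user", "content": "Review: The plot dragged and the acting was flat.\nSentiment:"},
]

# Few-shot: the same prompt, with a couple of worked examples placed first.
# The examples prime the model on both the task and the expected output format.
few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the review as positive or negative."},
    {"role": "user", "content": "Review: I laughed the whole way through.\nSentiment:"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Two hours of my life I will never get back.\nSentiment:"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: The plot dragged and the acting was flat.\nSentiment:"},
]
```

The only difference between the two prompts is the worked examples, and as we'll see below, exactly which examples you pick can move accuracy by a lot.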

Some research has suggested that few-shot example selection is pretty important, but I wanted to try this out myself to get a sense for how crucial it is. Manually trying a bunch of different few-shot examples would be incredibly tedious. Luckily, we are building Libretto, which can run thousands of tests at the drop of a hat, and when I put it to work on this question, what I found surprised me.

To test this idea out, I used Libretto’s new Experiments feature. With Libretto Experiments, you can take a prompt and a test set and ask Libretto to try out different prompt engineering techniques. The first experiment type we’ve built is one that tests injecting different sets of few-shot examples into your prompt. To run this particular experiment, I took a prompt and test set from Big Bench, which is an excellent open source LLM benchmark that includes hundreds of diverse tasks, each with dozens to thousands of test cases. I focused on one particular task, called Emoji Movie, which asks the LLM to guess what movie is being represented by a particular string of emojis. Test cases look something like this:

Q: What movie does this emoji describe? 🤢👹🐴🐱
A: shrek
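Conceptually, each prompt variant in the experiment is just this question template with a different set of example Q&A pairs injected ahead of the real question. Here's a rough sketch of that assembly, assuming a simplified {question, answer} format for the examples; this is not Big Bench's or Libretto's actual schema:

```python
# Illustrative few-shot examples in a simplified format.
few_shot_examples = [
    {"question": "What movie does this emoji describe? 👸👠🕛", "answer": "cinderella"},
    {"question": "What movie does this emoji describe? 🤡🔫", "answer": "joker"},
]

def build_prompt(examples, question):
    """Inject example Q/A pairs ahead of the question we actually want answered."""
    lines = []
    for ex in examples:
        lines.append(f"Q: {ex['question']}")
        lines.append(f"A: {ex['answer']}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

print(build_prompt(few_shot_examples, "What movie does this emoji describe? 🤢👹🐴🐱"))
```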

I loaded all 100 Emoji Movie test cases into Libretto and pressed our shiny new “Run Experiment” button. This created 34 different variants of the prompt, each with a different set of few-shot examples, and then tested each variant multiple times against each of the 100 test cases with GPT-3.5 Turbo, measuring whether or not the LLM got the right answer. At baseline (with no few-shot examples), the prompt got 57.0% of the test cases right. With few-shot examples, the variants scored anywhere from 51.8% to 71.0%, depending on which examples were injected into the prompt. Two things stood out to me at once:

  1. A 19.2 percentage point difference in accuracy depending on which few-shot examples are used is pretty surprising and definitely meaningful.
  2. One of the few-shot variations actually did worse than the baseline prompt with no examples at all. How could that be?
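Before digging into those observations, a quick note on mechanics: the scoring in an experiment like this is conceptually just exact-match accuracy. Here's a rough sketch of that kind of evaluation loop using the OpenAI Python client; it is not Libretto's actual implementation, and the variants and test cases are illustrative stand-ins:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stand-ins: each variant is a small set of few-shot (emoji, answer)
# pairs, and each test case is an emoji string with a known answer.
variants = {
    "one_word_answers": [("🤡🔫", "joker"), ("👸👠🕛", "cinderella")],
    "multi_word_answers": [("🐻🍯", "winnie the pooh"), ("👰‍♀️🗡👊", "kill bill")],
}
test_cases = [("🤢👹🐴🐱", "shrek"), ("👨‍🚀⏱🌎👩‍🔬", "interstellar")]

def build_prompt(examples, emoji):
    lines = []
    for ex_emoji, ex_answer in examples:
        lines += [f"Q: What movie does this emoji describe? {ex_emoji}", f"A: {ex_answer}"]
    lines += [f"Q: What movie does this emoji describe? {emoji}", "A:"]
    return "\n".join(lines)

for name, examples in variants.items():
    correct = 0
    for emoji, expected in test_cases:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": build_prompt(examples, emoji)}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().lower()
        correct += int(answer == expected)  # strict exact-match scoring
    print(f"{name}: {correct}/{len(test_cases)} correct")
```

With strict exact matching, an answer like “winniethepooh” counts as wrong even though the model clearly knew the movie, which matters for what comes next.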

The first observation confirmed what research papers had been telling us about the importance of selecting good examples, but the second observation was pretty baffling. I decided to dig in to see what was going on. To do so, I opened up our experiment comparison page in Libretto, which shows the results from all 100 test cases for all 34 variations of the prompt that were created in the experiment (nothing like pulling up 3,500 test results in a single webpage!). I quickly found the underperforming prompt and scanned through its answers, focusing on the ones it got wrong to get a qualitative idea of what was going awry. I noticed two patterns. First, there were several cases where the LLM clearly knew the right answer but decided to strip the spaces out of it for some reason. For example, its answer for the emojis “🐻🍯” was “winniethepooh”.

The other pattern I saw was an unusual number of instances where this low-scoring prompt guessed the wrong movie when almost every other prompt variation got it right. For example, for “👰‍♀️🗡👊”, which is supposed to be “kill bill”, this low-scoring prompt was one of only three out of 34 to get it wrong, guessing “bridesmaids”.

I was perplexed why this particular set of few-shot examples was performing so poorly, so I looked closely at the few-shot examples to figure out what was going on. Here are the few-shot examples the low-scoring prompt was using. See if you can figure out what’s happening:

Q: What movie does this emoji describe? 👸👠🕛
A: cinderella
Q: What movie does this emoji describe? 👨‍🚀⏱🌎👩‍🔬
A: interstellar
Q: What movie does this emoji describe? 🤡🔫
A: joker

Did you figure it out? It took me a second, but I realized the problem was that all the answers in the few-shot examples were one word long. That explained why the LLM stripped spaces from some of its answers: it was trying to match the one-word pattern it saw in the few-shot examples. It even explained why this variation of the prompt sometimes gave offbeat and uncommon responses. I looked back at those test cases, and in every case the correct answer was multi-word, while the incorrect answer this low-scoring variation produced was a single word. I scanned the other 33 variations and found that just one other variation had exclusively one-word few-shot examples… and it turned out to be the third worst-scoring variation in the experiment. Clearly, this was a problem.
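One way to catch this class of problem before running a full experiment is a quick sanity check on the candidate few-shot set. Here's an illustrative helper, a sketch rather than anything from Libretto, that flags sets where every answer is a single word:

```python
def all_answers_single_word(few_shot_examples):
    """Return True if every example answer is exactly one word long."""
    return all(len(ex["answer"].split()) == 1 for ex in few_shot_examples)

# The low-scoring set from above: every answer happens to be a single word.
low_scoring_set = [
    {"question": "👸👠🕛", "answer": "cinderella"},
    {"question": "👨‍🚀⏱🌎👩‍🔬", "answer": "interstellar"},
    {"question": "🤡🔫", "answer": "joker"},
]

if all_answers_single_word(low_scoring_set):
    print("Warning: all example answers are one word; the model may learn that pattern.")
```

Of course, answer length is only one dimension the model can latch onto, which is why measuring each candidate set against a real test set still matters.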

So, what conclusions should we draw from this experiment? 

First, it’s reasonable to believe that few-shot example selection really does matter for accuracy. Even among the prompt variations that had multi-word examples, there was a 12 percentage point difference in performance between the best and worst performing variations. That’s significant. 

Second, and this is something that AI practitioners have known for decades: you never know exactly what a computer is learning from your examples. You have to carefully test changes to your prompts and try out many different variations to find the best one.
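If you want to experiment along these lines yourself, generating candidate few-shot sets is the easy half; you can simply sample them from a pool of labeled examples, as in the sketch below (the pool and set size here are illustrative). The hard half is evaluating every candidate against a real test set, which is exactly the tedium described next.

```python
import random

# Illustrative pool of labeled (emoji, answer) examples; in practice this would
# come from your own labeled data, kept separate from the test set.
example_pool = [
    ("👸👠🕛", "cinderella"),
    ("🤡🔫", "joker"),
    ("🐻🍯", "winnie the pooh"),
    ("👰‍♀️🗡👊", "kill bill"),
    ("👨‍🚀⏱🌎👩‍🔬", "interstellar"),
    ("🤢👹🐴🐱", "shrek"),
]

random.seed(0)  # keep the sampled variants reproducible

# Draw several candidate few-shot sets of three examples each.
candidate_sets = [random.sample(example_pool, k=3) for _ in range(5)]

for i, candidate in enumerate(candidate_sets):
    print(f"variant {i}: {[answer for _, answer in candidate]}")
```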

Manually trying different sets of few-shot examples is enormously tedious, but that’s why we’re building Libretto. We want to make optimization of your prompts as simple as a button press so you can let us do the hard parts of prompt engineering. If you’d like a demo of Libretto or early access to the product, or if you want us to run more or different experiments, sign up for our beta. Thanks for reading!

Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab, or go ahead and sign up for Libretto's beta.
