How many few-shot examples should you use?

When looking at in-context learning, should you use 1-shot, 3-shot, 5-shot, or 9-shot?

Welcome back to our prompt engineering series. We’re in the middle of a run of posts about few-shot examples (otherwise known as “in-context learning”), examining ways to optimize the few-shot examples you add to your prompts. In previous posts, we’ve looked at how much prompt performance depends on which specific few-shot examples you select, and at how likely a set of few-shot examples that succeeds in one model is to perform well in another. In both of those posts, we used Libretto to generate and test a ton of few-shot prompt variants, but every prompt variation contained exactly three few-shot examples.

So today’s question is: how does the number of few-shot examples change the performance of a prompt?

There is some evidence that adding more few-shot examples to a prompt that already has some can increase performance, but we assume there’s a limit to this technique’s effectiveness; a prompt can’t just keep getting better forever as you add more examples. Stated a little more formally: for some number N, once the LLM has seen N examples in a prompt, giving it an (N+1)th example probably won’t help, because that example won’t contain much new information. But are these assumptions actually true? If they are, what is N? And is N different for different models and prompts? Knowing the answers to these questions will not only help us make our prompts perform at peak accuracy, but also help us avoid unnecessary examples and keep token usage as low as possible, which helps with both cost and latency.

To test out this question, I once again used our trusty Emoji Movie dataset from the Big Bench LLM benchmark. As a refresher, Emoji Movie is a set of 100 questions asking the LLM to name the movie described by a string of emojis. For example:

Q: What movie does this emoji describe? 🧜‍♀️❤️🤴
A: the little mermaid
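
To make the format concrete, here is a minimal sketch of how a prompt with a variable number of few-shot examples in this Q&A style might be assembled. The example list and helper function below are hypothetical illustrations, not Libretto’s actual implementation (only the first emoji example comes from this post; the rest are made up):

```python
# Hypothetical ranked list of few-shot examples (emoji clue -> movie title).
# Only the first entry comes from the post; the rest are made-up placeholders.
FEW_SHOT_EXAMPLES = [
    ("🧜‍♀️❤️🤴", "the little mermaid"),
    ("🦁👑🌍", "the lion king"),
    ("🚢🧊💔", "titanic"),
    # ... up to 9 ranked examples
]

def build_prompt(question_emoji: str, num_shots: int) -> str:
    """Assemble a prompt with the first `num_shots` examples followed by the
    real question. num_shots=0 produces a zero-shot prompt."""
    lines = []
    for emoji, title in FEW_SHOT_EXAMPLES[:num_shots]:
        lines.append(f"Q: What movie does this emoji describe? {emoji}")
        lines.append(f"A: {title}")
    lines.append(f"Q: What movie does this emoji describe? {question_emoji}")
    lines.append("A:")
    return "\n".join(lines)
```

Because each variant just takes a longer prefix of the same ranked list, a 7-shot prompt contains everything in the 6-shot prompt plus one more example, which matches the additive setup described below.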

For this test, we first looked at our previous Emoji Movie tests and chose the 9 highest-performing few-shot examples, and we then used those to make 10 variants of the prompt, each with a different number of few-shot examples, from 0 to 9. Each prompt in the series was additive to the one before it: for example, the 7-shot prompt had all of the examples from the 6-shot prompt plus one more tacked on at the end. We then used Libretto, our tool for testing, monitoring, and experimenting with LLM prompts, to run all of the test cases through each of the 10 prompt variants 6 times using GPT 3.5 Turbo, version 0613. Here’s what the results looked like:

To me, this gives a pretty clear picture: the first few-shot example is incredibly useful, bumping performance up by around 10 percentage points, and every few-shot example after that is of dubious utility. If I squint a little, I can sort of convince myself that the 8- and 9-shot variants might perform slightly better than the rest, but the 95% confidence intervals on those two results overlap with almost all of the data points from 1 to 7 few-shot examples, so in the end I’m not convinced.
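
If you want to sanity-check that kind of overlap yourself, here is one standard way to put a 95% confidence interval around an observed pass rate. The post doesn’t say exactly how Libretto computes its intervals; this sketch assumes a Wilson score interval over 600 pass/fail trials per variant (100 test cases × 6 runs), treated as independent:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - margin, center + margin

# A hypothetical variant that answered 330 of 600 runs correctly (55% accuracy):
low, high = wilson_interval(330, 600)
print(f"95% CI ≈ [{low:.1%}, {high:.1%}]")  # roughly [51.0%, 58.9%]
```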

So is the lesson to just use one few-shot example and don’t worry about using more? Not so fast. Although I feel pretty good about these results for this particular test, there are a lot of unanswered questions, including whether this finding replicates with other prompts and other models.

We’ll leave looking at other prompts to another day, but for now let’s quickly look at this same experiment in two other models: Google’s Gemini Pro and Anthropic’s Claude 2.1. Here are the results from running the same 10 prompt variants through all 100 test cases 6 times in Gemini Pro:

Again, this chart tells a pretty clear story: adding more few-shot examples helps up until example 3 or 4, and after that performance plateaus and additional examples don’t help.

This is a pretty different result from what we saw with GPT 3.5 Turbo 0613 above, where performance leveled off after the first example. It’s worth noting, though, that we are using the same few-shot examples as above, which were the ones that had previously performed best for GPT 3.5 Turbo 0613. It’s possible that if we had chosen the few-shot examples that performed best for Gemini Pro, performance would have leveled off sooner, but we won’t know until we try it, which we plan to do in a later post.

Finally, let’s look at this same set of prompts fed to Anthropic's Claude 2.1:

Well, this result is just downright confusing. Adding these few-shot examples seems not to help in Claude, and may even make performance a bit worse. We’ve seen that few-shot examples can sometimes hurt prompt performance (see our first prompt engineering post for a particularly devious example), so this isn’t completely unheard of, but it is at the very least surprising, and it definitely merits further investigation in a follow-on post.

So, what are the conclusions here about the number of few-shot examples? Honestly, I think we should be pretty circumspect, given that I tried only one kind of prompt and that I built the prompt variations from the few-shot examples that performed best for GPT 3.5.

One lesson I think we can take from this, though, is that it is pretty common for the first few few-shot examples to help a prompt and for performance to then level off as you add more; that pattern isn’t guaranteed, but it is common. The other lesson is that the number of few-shot examples you need before performance levels off probably depends heavily on the specific combination of prompt, model, and examples you are using, so empirical testing is crucial: you can’t know a priori how many few-shot examples will give you the best performance with the fewest tokens.
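
As a rough illustration of what that empirical testing could look like (a sketch, not Libretto’s actual selection logic), here is one way to sweep over shot counts and pick the smallest one whose measured accuracy is statistically indistinguishable from the best variant, reusing the hypothetical wilson_interval helper from the earlier sketch:

```python
from typing import Callable, Sequence

def smallest_sufficient_shot_count(
    accuracies: Sequence[float],   # accuracies[k] = measured accuracy with k few-shot examples
    trials_per_variant: int,       # e.g. 600 = 100 test cases x 6 runs
    ci: Callable[[int, int], tuple[float, float]],  # e.g. wilson_interval from above
) -> int:
    """Return the smallest shot count whose confidence interval overlaps the
    best-scoring variant's interval, i.e. the cheapest statistically-equivalent prompt."""
    intervals = [ci(round(a * trials_per_variant), trials_per_variant) for a in accuracies]
    best = max(range(len(accuracies)), key=lambda k: accuracies[k])
    best_low, best_high = intervals[best]
    for k, (low, high) in enumerate(intervals):
        if high >= best_low and low <= best_high:  # intervals overlap
            return k
    return best

# Hypothetical accuracies for 0-9 shots (made-up numbers, not the measurements from this post):
accs = [0.45, 0.55, 0.54, 0.56, 0.55, 0.56, 0.55, 0.56, 0.58, 0.57]
print(smallest_sufficient_shot_count(accs, 600, wilson_interval))  # -> 1
```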

Doing this kind of manual testing to find the perfect set of few-shot examples is tedious, which is why we’re building Libretto to allow you to automatically run dozens of experiments and thousands of tests, helping you automate prompt optimization with the click of a button. If you’d like a demo of Libretto or early access to the product, or if you want us to run more or different experiments, let us know at hello@getlibretto.com. Thanks for reading, and see you next time!

Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta:

Sign Up For Libretto Beta