In part 1, we learned how to use Libretto to create a prompt template, generate test cases for our prompt, and run those test cases.
In part 2, we’ll learn how to integrate Libretto into our TypeScript app so that we can get real-world data into Libretto and use it to improve our prompts.
The Libretto TypeScript libraries record calls to LLM providers like OpenAI or Anthropic, sending the results to Libretto as Events. Inside the Libretto app, you can turn these Events into Test Cases with just a click. This makes it incredibly easy to build up a library of test cases based on real-world traffic.
We’ll continue to use our demo app, WikiDate, as a real-world example.
First, add your Libretto API key to your .env file:

```
LIBRETTO_API_KEY=XXXX
```
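The Libretto library picks this key up from the environment. If nothing in your app already loads .env files (the WikiDate demo may well handle this, so treat this as an assumption), a standard dotenv import at startup makes the key available as process.env.LIBRETTO_API_KEY:

```ts
// Only needed if nothing else loads .env; WikiDate may already do this.
import "dotenv/config";
```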
Then install the Libretto OpenAI library:

```bash
npm install @libretto/openai
```
Next, in src/util/profile.ts, swap the OpenAI import for the Libretto wrapper, which also exports objectTemplate:

```diff
- import OpenAI from "openai";
+ import { OpenAI, objectTemplate } from "@libretto/openai";
```
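The Libretto OpenAI class is intended as a drop-in replacement, so the existing client construction in profile.ts shouldn’t need to change beyond the import. As a rough sketch (assuming the wrapper accepts the same constructor options as the stock openai SDK, and that WikiDate reads its OpenAI key from the environment):

```ts
import { OpenAI } from "@libretto/openai";

// Constructed like the stock SDK client; the Libretto wrapper records each
// chat completion call and sends it to Libretto as an Event.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
```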
Next, update the prompt template, replacing the JavaScript template-literal placeholders (${variableName}) with Libretto-style placeholders ({variableName}). These placeholders stay literal in the template; the actual values are passed separately via templateParams below, which is what lets Libretto group every call under the same prompt template. Replace this:
```ts
const datingProfileV1 = {
  promptTemplate: [
    {
      role: "system",
      content:
        "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page.",
    },
    {
      role: "user",
      content: `Using the following Wikipedia content,
create a dating profile for the subject of this page, called "${name}".
${wiki_text}.`,
    },
  ],
};
```
With this:

```ts
const datingProfileV1 = {
  promptTemplate: [
    {
      role: "system",
      content:
        "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page.",
    },
    {
      role: "user",
      content: `Using the following Wikipedia content,
create a dating profile for the subject of this page, called "{name}".
{wiki_text}.`,
    },
  ],
};
```
Finally, update the chat completion call, wrapping the prompt in objectTemplate() and adding the following Libretto configuration:

```ts
openai.chat.completions.create({
  model: "gpt-4o",
  // Send the template with objectTemplate(),
  // rather than the raw prompt.
  messages: objectTemplate(datingProfileV1.promptTemplate),
  tools: tools,
  response_format: { type: "json_object" },
  libretto: {
    // Uniquely identify this prompt, matching the key created
    // in part 1.
    promptTemplateName: "wiki-dating-profile",
    // Pass along the actual values.
    templateParams: { wiki_text, name },
  },
});
```
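The wrapped client should return the same response shape as the stock OpenAI SDK, so the rest of the code doesn’t need to change. For context only (WikiDate’s actual response handling isn’t shown in this post), consuming the result might look roughly like this:

```ts
// Sketch, not WikiDate source: await the call and pull out the profile.
// `tools`, `wiki_text`, and `name` come from the surrounding WikiDate code.
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: objectTemplate(datingProfileV1.promptTemplate),
  tools: tools,
  response_format: { type: "json_object" },
  libretto: {
    promptTemplateName: "wiki-dating-profile",
    templateParams: { wiki_text, name },
  },
});

const message = completion.choices[0].message;
// Because tools are passed, the model may answer with tool_calls instead of
// content, so guard before parsing the JSON profile.
if (message.content) {
  const profile = JSON.parse(message.content);
  console.log(profile);
}
```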
That’s it!
Now we can load up the app and click the “Surprise me!” button a few times to generate a few dating profiles.
Back in Libretto, click on the “Production Calls” section to see these events.
Tip: If you’re already on the page, you can click the little refresh button in the upper right corner of the table.
You’ll see the events in the table:
We now have real production data that we can immediately use to enhance our test cases.
We can turn this production event into a test case by simply using the dropdown menu. Select “Edit & Add to Tests”.
Here you are given the opportunity to adjust the test case before saving it.
For this prompt there is no one "correct answer". As you may recall from Part 1, Libretto generated other evaluations, such as “Accurate Age Calculation”. This means that we do not need the “Correct Answer” section. Scroll down to the Correct Answer section, and click the “X” in the “Function Names” section to clear the call.
Click Add Test Case, and then go to the “Test Cases” page to see how this new test has been integrated into our current suite of tests.
We can now go back to the Playground and try all of our test cases against other models to compare how they perform.
Click on the “Playground” link on the left. Now try running against a few models.
In the “Tests” panel on the right, you’ll see the various test runs we’ve tried, with different versions and different models. Select a few of these rows by clicking the checkboxes, and then click “Compare”.
In the report view, you can compare how the prompt performs across the different LLMs. When we ran it, we saw subtle differences between GPT 3.5 Turbo and Claude 3 Haiku. Our data includes the Kazakh volleyball player “Inna Matveyeva”. For this test case:
These are just individual test cases, so we’ll need to look at the rest of the test results to see whether one model is consistently better, but the summary headers can give us some clues before we start combing through the data. In the test cases that we ran, it looks like Claude 3 Haiku averages better on both of these evals, though GPT 3.5 Turbo is generally faster:
The “current year” problem exists in both models: Claude 3 Haiku was accurate 8.3% of the time while GPT 3.5 Turbo was never accurate. This could be fixed by simply including the current year in the prompt.
Go back to the Playground. Add “The current year is 2024” to the system message, and add “based on the current year” to the line in the user prompt about “age”. Click “Save & Run Tests”.
Now re-select the other model(s) that you want to test at the top (in our case, GPT 3.5 Turbo and Claude 3 Haiku).
There are now at least two new entries in the Tests panel, for the runs against this new version of the prompt. Compare these outputs. In our case, this improved accuracy for both models by quite a bit:
Even though our prompt isn’t perfect, it’s still an improvement over where we were before. Let’s take this improved prompt back to our code so we can deploy it to our users. Go back to the Playground. Click the Clipboard button near the bottom to get the updated prompt as JSON:
Paste the JSON back into the demo code in src/prompts/dating-profile-v1.ts. The next time WikiDate generates a profile, it will use this new prompt.
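If you’re curious what that looks like in code, the result should be roughly the sketch below. This is illustrative only: the JSON that Libretto copies may be shaped a bit differently, and the user prompt’s “age” line isn’t shown in the excerpts above, so only the system-message change is spelled out here.

```ts
// Illustrative sketch of the pasted template, not the exact copied JSON.
const datingProfileV1 = {
  promptTemplate: [
    {
      role: "system",
      content:
        "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page. The current year is 2024.",
    },
    {
      role: "user",
      // The "age" line also gains "based on the current year"; it isn't
      // shown in the excerpts above, so it's omitted here.
      content: `Using the following Wikipedia content,
create a dating profile for the subject of this page, called "{name}".
{wiki_text}.`,
    },
  ],
};
```

Note that the hardcoded 2024 will need updating next year (or could be supplied as a template parameter instead), but that’s outside the scope of this walkthrough.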
Now you know how to integrate Libretto into your production code, and how to use real production data to create tests so that you can improve your prompt while testing it against real-world uses.
In part 3, we’ll use some of the evals that Libretto provides, including the new LLM-as-Judge evals that we introduced recently.
Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta: