My GPT-evaluator got 1000% better with this simple trick.

I wish I had known this trick sooner.


Last summer, I interned at Adobe Research. Just a few weeks into the project, I was stuck.

It seemed like I would never get the project to work, because it was missing one very crucial detail that I wasn’t able to figure out:

An effective evaluator.

Evaluation is the process of assessing a model’s performance against some predefined criteria.

In my specific project, I needed to evaluate whether the LLM’s outputs were faithful to the context or not.

A simple approach would be to check if the ground truth answer is within the model’s output. For example, if the ground truth answer is “Apple”, this could be the condition:
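In code, that naive check might look something like this (a minimal sketch with made-up example strings, not the actual project code):

```python
# Naive check: count the output as correct if the ground-truth answer appears in it.
ground_truth = "Apple"
model_output = "The company mentioned in the passage is Apple."

is_faithful = ground_truth.lower() in model_output.lower()
print(is_faithful)  # True, regardless of whether the rest of the answer matches the context
```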

But that won’t always work: an output can contain the word “Apple” while the overall answer is still unfaithful to the context.

Typically, in situations like this, there’s one thing that most people turn to: GPT-evaluations.

I wrote another blog previously about why I think GPT-eval is overrated and often doesn’t work.

Still, I figured some form of GPT-evaluation was probably needed to evaluate my open-ended generations.

FYI: Open-ended generations are just long-form generations from LLMs, rather than single-token or single-word predictions.

Little did I know that my GPT-evaluator wouldn’t work right out of the box.

A Naive Solution (my first approach)

First, I tried to implement a GPT-evaluator which took as input the context, the question and the output from the model. Here’s a visual representation of what I inputted into GPT-4.

Effectively, the context, question, and model output are combined into a single prompt, which is passed to GPT-4 to determine whether or not the model’s output is faithful to the context.
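As a rough sketch, the setup looked something like the following, using the OpenAI Python client. The prompt wording, function name, and model string here are illustrative assumptions, not the exact prompt from the project:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def evaluate_faithfulness(context: str, question: str, model_output: str) -> str:
    # Combine the context, question, and model output into one evaluation prompt.
    prompt = (
        "You are evaluating whether an answer is faithful to the given context.\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer: {model_output}\n\n"
        "Reply with 1 if the answer is faithful to the context, and 0 otherwise."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```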

I feel naïve looking back, since I expected it to be this straightforward. When I actually tried it on about 100 example outputs, I went through some of the scores that the model gave and found that it almost always gave a wrong score.

The scoring was so bad that I decided to just manually evaluate all 100 examples every time I tried an experiment.

Eventually though, we decided to increase the number of examples to close to 1000 to ensure our accuracy scores were statistically significant. At this point, performing manual evaluations for every experiment was impossible.

I thought the project was going to come to an end, since evaluations were so hard and none of the existing libraries (like DeepEval) or methods were working.

But what I tried next saved the day. The evaluations became almost 100% accurate.

The method that worked

  1. I created a few-shot prompt with multiple examples.
  2. I provided the ground truth answer along with the prompt.

Few-shot prompting

Few-shot prompting goes by the name “In-Context Learning” (ICL) in some papers. Here’s a self-explanatory definition of the term:

Few-shot prompting is a technique that places a small number of examples within the prompt to guide the LLM to perform a certain task.

I’ll give you a simple example.

Let’s say we want GPT to output 1 if the given sentence is “happy” and 0 if the sentence is “sad”. This is a valid few-shot prompt that could be inputted into GPT-4 to achieve that:
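Here’s a rough reconstruction of what such a prompt might look like (the wording and examples are mine, purely for illustration):

```python
few_shot_prompt = """Classify each sentence as 1 (happy) or 0 (sad).

Sentence: I just got accepted into my dream school!
Label: 1

Sentence: My flight got cancelled and I missed the wedding.
Label: 0

Sentence: We spent the whole afternoon laughing at the beach.
Label: 1

Sentence: The sun is finally out after a week of rain.
Label:"""
```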

In my specific case, I gave about 10 context-question-output triplets along with the expected evaluation score (0 or 1) for each of them.

After creating a few-shot prompt, the model started doing better. I was going to stop there, but then I realised I missed something really obvious.

Providing the ground truth answer

The dataset I was using also provided the ground truth answer; in the example earlier, that would’ve been “Apple”. So I added the ground truth answer to the evaluation prompt, alongside the context, question, and model output.
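Concretely, that just means one more field in the prompt. Extending the earlier sketch (again, the field names and wording are assumptions for illustration, not the exact prompt I used):

```python
def build_eval_prompt(context, question, ground_truth, model_output, few_shot_examples):
    # few_shot_examples: a pre-formatted string of ~10 labelled
    # context/question/answer examples with their 0/1 faithfulness scores.
    return (
        "You are evaluating whether an answer is faithful to the given context.\n\n"
        f"{few_shot_examples}\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model answer: {model_output}\n\n"
        "Reply with 1 if the model answer is faithful to the context, and 0 otherwise."
    )
```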

Somehow, after doing this, essentially all of my output scores ended up aligning with what I expected.

While this may seem trivial, it isn’t — some unfaithful examples had the ground truth answer word somewhere in the output, but GPT was still able to deduce that the wording wasn’t faithful.

Concluding advice

Here are some things I learnt from this exploration:

  • GPT-evaluations have limitations. They certainly don’t always work, and it might take some experimentation to find out a method that works for your specific use case.
  • Few-shot prompting helps a lot. Providing multiple, diverse examples can significantly improve the accuracy of GPT’s final scores.
  • Provide as much information as possible. The more information you provide, the easier it is for GPT to evaluate. In our case, we happened to have the ground truth answer, which made GPT’s job much easier.

GPT tends to perform poorly in situations where the question is ambiguous. For example, asking it to rate a paragraph’s fluency on a scale of 1–10 is hard. It doesn’t know how to consider multiple factors at once and give a numerical rating at the end that’s consistent across the examples it sees.

Overall, it seems helpful to think of GPT as a human being. Consider what kind of information it might find useful that would help it figure out the score as fast as possible.

Acknowledgements

All the diagrams were made by me on Canva.

Follow me: LinkedIn | X (Twitter) | Website
