Retrieval is more than pure chance
Cold-Starting RAG Evaluations
Without a method to evaluate the quality of your RAG application, you might as well be leaving its performance to pure chance. In this article, we'll walk you through a simple example to demonstrate how easy it is to get started.
We'll start by using Instructor to generate synthetic data. We'll then chunk and embed some Paul Graham essays using lancedb. Next, we'll showcase two useful metrics that we can use to track the performance of our retrieval before concluding with some ideas for iteratively generating harder evaluation datasets.
Most importantly, the code used in this article is available inside the /code/synthethic-evals folder. We've also included some Paul Graham essays in the same folder for easy use.
Let's start by installing the necessary libraries.
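A minimal install might look like the following. The exact package list is an assumption based on the tools mentioned above; the accompanying repository may include a requirements file with pinned versions.

pip install instructor lancedb openai pydantic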
Generating Evaluation Data
Set your OPENAI_API_KEY
Before proceeding with the rest of this tutorial, make sure to set your OPENAI_API_KEY inside your shell. You can do so with the command shown below.
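For example, in a bash-compatible shell (substitute your own key for the placeholder):

export OPENAI_API_KEY=<your key here>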
Given a text chunk, we can use Instructor to generate a corresponding question from the content of that chunk. This means that when we make a query using that question, our text chunk should ideally be the first source returned by our retrieval algorithm. We can represent this desired result using a simple pydantic BaseModel.
Defining a Data Model
from pydantic import BaseModel, Field


class QuestionAnswerPair(BaseModel):
    """
    This model represents a pair of a question generated from a text chunk, its corresponding answer,
    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer
    was derived from the question.
    """

    chain_of_thought: str = Field(
        ..., description="The reasoning process leading to the answer.", exclude=True
    )
    question: str = Field(
        ..., description="The generated question from the text chunk."
    )
    answer: str = Field(..., description="The answer to the generated question.")
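To see how this model comes into play, here is a minimal sketch of the generation step. The helper name, prompt wording, and model choice (gpt-4o-mini) are illustrative assumptions rather than the article's exact code; the key idea is that Instructor's patched client returns a validated QuestionAnswerPair directly.

import instructor
from openai import OpenAI

# Patch the OpenAI client so completions are parsed into Pydantic models.
client = instructor.from_openai(OpenAI())


def generate_question_answer_pair(chunk: str) -> QuestionAnswerPair:
    # Hypothetical helper: ask the model for a question that the chunk itself answers.
    return client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whichever model you prefer
        response_model=QuestionAnswerPair,
        messages=[
            {
                "role": "system",
                "content": "Generate a question that can be answered using only the provided text chunk, along with its answer.",
            },
            {"role": "user", "content": f"Text chunk: {chunk}"},
        ],
    )

Because chain_of_thought is declared with exclude=True, it steers the model's reasoning during generation but is dropped whenever the pair is serialized, so only the question and answer end up in the evaluation dataset.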