Retrieval is more than pure chance
Cold-Starting RAG Evaluations
Without a method to evaluate the quality of your RAG application, you might as well be leaving its performance to pure chance. In this article, we'll walk you through a simple example to demonstrate how easy it is to get started.
We'll start by using Instructor to generate synthetic data. We'll then chunk and embed some Paul Graham essays using lancedb. Next, we'll showcase two useful metrics that we can use to track the performance of our retrieval before concluding with some ideas for iteratively generating harder evaluation datasets.
Most importantly, the code used in this article is available inside the /code/synthethic-evals folder. We've also included some Paul Graham essays in the same folder for easy use.
Let's start by installing the necessary libraries.
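A minimal install might look like the following. The exact package list is an assumption based on the tools mentioned above; the accompanying repository may include a requirements file with pinned versions.

pip install instructor lancedb openai pydantic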
Generating Evaluation Data
Set your OPENAI_API_KEY
Before proceeding with the rest of this tutorial, make sure to set your OPENAI_API_KEY inside your shell. You can do so with the command shown below.
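For example, in a bash-compatible shell (substitute your own key for the placeholder):

export OPENAI_API_KEY=<your key here>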
Given a text chunk, we can use Instructor to generate a corresponding question from the content of that chunk. This means that when we make a query using that question, our text chunk should ideally be the first source returned by our retrieval algorithm. We can represent this desired result using a simple pydantic BaseModel.
Defining a Data Model
from pydantic import BaseModel, Field


class QuestionAnswerPair(BaseModel):
    """
    This model represents a pair of a question generated from a text chunk, its corresponding answer,
    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer
    was derived from the question.
    """

    chain_of_thought: str = Field(
        ..., description="The reasoning process leading to the answer.", exclude=True
    )
    question: str = Field(
        ..., description="The generated question from the text chunk."
    )
    answer: str = Field(..., description="The answer to the generated question.")
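To see how this model comes into play, here is a minimal sketch of the generation step. The helper name, prompt wording, and model choice (gpt-4o-mini) are illustrative assumptions rather than the article's exact code; the key idea is that Instructor's patched client returns a validated QuestionAnswerPair directly.

import instructor
from openai import OpenAI

# Patch the OpenAI client so completions are parsed into Pydantic models.
client = instructor.from_openai(OpenAI())


def generate_question_answer_pair(chunk: str) -> QuestionAnswerPair:
    # Hypothetical helper: ask the model for a question that the chunk itself answers.
    return client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whichever model you prefer
        response_model=QuestionAnswerPair,
        messages=[
            {
                "role": "system",
                "content": "Generate a question that can be answered using only the provided text chunk, along with its answer.",
            },
            {"role": "user", "content": f"Text chunk: {chunk}"},
        ],
    )

Because chain_of_thought is declared with exclude=True, it steers the model's reasoning during generation but is dropped whenever the pair is serialized, so only the question and answer end up in the evaluation dataset.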